You can get more detailed information about a command by typing it in the Console with a question mark in front. For example, ?prop.test
.
In general, each command produces an object or a list of objects related to the output for future access. You can save the output of a command by giving it a name. See the first examples for prop.test
under the heading Inference for Proportions.
Sex | Height | Weight |
---|---|---|
M | 66 | 153 |
F | 62 | 124 |
M | 73 | 210 |
M | 69 | 160 |
A dataset in R is call a data frame. We create a data frame df
for the above table using the following command.
df = data.frame(Sex = c("M","F","M","M"),
Height = c(66,62,73,69),
Weight = c(153,124,210,160))
Show df
.
df
## Sex Height Weight
## 1 M 66 153
## 2 F 62 124
## 3 M 73 210
## 4 M 69 160
read.csv(url(“URL address”))
The dataset ICU
below is created by reading the a .csv file when it has an URL. The code book for the dataset can be found HERE
ICU = read.csv(url("http://www.lock5stat.com/datasets/ICUAdmissions.csv"))
head(ICU)
## ID Status Age Sex Race Service Cancer Renal Infection CPR Systolic
## 1 8 0 27 1 1 0 0 0 1 0 142
## 2 12 0 59 0 1 0 0 0 0 0 112
## 3 14 0 77 0 1 1 0 0 0 0 100
## 4 28 0 54 0 1 0 0 0 1 0 142
## 5 32 0 87 1 1 1 0 0 1 0 110
## 6 38 0 69 0 1 0 0 0 1 0 110
## HeartRate Previous Type Fracture PO2 PH PCO2 Bicarbonate Creatinine
## 1 88 0 1 0 0 0 0 0 0
## 2 80 1 1 0 0 0 0 0 0
## 3 70 0 0 0 0 0 0 0 0
## 4 103 0 1 1 0 0 0 0 0
## 5 154 1 1 0 0 0 0 0 0
## 6 132 0 1 0 1 0 0 1 0
## Consciousness
## 1 1
## 2 1
## 3 1
## 4 1
## 5 1
## 6 1
.csv
file extension stands for “comma separated values.” You can create such a file easily by creating a spreadsheet in a program like Excel or Google Sheets, and save it in .csv format.
First, locate your working directory.
getwd()
## [1] "/cloud/project"
After putting your file into the working directory, use read.csv
to read it into the dataset ICU
.
ICU = read.csv("ICUAdmissions.csv")
head(ICU)
## ID Status Age Sex Race Service Cancer Renal Infection CPR Systolic
## 1 8 0 27 1 1 0 0 0 1 0 142
## 2 12 0 59 0 1 0 0 0 0 0 112
## 3 14 0 77 0 1 1 0 0 0 0 100
## 4 28 0 54 0 1 0 0 0 1 0 142
## 5 32 0 87 1 1 1 0 0 1 0 110
## 6 38 0 69 0 1 0 0 0 1 0 110
## HeartRate Previous Type Fracture PO2 PH PCO2 Bicarbonate Creatinine
## 1 88 0 1 0 0 0 0 0 0
## 2 80 1 1 0 0 0 0 0 0
## 3 70 0 0 0 0 0 0 0 0
## 4 103 0 1 1 0 0 0 0 0
## 5 154 1 1 0 0 0 0 0 0
## 6 132 0 1 0 1 0 0 1 0
## Consciousness
## 1 1
## 2 1
## 3 1
## 4 1
## 5 1
## 6 1
In this example, we want to create the two-way table below in R, call it GenderSeat
.
Front | Middle | Back | |
---|---|---|---|
Female | 37 | 91 | 22 |
Male | 15 | 46 | 25 |
WE first create GenderSeat
as a matrix by entering the values in the two-way table by rows.
GenderSeat = matrix(c(37, 91,22, 15, 46, 25), ncol = 3, byrow = TRUE)
Then we add column and row headings to the matrix
colnames(GenderSeat) = c("Front", "Middle", "Back")
rownames(GenderSeat) = c("Female", "Male")
Finally, we convert GenderSeat
into a table.
GenderSeat = as.table(GenderSeat)
We can display the table.
GenderSeat
## Front Middle Back
## Female 37 91 22
## Male 15 46 25
table(row_var, col_var)
The code book for the dataset ICU
an be found HERE.
SexRace = table(ICU$Sex,ICU$Race) ## Summarize the variables Sex and Race in the dataset ICU as a two-way table, with Sex in rows and Race in columns.
SexRace
##
## 1 2 3
## 0 108 10 6
## 1 67 5 4
We can look at the code book to know what the codes for Sex and Race represent, and update the table.
colnames(SexRace) = c("white", "black", "other")
rownames(SexRace) = c("male", "female")
SexRace
##
## white black other
## male 108 10 6
## female 67 5 4
We wil illustrate some R commands using the dataset ICU
, the code book for which can be found HERE.
mean(ICU$HeartRate)
## [1] 98.925
sd(ICU$HeartRate)
## [1] 26.82962
median(ICU$HeartRate)
## [1] 96
summary(ICU$HeartRate)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 39.00 80.00 96.00 98.92 118.25 192.00
Other functions include range
, iqr
, max
, etc.
The formula relating two variables has the form Response~Explanatory. In this case, since we are using the categorical variable to explain the differences in the quantitative variable, the formula has the form Quantitatve~Categorical.
The following finds the mean heart rate for each group of male and female patients.
mean(HeartRate ~ Sex, data=ICU)
## 0 1
## 98.09677 100.27632
By referring to the code book, the output means that heart rates of male patients averaged about 98 beats per minute, while that of female patents averaged about 100 bpm.
The formula relating two quantitative variables has the form Response~Explanatory.
cor(HeartRate ~ Systolic, data=ICU) ## Correlation coefficient r.
## [1] -0.05658246
cor(HeartRate ~ Systolic, data=ICU)^2 ## Coefficient of determination r^2
## [1] 0.003201575
To find the regression equation, we use lm
, which stands for “linear model.”
lm(HeartRate ~ Systolic, data=ICU)
##
## Call:
## lm(formula = HeartRate ~ Systolic, data = ICU)
##
## Coefficients:
## (Intercept) Systolic
## 105.01907 -0.04607
## Another way to write the command is
## lm(ICU$HeartRate ~ ICU$Systolic)
The output means that the regression equation is
HeartRate = 105.01907-0.04607(Systolic)
We will use the following table as example. To see how to create a two-way table, see Section
GenderSeat
## Front Middle Back
## Female 37 91 22
## Male 15 46 25
Add margins to the table.
addmargins(GenderSeat)
## Front Middle Back Sum
## Female 37 91 22 150
## Male 15 46 25 86
## Sum 52 137 47 236
Table of percents:
totPercents(GenderSeat,1) ## Total percents. Second argument indicates the number of decimal places.
## Front Middle Back Total
## Female 15.7 38.6 9.3 63.6
## Male 6.4 19.5 10.6 36.4
## Total 22.0 58.1 19.9 100.0
rowPercents(GenderSeat,0)
## Front Middle Back Total Count
## Female 25 61 15 101 150
## Male 17 53 29 99 86
colPercents(GenderSeat,2)
## Front Middle Back
## Female 71.15 66.42 46.81
## Male 28.85 33.58 53.19
## Total 100.00 100.00 100.00
## Count 52.00 137.00 47.00
Mosaic plots display conditional distributions associated with a two-way table of row/column percents. Let’s use the following table as example
GenderSeat
## Front Middle Back
## Female 37 91 22
## Male 15 46 25
Below shows a table of row percents and its associated mosaic plot.
rowPercents(GenderSeat,1)
## Front Middle Back Total Count
## Female 24.7 60.7 14.7 100.1 150
## Male 17.4 53.5 29.1 100.0 86
mosaicplot(GenderSeat, main = "Conditional Distribution of Seating by Gender",
xlab = "Seating", ylab = "Gender", dir = c("h","v"))
Below shows a table of column percents and its associated mosaic plot.
colPercents(GenderSeat,2)
## Front Middle Back
## Female 71.15 66.42 46.81
## Male 28.85 33.58 53.19
## Total 100.00 100.00 100.00
## Count 52.00 137.00 47.00
mosaicplot(GenderSeat, main = "Conditional Distribution of Gender by Seating",
xlab = "Seating", ylab = "Gender", dir = c("h","v"), sort=c(2,1))
prop.test(x, n, p = NULL, alternative = c(“two.sided”, “less”, “greater”), conf.level = 0.95, data = NULL, success = NULL, …)
correct = FALSE
to match the formula you learned in class.X-squared
is the square of the z-score, which is stored as statistic
in the output list.Suppose your data has 14 successes out of 50. The following performs a two-sided HT with null value 0.3.
prop.test(x=14, n=50, p=0.3, correct = FALSE)
##
## 1-sample proportions test without continuity correction
##
## data: 14 out of 50
## X-squared = 0.095238, df = 1, p-value = 0.7576
## alternative hypothesis: true p is not equal to 0.3
## 95 percent confidence interval:
## 0.1747417 0.4166512
## sample estimates:
## p
## 0.28
Save the result of the test:
result = prop.test(x=14, n=50, p=0.3, correct = FALSE)
You can see what values are store in result
:
summary(result)
## Length Class Mode
## statistic 1 -none- numeric
## parameter 1 -none- numeric
## p.value 1 -none- numeric
## estimate 1 -none- numeric
## null.value 1 -none- numeric
## conf.int 2 -none- numeric
## alternative 1 -none- character
## method 1 -none- character
## data.name 1 -none- character
Access a stored value by using $
after object name, followed by the name of the value:
result$statistic
## X-squared
## 0.0952381
To get the absolute value of z-score:
z=sqrt(result$statistic); names(z)="|Z-score|"; z
## |Z-score|
## 0.3086067
You have to choose the sign based on whether your sample estimte is above or below the null value. In this example, since p-hat (0.28) is less than null value (0.3), the z-score is really negative (z=-0.3086067).
You can get the confidence interval without the output of hypothesis test by using $conf.int
. For CI, the null value for HT is irrelevant, so you can just leave it out.
prop.test(x=14, n=50, conf.level=0.99, correct = FALSE)$conf.int
## [1] 0.1499462 0.4616007
## attr(,"conf.level")
## [1] 0.99
The following performs a right-tail HT with the default null value of 0.5 by omitting it in the arguments of prop.test
.
prop.test(x=14, n=50, alternative="great", correct = FALSE)
##
## 1-sample proportions test without continuity correction
##
## data: 14 out of 50
## X-squared = 9.68, df = 1, p-value = 0.9991
## alternative hypothesis: true p is greater than 0.5
## 95 percent confidence interval:
## 0.1889395 1.0000000
## sample estimates:
## p
## 0.28
The following tests the alternative hypothesis that p1-p2 is greater than zero, with p-hat1 = 14/50 and p-hat2 = 30/100.
prop.test(x=c(14, 30), n=c(50, 100), alternative = "greater", correct = FALSE)
##
## 2-sample test for equality of proportions without continuity
## correction
##
## data: c(14, 30) out of c(50, 100)
## X-squared = 0.064322, df = 1, p-value = 0.6001
## alternative hypothesis: greater
## 95 percent confidence interval:
## -0.1488037 1.0000000
## sample estimates:
## prop 1 prop 2
## 0.28 0.30
The code book for the data loaded below can be found HERE.
The data is loaded directly from an URL into the data frame df
. df$Sex
refers to the variable sex. In the code book, males are coded as “0”. The argument success
is used to identified which the sample poportion is studied. Thus, the example below is about proportion of males in the dataset, using null value of 50%.
prop.test(ICU$Sex, p=0.5, success = "0") ## For some reason, the argument "correction = FALSE" doesn't work when raw data is used.
##
## 1-sample proportions test with continuity correction
##
## data: ICU$Sex [with success = 0]
## X-squared = 11.045, df = 1, p-value = 0.0008893
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
## 0.5485320 0.6867787
## sample estimates:
## p
## 0.62
The following tests to see if the proportion of Black patients among males is different than that among females. See code book.
prop.test(Race ~ Sex, data = ICU, success = "2")
##
## 2-sample test for equality of proportions with continuity
## correction
##
## data: tally(Race ~ Sex)
## X-squared = 0.012236, df = 1, p-value = 0.9119
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## -0.06926372 0.09897510
## sample estimates:
## prop 1 prop 2
## 0.08064516 0.06578947
The problem with the above is that the output does not identify clearly which sample estimates is male or female. We can get around this problem by creating a table with Sex and Race and look at the conditional distribution of Race by Sex.
RaceSex = table(ICU$Race, ICU$Sex) ## Sex are in columns.
colPercents(RaceSex)
##
## 0 1
## 1 87.1 88.2
## 2 8.1 6.6
## 3 4.8 5.3
## Total 100.0 100.1
## Count 124.0 76.0
We can see that, for the sample esimate in the output of prop.test
, prop 1
with 0.0806 refers to proportion of Blacks among male (code 0) patients, matching the 8.1% in the table of column percents.
tsum.test(mean.x, s.x = NULL, n.x = NULL, mean.y = NULL, s.y = NULL, n.y = NULL, alternative = “two.sided”, mu = 0, var.equal = FALSE, conf.level = 0.95)
The following example matches the example in OLI on p186, where the sample mean is 13.5, sample sd is 3.2, and sample size is 45. The alternative hypothesis is that the population mean is greater than 12.
tsum.test(13.5, 3.2, 45, alternative = "greater", mu = 12)
## Warning in tsum.test(13.5, 3.2, 45, alternative = "greater", mu = 12):
## argument 'var.equal' ignored for one-sample test.
##
## One-sample t-Test
##
## data: Summarized x
## t = 3.1445, df = 44, p-value = 0.001489
## alternative hypothesis: true mean is greater than 12
## 95 percent confidence interval:
## 12.69848 NA
## sample estimates:
## mean of x
## 13.5
The following example matches the example in OLI on p219 having the following sample statistics.
Age | sample mean | sample sd | sample size |
---|---|---|---|
20-29 | 83.4 | 18.7 | 712 |
75+ | 78.5 | 19.0 | 1001 |
We test with the alterntive hypothesis that younger men weigh on average more than older men.
tsum.test(83.4, 18.7, 712, 78.5, 19, 1001, alternative = "greater")
##
## Welch Modified Two-Sample t-Test
##
## data: Summarized x and y
## t = 5.3092, df = 1545.9, p-value = 6.305e-08
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## 3.381025 NA
## sample estimates:
## mean of x mean of y
## 83.4 78.5
t.test(x, y = NULL, alternative = c(“two.sided”, “less”, “greater”), mu = 0, paired = FALSE, var.equal = FALSE, conf.level = 0.95, …)
t.test(formula, data, subset, na.action, …)
\(H_0:\mu=80\)
\(H_a:\mu>80\)
t.test(ICU$HeartRate, alternative="greater", mu=80)
##
## One Sample t-test
##
## data: ICU$HeartRate
## t = 9.9755, df = 199, p-value < 2.2e-16
## alternative hypothesis: true mean is greater than 80
## 95 percent confidence interval:
## 95.78989 Inf
## sample estimates:
## mean of x
## 98.925
To find a 90% confidence interval, do a two sided t-test, and you can find the CI in the output. It won’t matter what the null value is, so you can skip it, which actually uses null value of zero.
t.test(ICU$HeartRate, conf.level = 0.90)
##
## One Sample t-test
##
## data: ICU$HeartRate
## t = 52.144, df = 199, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 90 percent confidence interval:
## 95.78989 102.06011
## sample estimates:
## mean of x
## 98.925
\(H_0:\mu_{male}=\mu_{female}\)
\(H_a:\mu_{male}\ne\mu_{female}\)
t.test(HeartRate ~ Sex, data=ICU)
##
## Welch Two Sample t-test
##
## data: HeartRate by Sex
## t = -0.55481, df = 157.04, p-value = 0.5798
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -9.938963 5.579880
## sample estimates:
## mean in group 0 mean in group 1
## 98.09677 100.27632
The problem with the above is that you have no control over which sex is the first or second group. To get around that, you can explicitly choose as in the following example:
\(H_0:\mu_{female}=\mu_{male}\)
\(H_a:\mu_{female}\gt\mu_{male}\)
The code book shows that 1 represents female and 0 represents male.
t.test(ICU$HeartRate[ICU$Sex == "1"], ICU$HeartRate[ICU$Sex == "0"], alternative = "greater")
##
## Welch Two Sample t-test
##
## data: ICU$HeartRate[ICU$Sex == "1"] and ICU$HeartRate[ICU$Sex == "0"]
## t = 0.55481, df = 157.04, p-value = 0.2899
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## -4.320533 Inf
## sample estimates:
## mean of x mean of y
## 100.27632 98.09677
The data for each matched pair is in the same row.
The code book for the data loaded below can be found HERE.
chimp=read.csv(url("http://www.chabotcollege.edu/faculty/mho/MTH43/data/ChimpanzeeCollaborate.csv"))
t.test(chimp$ColYes, chimp$ColNo, alternative="greater", paired = TRUE) ## The difference is the first argument minus the second argument.
##
## Paired t-test
##
## data: chimp$ColYes and chimp$ColNo
## t = 7.3703, df = 7, p-value = 7.663e-05
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## 9.193915 Inf
## sample estimates:
## mean of the differences
## 12.375
The folowing example peforms the goodness-of-fit test on the second Learning by Doing titled Hypothesis Test about the Color Distribution for Plain M&Ms on p202 of OLI.
chisq.test(c(38,32,51,58,74,47), p=c(0.13,0.13,0.14,0.24,0.20,0.16))
##
## Chi-squared test for given probabilities
##
## data: c(38, 32, 51, 58, 74, 47)
## X-squared = 9.2203, df = 5, p-value = 0.1006
First, get your two-way table into R. We will use the table GenderSeat
that was created in the Section 1.2.
A table summary automatically performs a chi-square test
summary(GenderSeat)
## Number of cases in table: 236
## Number of factors: 2
## Test for independence of all factors:
## Chisq = 7.474, df = 2, p-value = 0.02383
However, if you want to know the related information such as the chi-square contribution of each cell, you would nee to perform a chi-square test on the table and save the result for processing.
result = chisq.test(GenderSeat)
You can see the result of chisq.test
is similar to applying summary
to a table.
result
##
## Pearson's Chi-squared test
##
## data: GenderSeat
## X-squared = 7.4739, df = 2, p-value = 0.02383
The following shows the expected counts for each cell
result$expected
## Front Middle Back
## Female 33.05085 87.07627 29.87288
## Male 18.94915 49.92373 17.12712
The following shows the standardized residuals for each cell, \(\frac{O-E}{\sqrt{E}}\).
result$residuals
## Front Middle Back
## Female 0.6869302 0.4204836 -1.4404399
## Male -0.9072125 -0.5553228 1.9023550
The following shows the chi square contribution for each cell, \(\frac{(O-E)^2}{E}\).
result$residuals^2
## Front Middle Back
## Female 0.4718731 0.1768065 2.0748672
## Male 0.8230345 0.3083834 3.6189544
In the datase ICU
, we are testing if the variables Sex and Race are independent.
result = chisq.test(ICU$Sex, ICU$Race)
## Warning in chisq.test(ICU$Sex, ICU$Race): Chi-squared approximation may be
## incorrect
result
##
## Pearson's Chi-squared test
##
## data: ICU$Sex and ICU$Race
## X-squared = 0.16169, df = 2, p-value = 0.9223
The following shows the expected counts for each cell
result$expected
## ICU$Race
## ICU$Sex 1 2 3
## 0 108.5 9.3 6.2
## 1 66.5 5.7 3.8
The following shows the standardized residuals for each cell, \(\frac{O-E}{\sqrt{E}}\).
result$residuals
## ICU$Race
## ICU$Sex 1 2 3
## 0 -0.04800154 0.22953904 -0.08032193
## 1 0.06131393 -0.29319774 0.10259784
The following shows the chi square contribution for each cell, \(\frac{(O-E)^2}{E}\).
result$residuals^2
## ICU$Race
## ICU$Sex 1 2 3
## 0 0.002304147 0.052688172 0.006451613
## 1 0.003759398 0.085964912 0.010526316