Hypothesis Testing: Birth

The lab must be uploaded on BrightSpace in pdf format
The work must contain R codes, R outputs and interpretations
Deadline for submission of work: Tuesday, February 28 at 11:59 p.m.

Goals :

Descriptive analysis of a categorical variable
Frequency table and diagrams
Confidence interval for a response variable
Hypothesis test on the proportion π
2 × 2 contingency table
Confidence interval of RR and OR
Hypothesis test on π1 − π2
Results interpretation

Application exercise

The “Birthweight” dataset comes from a survey of risk factors associated with low birth weight in infants (data collected at Baystate Medical Center in Massachusetts during 1986). Low birth weight has interested physicians for many years because infant mortality and abnormality rates are very high among low birth weight infants. A woman’s behavior during pregnancy (diet, smoking habits, etc.) can significantly alter the chances of carrying the pregnancy to term, and therefore of giving birth to a child of normal weight. A child is considered to have low birth weight if it weighs less than 2500 g.

knitr::opts_chunk$set(comment = NA)

1. Download the “Birth_weight.xls” data file directly from the Internet using the RStudio import window: Import Dataset

Install the following libraries to help in data manipulation and display of summary statistics

if (!require(dplyr)) { install.packages('dplyr') } # install dplyr if it is not already available

library(stargazer)
library(dplyr) # loading the library
library(gtsummary)
library(epiDisplay)
library(vtable)
library(wilson)
library(fastR2)
library(readxl)
library(epitools)
library(mosaic)
library(finalfit)
library(magrittr)
library(ggplot2)
library(psych)
library(corrplot)
library(VIM)
library(gridExtra)
library(car)
library(knitr)
library(gmodels)

Loading these packages masks several functions. In particular, mosaic (loaded with fastR2) masks prop.test(), binom.test() and t.test() from stats, and MASS (loaded with epiDisplay) masks select() from dplyr.

Load the dataset

Poids_naissance <- read.csv("C:\\Users\\user\\Downloads\\Poids_naissance.csv")

2. Display the first 3 lines of the file

head(Poids_naissance,3)

3. Redefine the array using the as.data.frame() command, then use the attach() command

Poids_naissance <- as.data.frame(Poids_naissance)

Attach the dataset

attach(Poids_naissance)

4. Structuring categorical variables

Structuring categorical variables refers to organizing and preparing categorical data for analysis using the R programming language. Categorical variables are those that represent qualitative or categorical data, such as gender, race, education level, or favorite color.

In RStudio, categorical variables can be structured using factors, which are a data type in R that represent categorical variables with a fixed set of levels. Factors allow for the efficient storage and analysis of categorical data, as well as the ability to perform statistical tests and visualizations specific to categorical data.

To structure a categorical variable as a factor, you can use the factor() function. This function takes a vector of categorical data as its first argument, and a set of levels for the variable as its second argument (if not specified, the levels are determined automatically from the unique values in the data).

Factor the categorical variables

Poids_naissance$RACE<-factor(Poids_naissance$RACE, levels = c(1,2,3),
                              labels = c("White", "Black", "Other"))

Poids_naissance<- Poids_naissance %>%mutate(RACE = factor(RACE))
Poids_naissance<- Poids_naissance %>%mutate(LOW = factor(LOW))
Poids_naissance<- Poids_naissance %>%mutate(SMOKE = factor(SMOKE))
Poids_naissance<- Poids_naissance %>%mutate(PTL = factor(PTL))
Poids_naissance<- Poids_naissance %>%mutate(HT = factor(HT))
Poids_naissance<- Poids_naissance %>%mutate(UI = factor(UI))
Poids_naissance<- Poids_naissance %>%mutate(FVT = factor(FVT))
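The role of the levels and labels arguments of factor() can be illustrated on a small vector. This is a minimal sketch: race_codes is a hypothetical toy vector, not data taken from the file.

```r
# Toy vector of numeric race codes (hypothetical values, not from the dataset)
race_codes <- c(1, 3, 2, 1, 1, 3)

# `levels` lists the codes present in the data, `labels` names them in the same order
race <- factor(race_codes,
               levels = c(1, 2, 3),
               labels = c("White", "Black", "Other"))

levels(race)  # the fixed set of categories
table(race)   # counts per category
```

Once a variable is a factor, functions such as table() and tbl_summary() treat it as categorical instead of summarizing the numeric codes.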

5. For the “RACE” variable, construct a frequency table and an appropriate diagram

Frequency Table

tab1(Poids_naissance$RACE)

Poids_naissance$RACE : 
        Frequency Percent Cum. percent
White          96    50.8         50.8
Black          26    13.8         64.6
Other          67    35.4        100.0
  Total       189   100.0        100.0

From the results above, 96 mothers were White, 26 were Black and 67 belonged to the Other category. These groups represent 50.8%, 13.8% and 35.4% of the sample, respectively. The bar chart that tab1() draws alongside the table is the visual representation of this distribution.
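The percentages in the frequency table can be checked in base R from the three counts alone. The sketch below assumes only the counts printed by tab1(); the object names are illustrative.

```r
# Counts per race category, as printed by tab1()
counts <- c(White = 96, Black = 26, Other = 67)

percent     <- round(100 * prop.table(counts), 1)               # share of each category
cum_percent <- round(100 * cumsum(counts) / sum(counts), 1)     # running total

freq_table <- data.frame(Frequency = counts,
                         Percent = percent,
                         Cum.percent = cum_percent)
freq_table

# A bar chart is a suitable diagram for a nominal variable
barplot(counts, ylab = "Frequency", main = "Distribution of RACE")
```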

6. Calculate the mean and standard deviations of continuous quantitative variables

sum_stat <- data.frame(Poids_naissance$AGE, Poids_naissance$LWT, Poids_naissance$BWT)

stargazer(Poids_naissance[,-1], type = "text")


==========================================
Statistic  N    Mean    St. Dev. Min  Max 
------------------------------------------
AGE       189  23.238    5.299   14   45  
LWT       189  129.815   30.579  80   250 
BWT       189 2,944.656 729.022  709 4,990
------------------------------------------

Summary Statistics for a few selected variables

st(sum_stat, col.breaks = 3,
   summ = list(
     c('notNA(x)','mean(x)', 'median(x)','sd(x)','min(x)','max(x)'),
     c('notNA(x)','mean(x)')
   ),
   summ.names = list(
     c('N','Mean','median','SD','Min','Max'),
     c('Count','Percent')
   ))

Summary Statistics
Variable	N	Mean	median	SD	Min	Max
Poids_naissance.AGE	189	23.238	23	5.299	14	45
Poids_naissance.LWT	189	129.815	121	30.579	80	250
Poids_naissance.BWT	189	2944.656	2977	729.022	709	4990

Summary Statistics of all variables

st(Poids_naissance[,-1], col.breaks = 11,
   summ = list(
     c('notNA(x)','mean(x)', 'median(x)','sd(x)','min(x)','max(x)', 'skew(x)'),
     c('notNA(x)','mean(x)')
   ),
   summ.names = list(
     c('N','Mean','median','SD','Min','Max', 'skew'),
     c('Count','Percent')
   ))

Summary Statistics
Variable	N	Mean	median	SD	Min	Max	skew
AGE	189	23.238	23	5.299	14	45	0.711
LWT	189	129.815	121	30.579	80	250	1.38
RACE	189
… White	96	50.8%
… Black	26	13.8%
… Other	67	35.4%
SMOKE	189
… 0	115	60.8%
… 1	74	39.2%
PTL	189
… 0	159	84.1%
… 1	24	12.7%
… 2	5	2.6%
… 3	1	0.5%
HT	189
… 0	177	93.7%
… 1	12	6.3%
UI	189
… 0	161	85.2%
… 1	28	14.8%
FVT	189
… 0	100	52.9%
… 1	47	24.9%
… 2	30	15.9%
… 3	7	3.7%
… 4	4	2.1%
… 6	1	0.5%
BWT	189	2944.656	2977	729.022	709	4990	-0.207
LOW	189
… 0	130	68.8%
… 1	59	31.2%

Alternative Summary Statistics

describe(Poids_naissance[,2:11])

Other Grouped Summary Statistics

Poids_naissance[, 2:10] %>%
  tbl_summary(by = SMOKE) %>%
  add_p() %>%
  add_overall() %>% 
  bold_labels()

Characteristic	Overall, N = 189¹	0, N = 115¹	1, N = 74¹	p-value²
AGE	23 (19, 26)	23 (20, 26)	22 (19, 26)	0.5
LWT	121 (110, 140)	124 (112, 142)	120 (107, 137)	0.2
RACE				<0.001
White	96 (51%)	44 (38%)	52 (70%)
Black	26 (14%)	16 (14%)	10 (14%)
Other	67 (35%)	55 (48%)	12 (16%)
PTL				0.036
0	159 (84%)	103 (90%)	56 (76%)
1	24 (13%)	10 (8.7%)	14 (19%)
2	5 (2.6%)	2 (1.7%)	3 (4.1%)
3	1 (0.5%)	0 (0%)	1 (1.4%)
HT				>0.9
0	177 (94%)	108 (94%)	69 (93%)
1	12 (6.3%)	7 (6.1%)	5 (6.8%)
UI				0.4
0	161 (85%)	100 (87%)	61 (82%)
1	28 (15%)	15 (13%)	13 (18%)
FVT				0.12
0	100 (53%)	55 (48%)	45 (61%)
1	47 (25%)	35 (30%)	12 (16%)
2	30 (16%)	19 (17%)	11 (15%)
3	7 (3.7%)	3 (2.6%)	4 (5.4%)
4	4 (2.1%)	3 (2.6%)	1 (1.4%)
6	1 (0.5%)	0 (0%)	1 (1.4%)
BWT	2,977 (2,414, 3,475)	3,100 (2,509, 3,622)	2,776 (2,370, 3,246)	0.007
¹ Median (IQR); n (%)
² Wilcoxon rank sum test; Pearson's Chi-squared test; Fisher's exact test

7. For the binary variable “LOW” (birth weight less than or equal to 2500 g), determine a Wilson confidence interval. Interpret

Get the proportion of success and failure

tab1(Poids_naissance$LOW)

Poids_naissance$LOW : 
        Frequency Percent Cum. percent
0             130    68.8         68.8
1              59    31.2        100.0
  Total       189   100.0        100.0

From the output above, there are 59 successes (birth weight less than or equal to 2500 g) and 130 failures. prop.test() reports a test of the null hypothesis that the true proportion equals a hypothesized value (0.5 by default) together with a confidence interval for the proportion. That interval is the Wilson score interval, computed with or without a continuity correction depending on the correct argument, so the function can be used to answer this question as shown below.

prop.test(59,189)


    1-sample proportions test with continuity correction

data:  59 out of 189
X-squared = 25.926, df = 1, p-value = 3.548e-07
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
 0.2479596 0.3841585
sample estimates:
        p 
0.3121693

prop.test(59,189, correct=FALSE)


    1-sample proportions test without continuity correction

data:  59 out of 189
X-squared = 26.672, df = 1, p-value = 2.411e-07
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
 0.2504031 0.3814188
sample estimates:
        p 
0.3121693

The sample size is 189, and out of those 189, 59 were successes. The sample proportion of successes (p) is 0.3121693. The results of the test with continuity correction show a chi-squared value of 25.926, with 1 degree of freedom and a p-value of 3.548e-07. The alternative hypothesis is that the true proportion (p) is not equal to 0.5. The 95% confidence interval for the true proportion ranges from 0.2479596 to 0.3841585.

The results of the test without continuity correction show a chi-squared value of 26.672, with 1 degree of freedom and a p-value of 2.411e-07. The alternative hypothesis is the same as before. The 95% confidence interval for the true proportion ranges from 0.2504031 to 0.3814188.

Both tests reject the null hypothesis that the true proportion equals 0.5, since the p-values are far below the 0.05 significance level. Consistently, neither confidence interval contains 0.5. The interval without continuity correction, 0.250 to 0.381, is the Wilson interval: with 95% confidence, between 25.0% and 38.1% of births in the population are low birth weight births. The two intervals differ only slightly, and the continuity correction makes the interval slightly wider.
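As a check on the interval reported without continuity correction, the Wilson score limits can be computed directly from the formula. This is a sketch under the usual definition of the Wilson interval; the variable names are illustrative.

```r
x <- 59          # low birth weight infants
n <- 189         # sample size
conf_level <- 0.95

p_hat <- x / n
z <- qnorm(1 - (1 - conf_level) / 2)   # about 1.96 for a 95% interval

# Wilson score interval: the centre is pulled towards 0.5, the width is adjusted by z^2 / n
centre     <- (p_hat + z^2 / (2 * n)) / (1 + z^2 / n)
half_width <- z * sqrt(p_hat * (1 - p_hat) / n + z^2 / (4 * n^2)) / (1 + z^2 / n)

wilson_ci <- c(lower = centre - half_width, upper = centre + half_width)
wilson_ci

# Same interval from stats::prop.test (called explicitly because mosaic masks prop.test)
builtin_ci <- stats::prop.test(x, n, correct = FALSE)$conf.int
builtin_ci
```

Calling the function as stats::prop.test() avoids any dependence on which packages are attached.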

8. In 1995, researchers explored the link between uterine irritability “UI” and preterm labor. They found that 18.7% of women with uterine irritability had premature labor, compared to 11% of women who did not have this complication. At the 5% threshold, perform a hypothesis test on the proportion of women who suffer from UI. Conclude

tab1(Poids_naissance$UI)

Poids_naissance$UI : 
        Frequency Percent Cum. percent
0             161    85.2         85.2
1              28    14.8        100.0
  Total       189   100.0        100.0

# Input the counts

n <- 28
x <- (18.7 / 100)*n

In the code x <- (18.7 / 100)*n, x is the number of women with UI who had preterm labor. It is obtained by multiplying the proportion of women with UI who had preterm labor (18.7%) by the number of women with UI in the sample (28), which gives x = 0.187 × 28 = 5.236.

So x is the count of women with UI and preterm labor implied by the published proportion. It is not an observed count, which is why it is not a whole number.

# Perform the test
prop.test(x, n, p = 0.5, alternative = "two.sided", conf.level = 0.95)


    1-sample proportions test with continuity correction

data:  x out of n
X-squared = 9.7562, df = 1, p-value = 0.001787
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
 0.07286863 0.38510001
sample estimates:
    p 
0.187

The p-value is 0.001787, which is less than 0.05, so we reject the null hypothesis that the true proportion equals 0.5. The 95% confidence interval for the true proportion ranges from 0.0729 to 0.3851 and does not include 0.5. At the 5% level of significance, the proportion of women with UI who had preterm labor is therefore significantly different from 0.5, and lower than it.

This one-sample test compares the 18.7% figure with the hypothesized value 0.5 only. It does not compare women with UI to women without this complication (11%), so it supports no conclusion about a difference between those two groups.
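The test statistic printed by prop.test() can be reproduced by hand, which makes the role of the hypothesized value p = 0.5 and of the continuity correction explicit. A minimal sketch, assuming the same inputs as the call above:

```r
n  <- 28            # women with UI in the sample
x  <- 0.187 * n     # implied number with preterm labor (5.236)
p0 <- 0.5           # hypothesized proportion under the null hypothesis

# Yates continuity correction: subtract 0.5 from |observed - expected|
yates     <- min(0.5, abs(x - n * p0))
x_squared <- (abs(x - n * p0) - yates)^2 / (n * p0 * (1 - p0))

# Upper-tail probability of a chi-squared distribution with 1 degree of freedom
p_value <- pchisq(x_squared, df = 1, lower.tail = FALSE)

c(X_squared = x_squared, p_value = p_value)
```

Changing p0 to another reference value changes the null hypothesis being tested without touching the rest of the calculation.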

9. Create a contingency table crossing the “LOW” response and smoking during pregnancy

A contingency table, also known as a cross-tabulation table, is a table that displays the distribution of two or more categorical variables. The table presents the frequency or count of each combination of the variables in rows and columns.

Each row in the table represents a level of one categorical variable, while each column represents a level of the other categorical variable. The cells in the table contain the number of observations that fall into each combination of the two variables. The table can also display percentages or proportions instead of counts, depending on the purpose of the analysis.

Contingency tables are commonly used in statistical analysis to examine the relationship between two or more categorical variables, such as gender and occupation or region of residence and political party affiliation. They can help identify patterns and associations between variables, and can be used to test hypotheses about the relationship between them. Chi-square tests and Fisher’s exact test are commonly used statistical tests for analyzing contingency tables.

dat <- Poids_naissance[,c(5,11)] %>%
  tbl_summary(by = LOW) %>%
  add_p() %>%
  add_overall() %>% 
  bold_labels()
dat

Characteristic	Overall, N = 189¹	0, N = 130¹	1, N = 59¹	p-value²
SMOKE				0.026
0	115 (61%)	86 (66%)	29 (49%)
1	74 (39%)	44 (34%)	30 (51%)
¹ n (%)
² Pearson's Chi-squared test

Summary Statistics of other Variables

Poids_naissance [,c(2,3,5,10)] %>%
  tbl_summary(by = SMOKE) %>%
  add_p() %>%
  add_overall() %>% 
  bold_labels()

Characteristic	Overall, N = 189¹	0, N = 115¹	1, N = 74¹	p-value²
AGE	23 (19, 26)	23 (20, 26)	22 (19, 26)	0.5
LWT	121 (110, 140)	124 (112, 142)	120 (107, 137)	0.2
BWT	2,977 (2,414, 3,475)	3,100 (2,509, 3,622)	2,776 (2,370, 3,246)	0.007
¹ Median (IQR)
² Wilcoxon rank sum test

The table represents a contingency table with two categorical variables: “SMOKE” and “LOW”, and the corresponding counts or percentages for each combination of the two variables.

The “LOW” variable has two levels: 0 and 1, with 130 observations in level 0 and 59 observations in level 1. The “SMOKE” variable also has two levels: 0 and 1, with 115 and 74 observations respectively, for 189 observations in total.

The p-value of 0.026 is the result of Pearson’s Chi-squared test of the association between the two variables. The null hypothesis is that the two variables are independent, meaning that the distribution of one variable does not depend on the other. The alternative hypothesis is that the two variables are dependent. The p-value of 0.026 is less than the commonly used threshold of 0.05, so we reject the null hypothesis and conclude that there is a statistically significant association between the “SMOKE” variable and the “LOW” variable. Specifically, mothers of low birth weight infants (LOW = 1) were more likely to smoke than mothers of normal weight infants (LOW = 0): 51% versus 34%. Equivalently, women who smoke during pregnancy are more likely to give birth to a child weighing 2500 g or less.
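The chi-squared statistic behind this p-value can be rebuilt from the four cell counts by comparing observed counts with the counts expected under independence. The sketch below re-enters the counts from the table above; tab and the other names are illustrative.

```r
# Observed counts: rows are SMOKE (0, 1), columns are LOW (0, 1)
tab <- matrix(c(86, 29,
                44, 30),
              nrow = 2, byrow = TRUE,
              dimnames = list(SMOKE = c("0", "1"), LOW = c("0", "1")))

# Expected counts under independence: (row total * column total) / grand total
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)

# Pearson statistic and its p-value on (2 - 1) * (2 - 1) = 1 degree of freedom
x_squared <- sum((tab - expected)^2 / expected)
p_value   <- pchisq(x_squared, df = 1, lower.tail = FALSE)
c(X_squared = x_squared, p_value = p_value)

# Same test with the built-in function, without continuity correction
builtin <- stats::chisq.test(tab, correct = FALSE)
builtin$p.value
```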

10. Is smoking during pregnancy a potential factor in causing premature birth? To answer this question, present the three measures of comparison and interpret each of these measurements. :

Proportion difference test
Relative Risk Confidence Interval
Odds ratio confidence interval

The Pearson Chi-squared test in question 9 (p-value = 0.026) already shows a statistically significant association between “SMOKE” and “LOW”. Smoking during pregnancy is therefore a potential risk factor for low birth weight, although an association in observational data does not by itself establish causation. The three measures below quantify this association.

Odds Ratio Confidence Interval

TBL<-table(Poids_naissance$SMOKE, Poids_naissance$LOW) 
TBL

Perform the test

test <- fisher.test(TBL)
test


    Fisher's Exact Test for Count Data

data:  TBL
p-value = 0.03618
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 1.028780 3.964904
sample estimates:
odds ratio 
  2.014137

Fisher’s Exact Test assesses the association between two categorical variables and also returns an odds ratio estimate with its confidence interval. Applied to the count data in TBL, it gives a p-value of 0.03618. The null hypothesis is that there is no association between the two variables (odds ratio equal to 1), while the alternative hypothesis is that there is an association.

The estimated odds ratio is 2.014137: the odds of a low birth weight infant among women who smoked during pregnancy are about twice the odds among women who did not. The 95% confidence interval for the odds ratio ranges from 1.028780 to 3.964904. Intervals constructed this way cover the true odds ratio in 95% of samples, and this one excludes 1. Since the p-value is less than 0.05, we reject the null hypothesis and conclude that there is evidence of an association between smoking and low birth weight.
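To make the direction of the odds ratio explicit, the odds can be computed separately in each group. This sketch re-enters the SMOKE by LOW counts by hand. The cross-product estimate it prints is close to, but not identical to, the conditional maximum likelihood estimate that fisher.test() reports.

```r
# Rows: SMOKE (0 = no, 1 = yes); columns: LOW (0 = no, 1 = yes)
TBL <- matrix(c(86, 29,
                44, 30),
              nrow = 2, byrow = TRUE,
              dimnames = list(SMOKE = c("0", "1"), LOW = c("0", "1")))

# Odds of low birth weight in each group
odds_nonsmokers <- TBL["0", "1"] / TBL["0", "0"]   # 29 / 86
odds_smokers    <- TBL["1", "1"] / TBL["1", "0"]   # 30 / 44

# Sample (cross-product) odds ratio, smokers versus non-smokers
sample_or <- odds_smokers / odds_nonsmokers
sample_or

# Fisher's exact test: conditional MLE of the odds ratio and exact confidence interval
fit <- stats::fisher.test(TBL)
fit$estimate
fit$conf.int
```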

Proportion Difference Test

tbl <- matrix(c(86, 29, 44, 30), nrow = 2, byrow = TRUE)
colnames(tbl) <- c("LOW[no]", "LOW[yes]")
rownames(tbl) <- c("SMOKE[no]", "SMOKE[yes]")

View the table

tbl

           LOW[no] LOW[yes]
SMOKE[no]       86       29
SMOKE[yes]      44       30

# conduct the test
stats::prop.test(x = tbl[, "LOW[yes]"], n = rowSums(tbl), correct = FALSE)

The counts are entered in the same order as in TBL, so the first row is the non-smokers and the first column is the normal weight infants. Because mosaic masks prop.test(), the function is called from the stats package explicitly: with mosaic attached, prop.test(tbl) does not compare the two groups. The low birth weight counts are passed as successes and the row totals as group sizes.

The test compares the proportion of low birth weight infants between the two groups: 29 of 115 non-smokers (25.2%) and 30 of 74 smokers (40.5%), a difference of about 15.3 percentage points. The null hypothesis is that the two proportions are equal (π1 − π2 = 0), while the alternative hypothesis is that they differ.

Without continuity correction, this test is equivalent to Pearson’s Chi-squared test on the 2 × 2 table, whose p-value of 0.026 was reported in question 9. Since the p-value is less than the commonly used significance level of 0.05, we reject the null hypothesis and conclude that the proportion of low birth weight infants is significantly higher among women who smoked during pregnancy.

Relative Risk Confidence Interval

epitools::riskratio(tbl,rev='both',method = 'wald')

$data
           LOW[yes] LOW[no] Total
SMOKE[yes]       30      44    74
SMOKE[no]        29      86   115
Total            59     130   189

$measure
                        NA
risk ratio with 95% C.I. estimate    lower    upper
              SMOKE[yes] 1.000000       NA       NA
              SMOKE[no]  1.257708 1.013374 1.560953

$p.value
            NA
two-sided    midp.exact fisher.exact chi.square
  SMOKE[yes]         NA           NA         NA
  SMOKE[no]  0.02914865    0.0361765 0.02649064

$correction
[1] FALSE

attr(,"method")
[1] "Unconditional MLE & normal approximation (Wald) CI"

The output above comes from epitools::riskratio() applied to the 2 × 2 table. Because rev = 'both' reverses both the rows and the columns, the “data” section lists the smokers first, and the function treats the second column, LOW[no], as the outcome and the first row, SMOKE[yes], as the reference group.

The “measure” section shows the estimated relative risk (risk ratio) and its 95% confidence interval (CI). The estimate of 1.2577 is a ratio of the probabilities of a normal birth weight: 86/115 = 74.8% among non-smokers against 44/74 = 59.5% among smokers. Non-smokers are thus about 1.26 times as likely as smokers to have an infant of normal weight, and the 95% CI, from 1.0134 to 1.5610, excludes 1. Expressed for the event of interest, the risk of low birth weight is 30/74 = 40.5% among smokers against 29/115 = 25.2% among non-smokers, a relative risk of about 1.61.

The “p.value” section shows p-values for the null hypothesis that the outcome proportions in the two groups are equal, computed by three methods: mid-p exact (0.0291), Fisher’s exact test (0.0362) and the chi-squared test (0.0265). All three are less than the significance level of 0.05, so there is evidence of a significant difference between the two groups.

The “correction” section of the output indicates whether a continuity correction was applied to the test statistic. In this case, the continuity correction was not applied. The “method” attribute indicates that the unconditional maximum likelihood estimation (MLE) and normal approximation (Wald) CI method was used to estimate the relative risk and its confidence interval.
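The Wald interval for a risk ratio can be written in a few lines, which helps to see which two proportions a given estimate compares. The helper wald_rr() below is a hypothetical illustration, not part of epitools. It uses the standard normal approximation on the log scale and the counts from the SMOKE by LOW table (86 and 29 for non-smokers, 44 and 30 for smokers).

```r
# Wald confidence interval for a risk ratio (a / n1) / (b / n2)
# a, b: event counts in the two groups; n1, n2: group sizes
wald_rr <- function(a, n1, b, n2, conf_level = 0.95) {
  rr     <- (a / n1) / (b / n2)
  se_log <- sqrt(1 / a - 1 / n1 + 1 / b - 1 / n2)   # standard error of log(RR)
  z      <- qnorm(1 - (1 - conf_level) / 2)
  c(estimate = rr,
    lower = exp(log(rr) - z * se_log),
    upper = exp(log(rr) + z * se_log))
}

# Normal birth weight, non-smokers (86 of 115) versus smokers (44 of 74):
# this is the ratio printed by riskratio() above
rr_normal <- wald_rr(86, 115, 44, 74)
rr_normal

# Low birth weight, smokers (30 of 74) versus non-smokers (29 of 115)
rr_low <- wald_rr(30, 74, 29, 115)
rr_low
```

Writing the event and the reference group as explicit arguments avoids relying on the row and column order of a table.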

11. Redo question 10 for the factor “UI” and the answer “LOW”

TBL2<-table(Poids_naissance$UI, Poids_naissance$LOW) 
TBL2

tbl_1 <- matrix(c(116, 45, 14, 14), nrow = 2, byrow = TRUE)
colnames(tbl_1) <- c("LOW[no]", "LOW[yes]")
rownames(tbl_1) <- c("UI[no]", "UI[yes]")

View the two-way table

tbl_1

        LOW[no] LOW[yes]
UI[no]      116       45
UI[yes]      14       14

Odds Ratio Confidence Interval

test <- fisher.test(tbl_1)
test


    Fisher's Exact Test for Count Data

data:  tbl_1
p-value = 0.02692
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 1.041921 6.324135
sample estimates:
odds ratio 
  2.563399

The p-value for the test is 0.02692, which is less than the significance level of 0.05, so there is evidence of a significant association between uterine irritability and low birth weight. The alternative hypothesis is that the true odds ratio is not equal to 1.

The 95% confidence interval for the odds ratio ranges from 1.0419 to 6.3241. Since the interval excludes 1, the odds of low birth weight differ significantly between women with and without UI.

The estimated odds ratio is 2.5634: the odds of a low birth weight infant are about 2.6 times higher among women with UI than among women without it.

Proportion Difference Test

stats::prop.test(x = tbl_1[, "LOW[yes]"], n = rowSums(tbl_1), correct = FALSE)

As in question 10, the function is called from the stats package explicitly because mosaic masks prop.test(), and the low birth weight counts are passed as successes with the row totals as group sizes.

The test compares the proportion of low birth weight infants between the two groups: 45 of 161 women without UI (28.0%) and 14 of 28 women with UI (50.0%), a difference of 22 percentage points. The null hypothesis is that the two proportions are equal, while the alternative hypothesis is that they differ.

Without continuity correction, this test is equivalent to the chi-squared test on the 2 × 2 table, whose p-value of 0.0201 appears as chi.square in the riskratio() output below. Since the p-value is less than 0.05, we reject the null hypothesis and conclude that the proportion of low birth weight infants is significantly higher among women with UI.

Relative Risk Confidence Interval

epitools::riskratio(tbl_1,rev='both',method = 'wald')

$data
        LOW[yes] LOW[no] Total
UI[yes]       14      14    28
UI[no]        45     116   161
Total         59     130   189

$measure
                        NA
risk ratio with 95% C.I. estimate     lower    upper
                 UI[yes] 1.000000        NA       NA
                 UI[no]  1.440994 0.9827936 2.112817

$p.value
         NA
two-sided midp.exact fisher.exact chi.square
  UI[yes]         NA           NA         NA
  UI[no]  0.02648209   0.02691811 0.02012792

$correction
[1] FALSE

attr(,"method")
[1] "Unconditional MLE & normal approximation (Wald) CI"

The contingency table used in the analysis shows the counts of two categorical variables, UI and LOW, with two levels each. Because rev = 'both' reverses both the rows and the columns, the function treats LOW[no] as the outcome and UI[yes] as the reference group. The measure section of the output shows the estimated risk ratio with a 95% confidence interval. The estimate of 1.44 is a ratio of the probabilities of a normal birth weight: 116/161 = 72.0% among women without UI against 14/28 = 50.0% among women with UI. Women without UI are thus about 1.44 times as likely to have an infant of normal weight. Expressed for the event of interest, the risk of low birth weight is 50.0% among women with UI against 45/161 = 28.0% among women without UI, a relative risk of about 1.79. The 95% confidence interval for the risk ratio of 1.44 ranges from 0.98 to 2.11. It includes 1, so on its own this Wald interval does not establish a difference in risk between the two groups at the 5% level.

The p-value section of the output shows p-values for the two-sided test of the null hypothesis that the risk ratio is equal to 1, computed by three methods: mid-p exact (0.026), Fisher’s exact test (0.027) and the chi-squared test (0.020). All three are less than the typical significance level of 0.05, so these tests reject the null hypothesis. The tests and the Wald interval rely on different approximations, and they can disagree in borderline cases such as this one, where the UI group contains only 28 women. Overall there is evidence of an association between UI and low birth weight, but the size of the relative risk is estimated imprecisely.

The correction section of the output shows that no continuity correction was applied in the analysis. Finally, the method section of the output describes the statistical method used to compute the confidence interval. In this case, an unconditional maximum likelihood estimate (MLE) was used with a normal approximation (Wald) confidence interval.
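The comparison of the two low birth weight proportions for UI can also be run directly from the counts. This is a sketch that assumes the UI by LOW counts from TBL2 (116 and 45 without UI, 14 and 14 with UI); the vector names are illustrative.

```r
# Low birth weight counts and group sizes, by uterine irritability status
ui_low <- c(no_UI = 45, UI = 14)
ui_n   <- c(no_UI = 161, UI = 28)

# Two-sample test of equal proportions, without continuity correction;
# stats:: is used explicitly because mosaic masks prop.test
test_ui <- stats::prop.test(x = ui_low, n = ui_n, correct = FALSE)

test_ui$estimate   # sample proportions in the two groups
test_ui$p.value    # same p-value as the chi-squared test on the 2 x 2 table
test_ui$conf.int   # confidence interval for the difference in proportions
```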

Hypothesis Testing: Birth_Weight

LUMUMBA W. VICTOR

2023-02-27
