EXERCISE WEEK 6

Author

Kamran Bin Lateef(N1340143)

LOADING PACKAGES

library(tibble)
library(tidyverse)
library(vtable)
library(ggplot2)
library(gplots)
library(graphics)
library(vcd)
library(corrplot)
library(dplyr)
library(tidyr)
library(knitr)
library(gmodels)

One-Proportion Z-Test in R

The One proportion Z-test is used to compare an observed proportion to a theoretical one, when there are only two categories. This article describes the basics of one-proportion z-test and provides practical examples using R software.

For example, we have a population of mice containing half male and have female (p = 0.5 = 50%). Some of these mice (n = 160) have developed a spontaneous cancer, including 95 male and 65 female.

R functions: binom.test() & prop.test()

The R functions binom.test() and prop.test() can be used to perform one-proportion test:

binom.test(): compute exact binomial test. Recommended when sample size is small prop.test(): can be used when sample size is large ( N > 30). It uses a normal approximation to binomial

res <- prop.test(x = 95, n = 160, p = 0.5, 
                 correct = FALSE)
# Printing the results
res 

    1-sample proportions test without continuity correction

data:  95 out of 160, null probability 0.5
X-squared = 5.625, df = 1, p-value = 0.01771
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
 0.5163169 0.6667870
sample estimates:
      p 
0.59375 
prop.test(x = 95, n = 160, p = 0.5, correct = FALSE,
           alternative = "less")

    1-sample proportions test without continuity correction

data:  95 out of 160, null probability 0.5
X-squared = 5.625, df = 1, p-value = 0.9911
alternative hypothesis: true p is less than 0.5
95 percent confidence interval:
 0.0000000 0.6555425
sample estimates:
      p 
0.59375 
# printing the p-value
res$p.value
[1] 0.01770607
# printing the mean
res$estimate
      p 
0.59375 
# printing the confidence interval
res$conf.int
[1] 0.5163169 0.6667870
attr(,"conf.level")
[1] 0.95

Two-Proportions Z-Test in R

The two-proportions z-test is used to compare two observed proportions. This article describes the basics of two-proportions *z-test and provides pratical examples using R sfoftware

For example, we have two groups of individuals:

Group A with lung cancer: n = 500 Group B, healthy individuals: n = 500 The number of smokers in each group is as follow:

Group A with lung cancer: n = 500, 490 smokers, pA=490/500=98

Compute two-proportions z-test

res <- prop.test(x = c(490, 400), n = c(500, 500))
# Printing the results
res 

    2-sample test for equality of proportions with continuity correction

data:  c(490, 400) out of c(500, 500)
X-squared = 80.909, df = 1, p-value < 2.2e-16
alternative hypothesis: two.sided
95 percent confidence interval:
 0.1408536 0.2191464
sample estimates:
prop 1 prop 2 
  0.98   0.80 
prop.test(x = c(490, 400), n = c(500, 500),
           alternative = "less")

    2-sample test for equality of proportions with continuity correction

data:  c(490, 400) out of c(500, 500)
X-squared = 80.909, df = 1, p-value = 1
alternative hypothesis: less
95 percent confidence interval:
 -1.0000000  0.2131742
sample estimates:
prop 1 prop 2 
  0.98   0.80 
# printing the p-value
res$p.value
[1] 2.363439e-19
# printing the mean
res$estimate
prop 1 prop 2 
  0.98   0.80 
# printing the confidence interval
res$conf.int
[1] 0.1408536 0.2191464
attr(,"conf.level")
[1] 0.95

Chi-square Goodness of Fit Test in R

The chi-square goodness of fit test is used to compare the observed distribution to an expected distribution, in a situation where we have two or more categories in a discrete data. In other words, it compares multiple observed proportions to expected probabilities.

R function: chisq.test()

tulip <- c(81, 50, 27)
res <- chisq.test(tulip, p = c(1/3, 1/3, 1/3))
res

    Chi-squared test for given probabilities

data:  tulip
X-squared = 27.886, df = 2, p-value = 8.803e-07
# Access to the expected values
res$expected
[1] 52.66667 52.66667 52.66667
tulip <- c(81, 50, 27)
res <- chisq.test(tulip, p = c(1/2, 1/3, 1/6))
res

    Chi-squared test for given probabilities

data:  tulip
X-squared = 0.20253, df = 2, p-value = 0.9037
# printing the p-value
res$p.value
[1] 0.9036928
# printing the mean
res$estimate
NULL

Chi-Square Test of Independence in R

# Import the data
file_path <- "http://www.sthda.com/sthda/RDoc/data/housetasks.txt"
housetasks <- read.delim(file_path, row.names = 1)
head(housetasks)
           Wife Alternating Husband Jointly
Laundry     156          14       2       4
Main_meal   124          20       5       4
Dinner       77          11       7      13
Breakfeast   82          36      15       7
Tidying      53          11       1      57
Dishes       32          24       4      53

1. convert the data as a table

dt <- as.table(as.matrix(housetasks))
# 2. Graph
balloonplot(t(dt), main ="housetasks", xlab ="", ylab="",
            label = FALSE, show.margins = FALSE)

Mosaicplot of the data

mosaicplot(dt, shade = TRUE, las=2,
           main = "housetasks")

Ploting a subset of the table

assoc(head(dt, 5), shade = TRUE, las=3)

Compute chi-square test in R

chisq <- chisq.test(housetasks)
chisq

    Pearson's Chi-squared test

data:  housetasks
X-squared = 1944.5, df = 36, p-value < 2.2e-16

Observed Counts

chisq$observed
           Wife Alternating Husband Jointly
Laundry     156          14       2       4
Main_meal   124          20       5       4
Dinner       77          11       7      13
Breakfeast   82          36      15       7
Tidying      53          11       1      57
Dishes       32          24       4      53
Shopping     33          23       9      55
Official     12          46      23      15
Driving      10          51      75       3
Finances     13          13      21      66
Insurance     8           1      53      77
Repairs       0           3     160       2
Holidays      0           1       6     153

Expected counts

round(chisq$expected,2)
            Wife Alternating Husband Jointly
Laundry    60.55       25.63   38.45   51.37
Main_meal  52.64       22.28   33.42   44.65
Dinner     37.16       15.73   23.59   31.52
Breakfeast 48.17       20.39   30.58   40.86
Tidying    41.97       17.77   26.65   35.61
Dishes     38.88       16.46   24.69   32.98
Shopping   41.28       17.48   26.22   35.02
Official   33.03       13.98   20.97   28.02
Driving    47.82       20.24   30.37   40.57
Finances   38.88       16.46   24.69   32.98
Insurance  47.82       20.24   30.37   40.57
Repairs    56.77       24.03   36.05   48.16
Holidays   55.05       23.30   34.95   46.70
round(chisq$residuals, 3)
             Wife Alternating Husband Jointly
Laundry    12.266      -2.298  -5.878  -6.609
Main_meal   9.836      -0.484  -4.917  -6.084
Dinner      6.537      -1.192  -3.416  -3.299
Breakfeast  4.875       3.457  -2.818  -5.297
Tidying     1.702      -1.606  -4.969   3.585
Dishes     -1.103       1.859  -4.163   3.486
Shopping   -1.289       1.321  -3.362   3.376
Official   -3.659       8.563   0.443  -2.459
Driving    -5.469       6.836   8.100  -5.898
Finances   -4.150      -0.852  -0.742   5.750
Insurance  -5.758      -4.277   4.107   5.720
Repairs    -7.534      -4.290  20.646  -6.651
Holidays   -7.419      -4.620  -4.897  15.556
corrplot(chisq$residuals, is.cor = FALSE)

Contibution in percentage (%)

contrib <- 100*chisq$residuals^2/chisq$statistic
round(contrib, 3)
            Wife Alternating Husband Jointly
Laundry    7.738       0.272   1.777   2.246
Main_meal  4.976       0.012   1.243   1.903
Dinner     2.197       0.073   0.600   0.560
Breakfeast 1.222       0.615   0.408   1.443
Tidying    0.149       0.133   1.270   0.661
Dishes     0.063       0.178   0.891   0.625
Shopping   0.085       0.090   0.581   0.586
Official   0.688       3.771   0.010   0.311
Driving    1.538       2.403   3.374   1.789
Finances   0.886       0.037   0.028   1.700
Insurance  1.705       0.941   0.868   1.683
Repairs    2.919       0.947  21.921   2.275
Holidays   2.831       1.098   1.233  12.445

Visualize the contribution

corrplot(contrib, is.cor = FALSE)

EXERCISE 4

Loading Data

mpg
# A tibble: 234 × 11
   manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
   <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
 1 audi         a4           1.8  1999     4 auto… f        18    29 p     comp…
 2 audi         a4           1.8  1999     4 manu… f        21    29 p     comp…
 3 audi         a4           2    2008     4 manu… f        20    31 p     comp…
 4 audi         a4           2    2008     4 auto… f        21    30 p     comp…
 5 audi         a4           2.8  1999     6 auto… f        16    26 p     comp…
 6 audi         a4           2.8  1999     6 manu… f        18    26 p     comp…
 7 audi         a4           3.1  2008     6 auto… f        18    27 p     comp…
 8 audi         a4 quattro   1.8  1999     4 manu… 4        18    26 p     comp…
 9 audi         a4 quattro   1.8  1999     4 auto… 4        16    25 p     comp…
10 audi         a4 quattro   2    2008     4 manu… 4        20    28 p     comp…
# ℹ 224 more rows

GROUP AND SUMMEARIZE

mpg%>%
  group_by(class, cyl)%>%
  summarize(n=n())%>%
  kable()
`summarise()` has grouped output by 'class'. You can override using the
`.groups` argument.
class cyl n
2seater 8 5
compact 4 32
compact 5 2
compact 6 13
midsize 4 16
midsize 6 23
midsize 8 2
minivan 4 1
minivan 6 10
pickup 4 3
pickup 6 10
pickup 8 20
subcompact 4 21
subcompact 5 2
subcompact 6 7
subcompact 8 5
suv 4 8
suv 6 16
suv 8 38

4.3 Contingency table of data (mpg)

# 1. Contingency table
mpg_counts <- mpg %>%
  group_by(class, cyl) %>%
  summarise(n = n(), .groups = "drop") %>%
  spread(cyl, n, fill = 0)

# Convert to matrix format for the chi-square test
mpg_matrix <- as.matrix(mpg_counts[, -1])
rownames(mpg_matrix) <- mpg_counts$class

4.4 Chi-square test (mgp)

# 1. Chi-square test of the data
chisq <- chisq.test(mpg_matrix)
Warning in chisq.test(mpg_matrix): Chi-squared approximation may be incorrect
chisq

    Pearson's Chi-squared test

data:  mpg_matrix
X-squared = 138.03, df = 18, p-value < 2.2e-16
# 2. Observed counts
chisq$observed
            4 5  6  8
2seater     0 0  0  5
compact    32 2 13  0
midsize    16 0 23  2
minivan     1 0 10  0
pickup      3 0 10 20
subcompact 21 2  7  5
suv         8 0 16 38
# 3. Expected counts
round(chisq$expected,2)
               4    5     6     8
2seater     1.73 0.09  1.69  1.50
compact    16.27 0.80 15.87 14.06
midsize    14.19 0.70 13.84 12.26
minivan     3.81 0.19  3.71  3.29
pickup     11.42 0.56 11.14  9.87
subcompact 12.12 0.60 11.82 10.47
suv        21.46 1.06 20.93 18.55

4.5 Extraction of the Pearson residual (mpg)

# 1. extraction of the Pearson residual
round(chisq$residuals, 3)
                4      5      6      8
2seater    -1.316 -0.292 -1.299  2.865
compact     3.900  1.335 -0.720 -3.750
midsize     0.480 -0.837  2.462 -2.931
minivan    -1.439 -0.434  3.262 -1.814
pickup     -2.492 -0.751 -0.342  3.224
subcompact  2.553  1.812 -1.401 -1.691
suv        -2.906 -1.029 -1.078  4.517
# 2. Visualizing Pearson residual
corrplot(chisq$residuals, is.cor = FALSE)

4.6 CrossTable of data (mpg)

# Create a crosstable 
CrossTable(mpg$class, mpg$cyl)

 
   Cell Contents
|-------------------------|
|                       N |
| Chi-square contribution |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  234 

 
             | mpg$cyl 
   mpg$class |         4 |         5 |         6 |         8 | Row Total | 
-------------|-----------|-----------|-----------|-----------|-----------|
     2seater |         0 |         0 |         0 |         5 |         5 | 
             |     1.731 |     0.085 |     1.688 |     8.210 |           | 
             |     0.000 |     0.000 |     0.000 |     1.000 |     0.021 | 
             |     0.000 |     0.000 |     0.000 |     0.071 |           | 
             |     0.000 |     0.000 |     0.000 |     0.021 |           | 
-------------|-----------|-----------|-----------|-----------|-----------|
     compact |        32 |         2 |        13 |         0 |        47 | 
             |    15.210 |     1.782 |     0.518 |    14.060 |           | 
             |     0.681 |     0.043 |     0.277 |     0.000 |     0.201 | 
             |     0.395 |     0.500 |     0.165 |     0.000 |           | 
             |     0.137 |     0.009 |     0.056 |     0.000 |           | 
-------------|-----------|-----------|-----------|-----------|-----------|
     midsize |        16 |         0 |        23 |         2 |        41 | 
             |     0.230 |     0.701 |     6.059 |     8.591 |           | 
             |     0.390 |     0.000 |     0.561 |     0.049 |     0.175 | 
             |     0.198 |     0.000 |     0.291 |     0.029 |           | 
             |     0.068 |     0.000 |     0.098 |     0.009 |           | 
-------------|-----------|-----------|-----------|-----------|-----------|
     minivan |         1 |         0 |        10 |         0 |        11 | 
             |     2.070 |     0.188 |    10.641 |     3.291 |           | 
             |     0.091 |     0.000 |     0.909 |     0.000 |     0.047 | 
             |     0.012 |     0.000 |     0.127 |     0.000 |           | 
             |     0.004 |     0.000 |     0.043 |     0.000 |           | 
-------------|-----------|-----------|-----------|-----------|-----------|
      pickup |         3 |         0 |        10 |        20 |        33 | 
             |     6.211 |     0.564 |     0.117 |    10.391 |           | 
             |     0.091 |     0.000 |     0.303 |     0.606 |     0.141 | 
             |     0.037 |     0.000 |     0.127 |     0.286 |           | 
             |     0.013 |     0.000 |     0.043 |     0.085 |           | 
-------------|-----------|-----------|-----------|-----------|-----------|
  subcompact |        21 |         2 |         7 |         5 |        35 | 
             |     6.515 |     3.284 |     1.963 |     2.858 |           | 
             |     0.600 |     0.057 |     0.200 |     0.143 |     0.150 | 
             |     0.259 |     0.500 |     0.089 |     0.071 |           | 
             |     0.090 |     0.009 |     0.030 |     0.021 |           | 
-------------|-----------|-----------|-----------|-----------|-----------|
         suv |         8 |         0 |        16 |        38 |        62 | 
             |     8.444 |     1.060 |     1.162 |    20.403 |           | 
             |     0.129 |     0.000 |     0.258 |     0.613 |     0.265 | 
             |     0.099 |     0.000 |     0.203 |     0.543 |           | 
             |     0.034 |     0.000 |     0.068 |     0.162 |           | 
-------------|-----------|-----------|-----------|-----------|-----------|
Column Total |        81 |         4 |        79 |        70 |       234 | 
             |     0.346 |     0.017 |     0.338 |     0.299 |           | 
-------------|-----------|-----------|-----------|-----------|-----------|