Getting Familiar with Categorical Data in R

Homework #1 is worth 100 points and each question is worth 6.5 points each.

Submission Instructions: save the .HTML file as ‘Familiar_ Categorical_Data_Assignmentyourlastname.HTML’ and upload the HTML file to the assignment entitled ‘Getting Familiar with Categorical Data in R’ on Canvas on or before Wednesday November 13, 2019 by 11:59p.m. EST. No late assignments are accepted.

2.1 p.p. 60-61

Run the code chunk below.

library(vcd)

## Loading required package: grid

library(grid)
library(gnm)
library(vcdExtra)

ds <- datasets(package = c("vcd", "vcdExtra"))
str(ds, vec.len=2)

## 'data.frame':    76 obs. of  5 variables:
##  $ Package: chr  "vcd" "vcd" ...
##  $ Item   : chr  "Arthritis" "Baseball" ...
##  $ class  : chr  "data.frame" "data.frame" ...
##  $ dim    : chr  "84x5" "322x25" ...
##  $ Title  : chr  "Arthritis Treatment Data" "Baseball Data" ...

View(ds)

View(UCBAdmissions)
str(UCBAdmissions)

##  'table' num [1:2, 1:2, 1:6] 512 313 89 19 353 207 17 8 120 205 ...
##  - attr(*, "dimnames")=List of 3
##   ..$ Admit : chr [1:2] "Admitted" "Rejected"
##   ..$ Gender: chr [1:2] "Male" "Female"
##   ..$ Dept  : chr [1:6] "A" "B" "C" "D" ...

How many data sets are there altogether? How many are there in each package?

nrow(ds)

## [1] 76

ds1 = datasets(package = "vcd")
nrow(ds1)

## [1] 33

ds2 = datasets(package = "vcdExtra")
nrow(ds2)

## [1] 43

Make a tabular display of the frequencies by Package and class.

table(ds$Package, ds$class)

##           
##            array data.frame matrix table
##   vcd          1         17      0    15
##   vcdExtra     3         24      1    15

Choose one or two data sets from this list, and examine their help files (e.g., help(Arthritis) or ?Arthritis). You can use, e.g., example(Arthritis) to run the R code for a given example.

help(Arthritis)

## starting httpd help server ... done

example(Arthritis)

## 
## Arthrt> data("Arthritis")
## 
## Arthrt> art <- xtabs(~ Treatment + Improved, data = Arthritis, subset = Sex == "Female")
## 
## Arthrt> art
##          Improved
## Treatment None Some Marked
##   Placebo   19    7      6
##   Treated    6    5     16
## 
## Arthrt> mosaic(art, gp = shading_Friendly)

## 
## Arthrt> mosaic(art, gp = shading_max)

help(TV)
example(TV)

## 
## TV> data(TV)
## 
## TV> structable(TV)
##                   Time 8:00 8:15 8:30 8:45 9:00 9:15 9:30 9:45 10:00 10:15 10:30
## Day       Network                                                               
## Monday    ABC           146  151  156   83  325  350  386  340   352   280   278
##           CBS           337  293  304  233  311  251  241  164   252   265   272
##           NBC           263  219  236  140  226  235  239  246   279   263   283
## Tuesday   ABC           244  181  231  205  385  283  345  192   329   351   364
##           CBS           173  180  184  109  218  235  256  250   274   263   261
##           NBC           315  254  280  241  370  214  195  111   188   190   210
## Wednesday ABC           233  161  194  156  339  264  279  140   237   228   203
##           CBS           158  126  207   59   98  103  122   86   109   105   110
##           NBC           134  146  166   66  194  230  264  143   274   289   306
## Thursday  ABC           174  183  197  181  187  198  211   86   110   122   117
##           CBS           196  185  195  104  106  116  116   47   102    84    84
##           NBC           515  463  472  477  590  473  446  349   649   705   747
## Friday    ABC           294  281  305  239  278  246  245  138   246   232   233
##           CBS           130  144  154   81  129  153  136  126   138   136   152
##           NBC           195  220  248  160  172  164  169   85   183   198   204
## 
## TV> doubledecker(TV)

## 
## TV> # reduce number of levels of Time
## TV> TV.df <- as.data.frame.table(TV)
## 
## TV> levels(TV.df$Time) <- rep(c("8:00-8:59", "9:00-9:59", "10:00-10:44"), c(4, 4, 3))
## 
## TV> TV2 <- xtabs(Freq ~ Day + Time + Network, TV.df)
## 
## TV> # re-label for mosaic display
## TV> levels(TV.df$Time) <- c("8", "9", "10")
## 
## TV> # fit mode of joint independence, showing association of Network with Day*Time
## TV> mosaic(~ Day + Network + Time, data = TV.df, expected = ~ Day:Time + Network, legend = FALSE)

## 
## TV> # with doubledecker arrangement
## TV> mosaic(~ Day + Network + Time, data = TV.df, expected = ~ Day:Time + Network,
## TV+   split = c(TRUE, TRUE, FALSE), spacing = spacing_highlighting, legend = FALSE)

p. 61 #2.3

Find the total number of cases contained in this table.

summary(UCBAdmissions)

## Number of cases in table: 4526 
## Number of factors: 3 
## Test for independence of all factors:
##  Chisq = 2000.3, df = 16, p-value = 0

For each department, find the total number of applicants.

margin.table(UCBAdmissions,3)

## Dept
##   A   B   C   D   E   F 
## 933 585 918 792 584 714

For each department, find the overall proportion of applicants who were admitted.

data1 = UCBAdmissions[,1,]+UCBAdmissions[,2,]
prop.table(data1,2)

##           Dept
## Admit               A          B          C          D          E
##   Admitted 0.64415863 0.63247863 0.35076253 0.33964646 0.25171233
##   Rejected 0.35584137 0.36752137 0.64923747 0.66035354 0.74828767
##           Dept
## Admit               F
##   Admitted 0.06442577
##   Rejected 0.93557423

Construct a tabular display of department (rows) and gender (columns), showing the proportion of applicants in each cell who were admitted relative to the total applicants in that cell.

data2 = aperm(UCBAdmissions, c(3,2,1))
prop.table(data2)

## , , Admit = Admitted
## 
##     Gender
## Dept        Male      Female
##    A 0.113124171 0.019664163
##    B 0.077993814 0.003756076
##    C 0.026513478 0.044631021
##    D 0.030490499 0.028943880
##    E 0.011710119 0.020768891
##    F 0.004860804 0.005302696
## 
## , , Admit = Rejected
## 
##     Gender
## Dept        Male      Female
##    A 0.069155988 0.004197967
##    B 0.045735749 0.001767565
##    C 0.045293858 0.086389748
##    D 0.061643836 0.053910738
##    E 0.030490499 0.066062749
##    F 0.077551922 0.070039770

p. 61 #2.4 a, c, e

Find the total number of cases represented in this table.

sum(DanishWelfare$Freq)

## [1] 5144

Convert this data frame to table form, DanishWelfare.tab, a 4-way array containing the frequencies with appropriate variable names and level names.

DanishWelfare_tab <- xtabs(Freq ~., data = DanishWelfare)
str(DanishWelfare_tab)

##  'xtabs' num [1:3, 1:4, 1:3, 1:5] 1 3 2 8 1 3 2 5 2 42 ...
##  - attr(*, "dimnames")=List of 4
##   ..$ Alcohol: chr [1:3] "<1" "1-2" ">2"
##   ..$ Income : chr [1:4] "0-50" "50-100" "100-150" ">150"
##   ..$ Status : chr [1:3] "Widow" "Married" "Unmarried"
##   ..$ Urban  : chr [1:5] "Copenhagen" "SubCopenhagen" "LargeCity" "City" ...
##  - attr(*, "call")= language xtabs(formula = Freq ~ ., data = DanishWelfare)

Use structable () or ftable () to produce a pleasing flattened display of the frequencies in the 4-way table. Choose the variables used as row and column variables to make it easier to compare levels of Alcohol across the other factors.

ftable(xtabs(Freq ~., data = DanishWelfare))

##                           Urban Copenhagen SubCopenhagen LargeCity City Country
## Alcohol Income  Status                                                         
## <1      0-50    Widow                    1             4         1    8       6
##                 Married                 14             8        41  100     175
##                 Unmarried                6             1         2    6       9
##         50-100  Widow                    8             2         7   14       5
##                 Married                 42            51        62  234     255
##                 Unmarried                7             5         9   20      27
##         100-150 Widow                    2             3         1    5       2
##                 Married                 21            30        23   87      77
##                 Unmarried                3             2         1   12       4
##         >150    Widow                   42            29        17   95      46
##                 Married                 24            30        50  167     232
##                 Unmarried               33            24        15   64      68
## 1-2     0-50    Widow                    3             0         1    4       2
##                 Married                 15             7        15   25      48
##                 Unmarried                2             3         9    9       7
##         50-100  Widow                    1             1         3    8       4
##                 Married                 39            59        68  172     143
##                 Unmarried               12             3        11   20      23
##         100-150 Widow                    5             4         1    9       4
##                 Married                 32            68        43  128      86
##                 Unmarried                6            10         5   21      15
##         >150    Widow                   26            34        14   48      24
##                 Married                 43            76        70  198     136
##                 Unmarried               36            23        48   89      64
## >2      0-50    Widow                    2             0         2    1       0
##                 Married                  1             2         2    7       7
##                 Unmarried                3             0         1    5       1
##         50-100  Widow                    3             0         2    1       3
##                 Married                 14            21        14   38      35
##                 Unmarried                2             0         3   12      13
##         100-150 Widow                    2             1         1    1       0
##                 Married                 20            31        10   36      21
##                 Unmarried                0             2         3    9       7
##         >150    Widow                   21            13         5   20       8
##                 Married                 23            47        21   53      36
##                 Unmarried               38            20        13   39      26

p. 62 #2.5 a, b, c

#code from text
data("UKSoccer", package = "vcd") 
ftable(UKSoccer)

##      Away  0  1  2  3  4
## Home                    
## 0         27 29 10  8  2
## 1         59 53 14 12  4
## 2         28 32 14 12  4
## 3         19 14  7  4  1
## 4          7  8 10  2  0

1. Verify that the total number of games represented in this table is 380.

sum(UKSoccer)

## [1] 380

Find the marginal total of the number of goals scored by each of the home and away teams.

margin.table(UKSoccer,1)

## Home
##   0   1   2   3   4 
##  76 142  90  45  27

margin.table(UKSoccer,2)

## Away
##   0   1   2   3   4 
## 140 136  55  38  11

Express each of the marginal totals as proportions.

prop.table(margin.table(UKSoccer,1))

## Home
##          0          1          2          3          4 
## 0.20000000 0.37368421 0.23684211 0.11842105 0.07105263

prop.table(margin.table(UKSoccer,2))

## Away
##          0          1          2          3          4 
## 0.36842105 0.35789474 0.14473684 0.10000000 0.02894737

Run the code below and notice there is a data frame entitled SpaceShuttle. Using the R help, read about the details of this data frame. That is, familiarize yourself with the context and understand the meaning of the different rows.

library(vcd)
library(vcdExtra)

ds <- datasets(package = c("vcd", "vcdExtra"))
str(ds)

## 'data.frame':    76 obs. of  5 variables:
##  $ Package: chr  "vcd" "vcd" "vcd" "vcd" ...
##  $ Item   : chr  "Arthritis" "Baseball" "BrokenMarriage" "Bundesliga" ...
##  $ class  : chr  "data.frame" "data.frame" "data.frame" "data.frame" ...
##  $ dim    : chr  "84x5" "322x25" "20x4" "14018x7" ...
##  $ Title  : chr  "Arthritis Treatment Data" "Baseball Data" "Broken Marriage Data" "Ergebnisse der Fussball-Bundesliga" ...

View(ds)

Using the structable() function, create a “flat” table that has the Damage Index on the columns and whether the O-ring failed and how many failures on the rows.

structable(Damage ~ Fail + nFailures, data = SpaceShuttle)

##                Damage  0  2  4 11
## Fail nFailures                   
## no   0                15  0  1  0
##      1                 0  0  0  0
##      2                 0  0  0  0
## yes  0                 0  0  0  0
##      1                 0  1  4  0
##      2                 0  0  1  1

Construct the same formatted table that you did in part a, but now use the xtabs() and ftable() functions.

ftable(Damage ~ Fail + nFailures, data = SpaceShuttle)

##                Damage  0  2  4 11
## Fail nFailures                   
## no   0                15  0  1  0
##      1                 0  0  0  0
##      2                 0  0  0  0
## yes  0                 0  0  0  0
##      1                 0  1  4  0
##      2                 0  0  1  1

Getting Familiar with Categorical Data in R

Nan Li

2019-11-13

2.1 p.p. 60-61