R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

3.4.1

Use the read.table function to read in the syphilis1981c.txt data table (found on git hub). Create a 3 dimensional array using both the table and the xtabs function.

std <- read.table('syphilis89c.txt', sep = ',', header = T)
str(std)
## 'data.frame':    44081 obs. of  3 variables:
##  $ Sex : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Race: Factor w/ 3 levels "Black","Other",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ Age : Factor w/ 8 levels "<=14",">55","15-19",..: 1 1 3 3 3 3 3 3 3 3 ...
head(std)
##    Sex  Race   Age
## 1 Male White  <=14
## 2 Male White  <=14
## 3 Male White 15-19
## 4 Male White 15-19
## 5 Male White 15-19
## 6 Male White 15-19
lapply(std, table)
## $Sex
## 
## Female   Male 
##  18075  26006 
## 
## $Race
## 
## Black Other White 
## 35508  3956  4617 
## 
## $Age
## 
##  <=14   >55 15-19 20-24 25-29 30-34 35-44 45-54 
##   230  1278  4378 10405  9610  8648  6901  2631
table(std$Race, std$Age, std$Sex)
## , ,  = Female
## 
##        
##         <=14  >55 15-19 20-24 25-29 30-34 35-44 45-54
##   Black  165   92  2257  4503  3590  2628  1505   392
##   Other   11   15   158   307   283   167   149    40
##   White   14   24   253   475   433   316   243    55
## 
## , ,  = Male
## 
##        
##         <=14  >55 15-19 20-24 25-29 30-34 35-44 45-54
##   Black   31  823  1412  4059  4121  4453  3858  1619
##   Other    7  108   210   654   633   520   492   202
##   White    2  216    88   407   550   564   654   323
xtabs(~ Race + Age + Sex, data = std)
## , , Sex = Female
## 
##        Age
## Race    <=14  >55 15-19 20-24 25-29 30-34 35-44 45-54
##   Black  165   92  2257  4503  3590  2628  1505   392
##   Other   11   15   158   307   283   167   149    40
##   White   14   24   253   475   433   316   243    55
## 
## , , Sex = Male
## 
##        Age
## Race    <=14  >55 15-19 20-24 25-29 30-34 35-44 45-54
##   Black   31  823  1412  4059  4121  4453  3858  1619
##   Other    7  108   210   654   633   520   492   202
##   White    2  216    88   407   550   564   654   323

Now attach the data fram using the attach function. Create the same 3 dimensional arry using both the table or the xtabs function.

attach(std)
table(Race, Age, Sex)
## , , Sex = Female
## 
##        Age
## Race    <=14  >55 15-19 20-24 25-29 30-34 35-44 45-54
##   Black  165   92  2257  4503  3590  2628  1505   392
##   Other   11   15   158   307   283   167   149    40
##   White   14   24   253   475   433   316   243    55
## 
## , , Sex = Male
## 
##        Age
## Race    <=14  >55 15-19 20-24 25-29 30-34 35-44 45-54
##   Black   31  823  1412  4059  4121  4453  3858  1619
##   Other    7  108   210   654   633   520   492   202
##   White    2  216    88   407   550   564   654   323
xtabs(~ Race + Age + Sex)
## , , Sex = Female
## 
##        Age
## Race    <=14  >55 15-19 20-24 25-29 30-34 35-44 45-54
##   Black  165   92  2257  4503  3590  2628  1505   392
##   Other   11   15   158   307   283   167   149    40
##   White   14   24   253   475   433   316   243    55
## 
## , , Sex = Male
## 
##        Age
## Race    <=14  >55 15-19 20-24 25-29 30-34 35-44 45-54
##   Black   31  823  1412  4059  4121  4453  3858  1619
##   Other    7  108   210   654   633   520   492   202
##   White    2  216    88   407   550   564   654   323
detach(std)

3.4.2 Use the apply function to get the marginal totals for the syphilis 3 dimensional array

First, we create the tables

std_table <- xtabs(~Race + Age + Sex, data = std)
summary(std_table)
## Call: xtabs(formula = ~Race + Age + Sex, data = std)
## Number of cases in table: 44081 
## Number of factors: 3 
## Test for independence of all factors:
##  Chisq = 3728, df = 37, p-value = 0
std_table
## , , Sex = Female
## 
##        Age
## Race    <=14  >55 15-19 20-24 25-29 30-34 35-44 45-54
##   Black  165   92  2257  4503  3590  2628  1505   392
##   Other   11   15   158   307   283   167   149    40
##   White   14   24   253   475   433   316   243    55
## 
## , , Sex = Male
## 
##        Age
## Race    <=14  >55 15-19 20-24 25-29 30-34 35-44 45-54
##   Black   31  823  1412  4059  4121  4453  3858  1619
##   Other    7  108   210   654   633   520   492   202
##   White    2  216    88   407   550   564   654   323
age_race <- apply(std_table, c(1,2), sum)
age_race
##        Age
## Race    <=14 >55 15-19 20-24 25-29 30-34 35-44 45-54
##   Black  196 915  3669  8562  7711  7081  5363  2011
##   Other   18 123   368   961   916   687   641   242
##   White   16 240   341   882   983   880   897   378
race_sex <- apply(std_table, c(1,3), sum) 
race_sex
##        Sex
## Race    Female  Male
##   Black  15132 20376
##   Other   1130  2826
##   White   1813  2804
age_sex <- apply(std_table, c(2,3), sum)
age_sex
##        Sex
## Age     Female Male
##   <=14     190   40
##   >55      131 1147
##   15-19   2668 1710
##   20-24   5285 5120
##   25-29   4306 5304
##   30-34   3111 5537
##   35-44   1897 5004
##   45-54    487 2144
race <- apply(std_table, 1, sum); race
## Black Other White 
## 35508  3956  4617
age <- apply(std_table, 2, sum); age
##  <=14   >55 15-19 20-24 25-29 30-34 35-44 45-54 
##   230  1278  4378 10405  9610  8648  6901  2631
sex <- apply(std_table, 3, sum); sex
## Female   Male 
##  18075  26006

3.4.3: Use the Sweep and the apply functions to get marginal and joint distributions for the syphylis 3 Dimensionl Array.

We start by showing the maginal tables for M/F distributed by Race and Age. I printed the original table for reference, so that we can see that the raw numbers. I also multipled the marginals by 100 and limited the digits to show them cleanly in percentage form.

std_table
## , , Sex = Female
## 
##        Age
## Race    <=14  >55 15-19 20-24 25-29 30-34 35-44 45-54
##   Black  165   92  2257  4503  3590  2628  1505   392
##   Other   11   15   158   307   283   167   149    40
##   White   14   24   253   475   433   316   243    55
## 
## , , Sex = Male
## 
##        Age
## Race    <=14  >55 15-19 20-24 25-29 30-34 35-44 45-54
##   Black   31  823  1412  4059  4121  4453  3858  1619
##   Other    7  108   210   654   633   520   492   202
##   White    2  216    88   407   550   564   654   323
row_total <- apply(std_table, c(1,2), sum) 
race_age_by_sex_marginals <- sweep(std_table, c(1,2), row_total, '/')
race_age_by_sex_marginals = race_age_by_sex_marginals * 100
round(race_age_by_sex_marginals, digits = 1)
## , , Sex = Female
## 
##        Age
## Race    <=14  >55 15-19 20-24 25-29 30-34 35-44 45-54
##   Black 84.2 10.1  61.5  52.6  46.6  37.1  28.1  19.5
##   Other 61.1 12.2  42.9  31.9  30.9  24.3  23.2  16.5
##   White 87.5 10.0  74.2  53.9  44.0  35.9  27.1  14.6
## 
## , , Sex = Male
## 
##        Age
## Race    <=14  >55 15-19 20-24 25-29 30-34 35-44 45-54
##   Black 15.8 89.9  38.5  47.4  53.4  62.9  71.9  80.5
##   Other 38.9 87.8  57.1  68.1  69.1  75.7  76.8  83.5
##   White 12.5 90.0  25.8  46.1  56.0  64.1  72.9  85.4

We can then do the same for race and sex seperated by age.

row_total <- apply(std_table, c(1,3), sum) 
race_sex_by_age_marginals <- sweep(std_table, c(1,3), row_total, '/')
race_sex_by_age_marginals = race_sex_by_age_marginals * 100
round(race_sex_by_age_marginals, digits = 1)
## , , Sex = Female
## 
##        Age
## Race    <=14  >55 15-19 20-24 25-29 30-34 35-44 45-54
##   Black  1.1  0.6  14.9  29.8  23.7  17.4   9.9   2.6
##   Other  1.0  1.3  14.0  27.2  25.0  14.8  13.2   3.5
##   White  0.8  1.3  14.0  26.2  23.9  17.4  13.4   3.0
## 
## , , Sex = Male
## 
##        Age
## Race    <=14  >55 15-19 20-24 25-29 30-34 35-44 45-54
##   Black  0.2  4.0   6.9  19.9  20.2  21.9  18.9   7.9
##   Other  0.2  3.8   7.4  23.1  22.4  18.4  17.4   7.1
##   White  0.1  7.7   3.1  14.5  19.6  20.1  23.3  11.5

And Finally on the Sex and Age Seperated by Race

row_total <- apply(std_table, c(2,3), sum) 
sex_age_by_race_marginals <- sweep(std_table, c(2,3), row_total, '/')
sex_age_by_race_marginals = sex_age_by_race_marginals * 100
round(sex_age_by_race_marginals, digits = 1)
## , , Sex = Female
## 
##        Age
## Race    <=14  >55 15-19 20-24 25-29 30-34 35-44 45-54
##   Black 86.8 70.2  84.6  85.2  83.4  84.5  79.3  80.5
##   Other  5.8 11.5   5.9   5.8   6.6   5.4   7.9   8.2
##   White  7.4 18.3   9.5   9.0  10.1  10.2  12.8  11.3
## 
## , , Sex = Male
## 
##        Age
## Race    <=14  >55 15-19 20-24 25-29 30-34 35-44 45-54
##   Black 77.5 71.8  82.6  79.3  77.7  80.4  77.1  75.5
##   Other 17.5  9.4  12.3  12.8  11.9   9.4   9.8   9.4
##   White  5.0 18.8   5.1   7.9  10.4  10.2  13.1  15.1

3.4.4: Review and read hte grouplevel, tabular data set of the primary and secondary syphilis cases in the United States in 1989. Use the rep function on the data fram fields to recreaate the indiviudal-level data frame with over 40,000 observations.

std89b = read.csv("syph89b.txt", sep = ",", header = T)
head(std89b)
##      Sex  Race  Age Freq
## 1   Male White <=14    2
## 2 Female White <=14   14
## 3   Male Black <=14   31
## 4 Female Black <=14  165
## 5   Male Other <=14    7
## 6 Female Other <=14   11
sex <- rep(std89b$Sex, std89b$Freq)
race <- rep(std89b$Race, std89b$Freq)
age <- rep(std89b$Age, std89b$Freq)
std89bdf <- data.frame(sex, race, age)
summary(std89bdf)
##      sex           race            age       
##  Female:18075   Black:35508   20-24  :10405  
##  Male  :26006   Other: 3956   25-29  : 9610  
##                 White: 4617   30-34  : 8648  
##                               35-44  : 6901  
##                               15-19  : 4378  
##                               45-54  : 2631  
##                               (Other): 1508

3.4.5 Working with population data estimates can be challenging becausue of the amount of data manipulation. Study the 200 population estimates for California Counties. Now study and implement this R code. For each expression or group of expressnions, explain in words what this R Code is doing. Be sure to display the intermediate objects to understand each step.

cty <- scan("calcounty.txt", what="")
head(cty)
## [1] "Alameda"   "Alpine"    "Amador"    "Butte"     "Calaveras" "Colusa"
calpop <- read.csv("CalCounties2000.txt", header = T)

head(calpop)
##   County Year Sex Age White Hispanic Asian Pacific.Islander Black
## 1      1 2000   F   0  2751     2910  1921               63  1346
## 2      1 2000   F   1  2673     2837  1820               70  1343
## 3      1 2000   F   2  2654     2770  1930               64  1442
## 4      1 2000   F   3  2646     2798  1998               64  1455
## 5      1 2000   F   4  2780     2762  2029               85  1558
## 6      1 2000   F   5  2826     2738  1998               93  1632
##   American.Indian Multirace
## 1              32       711
## 2              37       574
## 3              37       579
## 4              39       588
## 5              38       575
## 6              45       576
for(i in 1:length(cty)){
  calpop$County[calpop$County==i] <- cty[i]
}

head(calpop)
##    County Year Sex Age White Hispanic Asian Pacific.Islander Black
## 1 Alameda 2000   F   0  2751     2910  1921               63  1346
## 2 Alameda 2000   F   1  2673     2837  1820               70  1343
## 3 Alameda 2000   F   2  2654     2770  1930               64  1442
## 4 Alameda 2000   F   3  2646     2798  1998               64  1455
## 5 Alameda 2000   F   4  2780     2762  2029               85  1558
## 6 Alameda 2000   F   5  2826     2738  1998               93  1632
##   American.Indian Multirace
## 1              32       711
## 2              37       574
## 3              37       579
## 4              39       588
## 5              38       575
## 6              45       576

This first series of commands does the following:

It imports a list of counties, and then a wider database that gives demographic information for the California Counties. Next, the for loop replaces County numbers with county names using the list of counties document we scanned in earlier. For example, in the later version of calpop, you replace 1 with Alameda County, the county that alphabetically comes first.

calpop$Agecat <- cut(calpop$Age, c(0,20,45,65,100), include.lowest = T, right = F); head(calpop)
##    County Year Sex Age White Hispanic Asian Pacific.Islander Black
## 1 Alameda 2000   F   0  2751     2910  1921               63  1346
## 2 Alameda 2000   F   1  2673     2837  1820               70  1343
## 3 Alameda 2000   F   2  2654     2770  1930               64  1442
## 4 Alameda 2000   F   3  2646     2798  1998               64  1455
## 5 Alameda 2000   F   4  2780     2762  2029               85  1558
## 6 Alameda 2000   F   5  2826     2738  1998               93  1632
##   American.Indian Multirace Agecat
## 1              32       711 [0,20)
## 2              37       574 [0,20)
## 3              37       579 [0,20)
## 4              39       588 [0,20)
## 5              38       575 [0,20)
## 6              45       576 [0,20)

This second set of commands creates a new column, Agecat, which categories age into values rather than an integer representation.

calpop$AsianPI <- calpop$Asian + calpop$Pacific.Islander; head(calpop)
##    County Year Sex Age White Hispanic Asian Pacific.Islander Black
## 1 Alameda 2000   F   0  2751     2910  1921               63  1346
## 2 Alameda 2000   F   1  2673     2837  1820               70  1343
## 3 Alameda 2000   F   2  2654     2770  1930               64  1442
## 4 Alameda 2000   F   3  2646     2798  1998               64  1455
## 5 Alameda 2000   F   4  2780     2762  2029               85  1558
## 6 Alameda 2000   F   5  2826     2738  1998               93  1632
##   American.Indian Multirace Agecat AsianPI
## 1              32       711 [0,20)    1984
## 2              37       574 [0,20)    1890
## 3              37       579 [0,20)    1994
## 4              39       588 [0,20)    2062
## 5              38       575 [0,20)    2114
## 6              45       576 [0,20)    2091

This combines Asian and Pacific Islanders into 1 new category, AsianPI

names(calpop)[c(6,9,10)] = c("Latino", "AfrAmer", "AmerInd"); head(calpop)
##    County Year Sex Age White Latino Asian Pacific.Islander AfrAmer AmerInd
## 1 Alameda 2000   F   0  2751   2910  1921               63    1346      32
## 2 Alameda 2000   F   1  2673   2837  1820               70    1343      37
## 3 Alameda 2000   F   2  2654   2770  1930               64    1442      37
## 4 Alameda 2000   F   3  2646   2798  1998               64    1455      39
## 5 Alameda 2000   F   4  2780   2762  2029               85    1558      38
## 6 Alameda 2000   F   5  2826   2738  1998               93    1632      45
##   Multirace Agecat AsianPI
## 1       711 [0,20)    1984
## 2       574 [0,20)    1890
## 3       579 [0,20)    1994
## 4       588 [0,20)    2062
## 5       575 [0,20)    2114
## 6       576 [0,20)    2091

This command shortens the names of some columns to make it easier to fit.

baindex <- calpop$County =="Alameda" | calpop$County == "San Francisco"
bapop <- calpop[baindex,]
head(bapop)
##    County Year Sex Age White Latino Asian Pacific.Islander AfrAmer AmerInd
## 1 Alameda 2000   F   0  2751   2910  1921               63    1346      32
## 2 Alameda 2000   F   1  2673   2837  1820               70    1343      37
## 3 Alameda 2000   F   2  2654   2770  1930               64    1442      37
## 4 Alameda 2000   F   3  2646   2798  1998               64    1455      39
## 5 Alameda 2000   F   4  2780   2762  2029               85    1558      38
## 6 Alameda 2000   F   5  2826   2738  1998               93    1632      45
##   Multirace Agecat AsianPI
## 1       711 [0,20)    1984
## 2       574 [0,20)    1890
## 3       579 [0,20)    1994
## 4       588 [0,20)    2062
## 5       575 [0,20)    2114
## 6       576 [0,20)    2091

This condenses the values to just SF country and Alameda County, representing some of the Bay Area.

agelabs <- names(table(bapop$Agecat))
sexlabs <- c("Female", "Male")
racen <- c("White", "AfrAmer", "AsianPI", "Latino", "Multirace", "AmerInd")
ctylabs <- names(table(bapop$County))

bapop2 <- aggregate(bapop[, racen], list(Agecat = bapop$Agecat, Sex = bapop$Sex, County = bapop$County), sum)
bapop2
##      Agecat Sex        County  White AfrAmer AsianPI Latino Multirace
## 1    [0,20)   F       Alameda  58160   31765   40653  49738     10120
## 2   [20,45)   F       Alameda 112326   44437   72923  58553      7658
## 3   [45,65)   F       Alameda  82205   24948   33236  18534      2922
## 4  [65,100]   F       Alameda  49762   12834   16004   7548      1014
## 5    [0,20)   M       Alameda  61446   32277   42922  53097     10102
## 6   [20,45)   M       Alameda 115745   36976   69053  69233      6795
## 7   [45,65)   M       Alameda  81332   20737   29841  17402      2506
## 8  [65,100]   M       Alameda  33994    8087   11855   5416       711
## 9    [0,20)   F San Francisco  14355    6986   23265  13251      2940
## 10  [20,45)   F San Francisco  85766   10284   52479  23458      3656
## 11  [45,65)   F San Francisco  35617    6890   31478   9184      1144
## 12 [65,100]   F San Francisco  27215    5172   23044   5773       554
## 13   [0,20)   M San Francisco  14881    6959   24541  14480      2851
## 14  [20,45)   M San Francisco 105798   11111   48379  31605      3766
## 15  [45,65)   M San Francisco  43694    7352   26404   8674      1220
## 16 [65,100]   M San Francisco  20072    3329   17190   3428       450
##    AmerInd
## 1      839
## 2     1401
## 3      822
## 4      246
## 5      828
## 6     1263
## 7      687
## 8      156
## 9      173
## 10     526
## 11     282
## 12     121
## 13     165
## 14     782
## 15     354
## 16      76

This takes the data and applies it to a new data frame, bapop2. We see in the final output Age, Sex, County, as well as other data for Alemeda and SF county.

tmp <- as.matrix(cbind(bapop2[1:4, racen], bapop2[5:8, racen], bapop2[9:12, racen], bapop2[13:16, racen]))

bapop3 <- array(tmp, c(4,6,2,2))
dimnames(bapop3) <- list(agelabs, racen, sexlabs, ctylabs)
bapop3
## , , Female, Alameda
## 
##           White AfrAmer AsianPI Latino Multirace AmerInd
## [0,20)    58160   31765   40653  49738     10120     839
## [20,45)  112326   44437   72923  58553      7658    1401
## [45,65)   82205   24948   33236  18534      2922     822
## [65,100]  49762   12834   16004   7548      1014     246
## 
## , , Male, Alameda
## 
##           White AfrAmer AsianPI Latino Multirace AmerInd
## [0,20)    61446   32277   42922  53097     10102     828
## [20,45)  115745   36976   69053  69233      6795    1263
## [45,65)   81332   20737   29841  17402      2506     687
## [65,100]  33994    8087   11855   5416       711     156
## 
## , , Female, San Francisco
## 
##          White AfrAmer AsianPI Latino Multirace AmerInd
## [0,20)   14355    6986   23265  13251      2940     173
## [20,45)  85766   10284   52479  23458      3656     526
## [45,65)  35617    6890   31478   9184      1144     282
## [65,100] 27215    5172   23044   5773       554     121
## 
## , , Male, San Francisco
## 
##           White AfrAmer AsianPI Latino Multirace AmerInd
## [0,20)    14881    6959   24541  14480      2851     165
## [20,45)  105798   11111   48379  31605      3766     782
## [45,65)   43694    7352   26404   8674      1220     354
## [65,100]  20072    3329   17190   3428       450      76

This creates a final array, seperated along 4 values – Age, Ethnicity, Sex, and County of Residence.