This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:
3.4.1
Use the read.table function to read in the syphilis1981c.txt data table (found on git hub). Create a 3 dimensional array using both the table and the xtabs function.
std <- read.table('syphilis89c.txt', sep = ',', header = T)
str(std)
## 'data.frame': 44081 obs. of 3 variables:
## $ Sex : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 2 ...
## $ Race: Factor w/ 3 levels "Black","Other",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ Age : Factor w/ 8 levels "<=14",">55","15-19",..: 1 1 3 3 3 3 3 3 3 3 ...
head(std)
## Sex Race Age
## 1 Male White <=14
## 2 Male White <=14
## 3 Male White 15-19
## 4 Male White 15-19
## 5 Male White 15-19
## 6 Male White 15-19
lapply(std, table)
## $Sex
##
## Female Male
## 18075 26006
##
## $Race
##
## Black Other White
## 35508 3956 4617
##
## $Age
##
## <=14 >55 15-19 20-24 25-29 30-34 35-44 45-54
## 230 1278 4378 10405 9610 8648 6901 2631
table(std$Race, std$Age, std$Sex)
## , , = Female
##
##
## <=14 >55 15-19 20-24 25-29 30-34 35-44 45-54
## Black 165 92 2257 4503 3590 2628 1505 392
## Other 11 15 158 307 283 167 149 40
## White 14 24 253 475 433 316 243 55
##
## , , = Male
##
##
## <=14 >55 15-19 20-24 25-29 30-34 35-44 45-54
## Black 31 823 1412 4059 4121 4453 3858 1619
## Other 7 108 210 654 633 520 492 202
## White 2 216 88 407 550 564 654 323
xtabs(~ Race + Age + Sex, data = std)
## , , Sex = Female
##
## Age
## Race <=14 >55 15-19 20-24 25-29 30-34 35-44 45-54
## Black 165 92 2257 4503 3590 2628 1505 392
## Other 11 15 158 307 283 167 149 40
## White 14 24 253 475 433 316 243 55
##
## , , Sex = Male
##
## Age
## Race <=14 >55 15-19 20-24 25-29 30-34 35-44 45-54
## Black 31 823 1412 4059 4121 4453 3858 1619
## Other 7 108 210 654 633 520 492 202
## White 2 216 88 407 550 564 654 323
Now attach the data fram using the attach function. Create the same 3 dimensional arry using both the table or the xtabs function.
attach(std)
table(Race, Age, Sex)
## , , Sex = Female
##
## Age
## Race <=14 >55 15-19 20-24 25-29 30-34 35-44 45-54
## Black 165 92 2257 4503 3590 2628 1505 392
## Other 11 15 158 307 283 167 149 40
## White 14 24 253 475 433 316 243 55
##
## , , Sex = Male
##
## Age
## Race <=14 >55 15-19 20-24 25-29 30-34 35-44 45-54
## Black 31 823 1412 4059 4121 4453 3858 1619
## Other 7 108 210 654 633 520 492 202
## White 2 216 88 407 550 564 654 323
xtabs(~ Race + Age + Sex)
## , , Sex = Female
##
## Age
## Race <=14 >55 15-19 20-24 25-29 30-34 35-44 45-54
## Black 165 92 2257 4503 3590 2628 1505 392
## Other 11 15 158 307 283 167 149 40
## White 14 24 253 475 433 316 243 55
##
## , , Sex = Male
##
## Age
## Race <=14 >55 15-19 20-24 25-29 30-34 35-44 45-54
## Black 31 823 1412 4059 4121 4453 3858 1619
## Other 7 108 210 654 633 520 492 202
## White 2 216 88 407 550 564 654 323
detach(std)
3.4.2 Use the apply function to get the marginal totals for the syphilis 3 dimensional array
std_table <- xtabs(~Race + Age + Sex, data = std)
summary(std_table)
## Call: xtabs(formula = ~Race + Age + Sex, data = std)
## Number of cases in table: 44081
## Number of factors: 3
## Test for independence of all factors:
## Chisq = 3728, df = 37, p-value = 0
std_table
## , , Sex = Female
##
## Age
## Race <=14 >55 15-19 20-24 25-29 30-34 35-44 45-54
## Black 165 92 2257 4503 3590 2628 1505 392
## Other 11 15 158 307 283 167 149 40
## White 14 24 253 475 433 316 243 55
##
## , , Sex = Male
##
## Age
## Race <=14 >55 15-19 20-24 25-29 30-34 35-44 45-54
## Black 31 823 1412 4059 4121 4453 3858 1619
## Other 7 108 210 654 633 520 492 202
## White 2 216 88 407 550 564 654 323
age_race <- apply(std_table, c(1,2), sum)
age_race
## Age
## Race <=14 >55 15-19 20-24 25-29 30-34 35-44 45-54
## Black 196 915 3669 8562 7711 7081 5363 2011
## Other 18 123 368 961 916 687 641 242
## White 16 240 341 882 983 880 897 378
race_sex <- apply(std_table, c(1,3), sum)
race_sex
## Sex
## Race Female Male
## Black 15132 20376
## Other 1130 2826
## White 1813 2804
age_sex <- apply(std_table, c(2,3), sum)
age_sex
## Sex
## Age Female Male
## <=14 190 40
## >55 131 1147
## 15-19 2668 1710
## 20-24 5285 5120
## 25-29 4306 5304
## 30-34 3111 5537
## 35-44 1897 5004
## 45-54 487 2144
race <- apply(std_table, 1, sum); race
## Black Other White
## 35508 3956 4617
age <- apply(std_table, 2, sum); age
## <=14 >55 15-19 20-24 25-29 30-34 35-44 45-54
## 230 1278 4378 10405 9610 8648 6901 2631
sex <- apply(std_table, 3, sum); sex
## Female Male
## 18075 26006
3.4.3: Use the Sweep and the apply functions to get marginal and joint distributions for the syphylis 3 Dimensionl Array.
We start by showing the maginal tables for M/F distributed by Race and Age. I printed the original table for reference, so that we can see that the raw numbers. I also multipled the marginals by 100 and limited the digits to show them cleanly in percentage form.
std_table
## , , Sex = Female
##
## Age
## Race <=14 >55 15-19 20-24 25-29 30-34 35-44 45-54
## Black 165 92 2257 4503 3590 2628 1505 392
## Other 11 15 158 307 283 167 149 40
## White 14 24 253 475 433 316 243 55
##
## , , Sex = Male
##
## Age
## Race <=14 >55 15-19 20-24 25-29 30-34 35-44 45-54
## Black 31 823 1412 4059 4121 4453 3858 1619
## Other 7 108 210 654 633 520 492 202
## White 2 216 88 407 550 564 654 323
row_total <- apply(std_table, c(1,2), sum)
race_age_by_sex_marginals <- sweep(std_table, c(1,2), row_total, '/')
race_age_by_sex_marginals = race_age_by_sex_marginals * 100
round(race_age_by_sex_marginals, digits = 1)
## , , Sex = Female
##
## Age
## Race <=14 >55 15-19 20-24 25-29 30-34 35-44 45-54
## Black 84.2 10.1 61.5 52.6 46.6 37.1 28.1 19.5
## Other 61.1 12.2 42.9 31.9 30.9 24.3 23.2 16.5
## White 87.5 10.0 74.2 53.9 44.0 35.9 27.1 14.6
##
## , , Sex = Male
##
## Age
## Race <=14 >55 15-19 20-24 25-29 30-34 35-44 45-54
## Black 15.8 89.9 38.5 47.4 53.4 62.9 71.9 80.5
## Other 38.9 87.8 57.1 68.1 69.1 75.7 76.8 83.5
## White 12.5 90.0 25.8 46.1 56.0 64.1 72.9 85.4
We can then do the same for race and sex seperated by age.
row_total <- apply(std_table, c(1,3), sum)
race_sex_by_age_marginals <- sweep(std_table, c(1,3), row_total, '/')
race_sex_by_age_marginals = race_sex_by_age_marginals * 100
round(race_sex_by_age_marginals, digits = 1)
## , , Sex = Female
##
## Age
## Race <=14 >55 15-19 20-24 25-29 30-34 35-44 45-54
## Black 1.1 0.6 14.9 29.8 23.7 17.4 9.9 2.6
## Other 1.0 1.3 14.0 27.2 25.0 14.8 13.2 3.5
## White 0.8 1.3 14.0 26.2 23.9 17.4 13.4 3.0
##
## , , Sex = Male
##
## Age
## Race <=14 >55 15-19 20-24 25-29 30-34 35-44 45-54
## Black 0.2 4.0 6.9 19.9 20.2 21.9 18.9 7.9
## Other 0.2 3.8 7.4 23.1 22.4 18.4 17.4 7.1
## White 0.1 7.7 3.1 14.5 19.6 20.1 23.3 11.5
And Finally on the Sex and Age Seperated by Race
row_total <- apply(std_table, c(2,3), sum)
sex_age_by_race_marginals <- sweep(std_table, c(2,3), row_total, '/')
sex_age_by_race_marginals = sex_age_by_race_marginals * 100
round(sex_age_by_race_marginals, digits = 1)
## , , Sex = Female
##
## Age
## Race <=14 >55 15-19 20-24 25-29 30-34 35-44 45-54
## Black 86.8 70.2 84.6 85.2 83.4 84.5 79.3 80.5
## Other 5.8 11.5 5.9 5.8 6.6 5.4 7.9 8.2
## White 7.4 18.3 9.5 9.0 10.1 10.2 12.8 11.3
##
## , , Sex = Male
##
## Age
## Race <=14 >55 15-19 20-24 25-29 30-34 35-44 45-54
## Black 77.5 71.8 82.6 79.3 77.7 80.4 77.1 75.5
## Other 17.5 9.4 12.3 12.8 11.9 9.4 9.8 9.4
## White 5.0 18.8 5.1 7.9 10.4 10.2 13.1 15.1
3.4.4: Review and read hte grouplevel, tabular data set of the primary and secondary syphilis cases in the United States in 1989. Use the rep function on the data fram fields to recreaate the indiviudal-level data frame with over 40,000 observations.
std89b = read.csv("syph89b.txt", sep = ",", header = T)
head(std89b)
## Sex Race Age Freq
## 1 Male White <=14 2
## 2 Female White <=14 14
## 3 Male Black <=14 31
## 4 Female Black <=14 165
## 5 Male Other <=14 7
## 6 Female Other <=14 11
sex <- rep(std89b$Sex, std89b$Freq)
race <- rep(std89b$Race, std89b$Freq)
age <- rep(std89b$Age, std89b$Freq)
std89bdf <- data.frame(sex, race, age)
summary(std89bdf)
## sex race age
## Female:18075 Black:35508 20-24 :10405
## Male :26006 Other: 3956 25-29 : 9610
## White: 4617 30-34 : 8648
## 35-44 : 6901
## 15-19 : 4378
## 45-54 : 2631
## (Other): 1508
3.4.5 Working with population data estimates can be challenging becausue of the amount of data manipulation. Study the 200 population estimates for California Counties. Now study and implement this R code. For each expression or group of expressnions, explain in words what this R Code is doing. Be sure to display the intermediate objects to understand each step.
cty <- scan("calcounty.txt", what="")
head(cty)
## [1] "Alameda" "Alpine" "Amador" "Butte" "Calaveras" "Colusa"
calpop <- read.csv("CalCounties2000.txt", header = T)
head(calpop)
## County Year Sex Age White Hispanic Asian Pacific.Islander Black
## 1 1 2000 F 0 2751 2910 1921 63 1346
## 2 1 2000 F 1 2673 2837 1820 70 1343
## 3 1 2000 F 2 2654 2770 1930 64 1442
## 4 1 2000 F 3 2646 2798 1998 64 1455
## 5 1 2000 F 4 2780 2762 2029 85 1558
## 6 1 2000 F 5 2826 2738 1998 93 1632
## American.Indian Multirace
## 1 32 711
## 2 37 574
## 3 37 579
## 4 39 588
## 5 38 575
## 6 45 576
for(i in 1:length(cty)){
calpop$County[calpop$County==i] <- cty[i]
}
head(calpop)
## County Year Sex Age White Hispanic Asian Pacific.Islander Black
## 1 Alameda 2000 F 0 2751 2910 1921 63 1346
## 2 Alameda 2000 F 1 2673 2837 1820 70 1343
## 3 Alameda 2000 F 2 2654 2770 1930 64 1442
## 4 Alameda 2000 F 3 2646 2798 1998 64 1455
## 5 Alameda 2000 F 4 2780 2762 2029 85 1558
## 6 Alameda 2000 F 5 2826 2738 1998 93 1632
## American.Indian Multirace
## 1 32 711
## 2 37 574
## 3 37 579
## 4 39 588
## 5 38 575
## 6 45 576
This first series of commands does the following:
It imports a list of counties, and then a wider database that gives demographic information for the California Counties. Next, the for loop replaces County numbers with county names using the list of counties document we scanned in earlier. For example, in the later version of calpop, you replace 1 with Alameda County, the county that alphabetically comes first.
calpop$Agecat <- cut(calpop$Age, c(0,20,45,65,100), include.lowest = T, right = F); head(calpop)
## County Year Sex Age White Hispanic Asian Pacific.Islander Black
## 1 Alameda 2000 F 0 2751 2910 1921 63 1346
## 2 Alameda 2000 F 1 2673 2837 1820 70 1343
## 3 Alameda 2000 F 2 2654 2770 1930 64 1442
## 4 Alameda 2000 F 3 2646 2798 1998 64 1455
## 5 Alameda 2000 F 4 2780 2762 2029 85 1558
## 6 Alameda 2000 F 5 2826 2738 1998 93 1632
## American.Indian Multirace Agecat
## 1 32 711 [0,20)
## 2 37 574 [0,20)
## 3 37 579 [0,20)
## 4 39 588 [0,20)
## 5 38 575 [0,20)
## 6 45 576 [0,20)
This second set of commands creates a new column, Agecat, which categories age into values rather than an integer representation.
calpop$AsianPI <- calpop$Asian + calpop$Pacific.Islander; head(calpop)
## County Year Sex Age White Hispanic Asian Pacific.Islander Black
## 1 Alameda 2000 F 0 2751 2910 1921 63 1346
## 2 Alameda 2000 F 1 2673 2837 1820 70 1343
## 3 Alameda 2000 F 2 2654 2770 1930 64 1442
## 4 Alameda 2000 F 3 2646 2798 1998 64 1455
## 5 Alameda 2000 F 4 2780 2762 2029 85 1558
## 6 Alameda 2000 F 5 2826 2738 1998 93 1632
## American.Indian Multirace Agecat AsianPI
## 1 32 711 [0,20) 1984
## 2 37 574 [0,20) 1890
## 3 37 579 [0,20) 1994
## 4 39 588 [0,20) 2062
## 5 38 575 [0,20) 2114
## 6 45 576 [0,20) 2091
This combines Asian and Pacific Islanders into 1 new category, AsianPI
names(calpop)[c(6,9,10)] = c("Latino", "AfrAmer", "AmerInd"); head(calpop)
## County Year Sex Age White Latino Asian Pacific.Islander AfrAmer AmerInd
## 1 Alameda 2000 F 0 2751 2910 1921 63 1346 32
## 2 Alameda 2000 F 1 2673 2837 1820 70 1343 37
## 3 Alameda 2000 F 2 2654 2770 1930 64 1442 37
## 4 Alameda 2000 F 3 2646 2798 1998 64 1455 39
## 5 Alameda 2000 F 4 2780 2762 2029 85 1558 38
## 6 Alameda 2000 F 5 2826 2738 1998 93 1632 45
## Multirace Agecat AsianPI
## 1 711 [0,20) 1984
## 2 574 [0,20) 1890
## 3 579 [0,20) 1994
## 4 588 [0,20) 2062
## 5 575 [0,20) 2114
## 6 576 [0,20) 2091
This command shortens the names of some columns to make it easier to fit.
baindex <- calpop$County =="Alameda" | calpop$County == "San Francisco"
bapop <- calpop[baindex,]
head(bapop)
## County Year Sex Age White Latino Asian Pacific.Islander AfrAmer AmerInd
## 1 Alameda 2000 F 0 2751 2910 1921 63 1346 32
## 2 Alameda 2000 F 1 2673 2837 1820 70 1343 37
## 3 Alameda 2000 F 2 2654 2770 1930 64 1442 37
## 4 Alameda 2000 F 3 2646 2798 1998 64 1455 39
## 5 Alameda 2000 F 4 2780 2762 2029 85 1558 38
## 6 Alameda 2000 F 5 2826 2738 1998 93 1632 45
## Multirace Agecat AsianPI
## 1 711 [0,20) 1984
## 2 574 [0,20) 1890
## 3 579 [0,20) 1994
## 4 588 [0,20) 2062
## 5 575 [0,20) 2114
## 6 576 [0,20) 2091
This condenses the values to just SF country and Alameda County, representing some of the Bay Area.
agelabs <- names(table(bapop$Agecat))
sexlabs <- c("Female", "Male")
racen <- c("White", "AfrAmer", "AsianPI", "Latino", "Multirace", "AmerInd")
ctylabs <- names(table(bapop$County))
bapop2 <- aggregate(bapop[, racen], list(Agecat = bapop$Agecat, Sex = bapop$Sex, County = bapop$County), sum)
bapop2
## Agecat Sex County White AfrAmer AsianPI Latino Multirace
## 1 [0,20) F Alameda 58160 31765 40653 49738 10120
## 2 [20,45) F Alameda 112326 44437 72923 58553 7658
## 3 [45,65) F Alameda 82205 24948 33236 18534 2922
## 4 [65,100] F Alameda 49762 12834 16004 7548 1014
## 5 [0,20) M Alameda 61446 32277 42922 53097 10102
## 6 [20,45) M Alameda 115745 36976 69053 69233 6795
## 7 [45,65) M Alameda 81332 20737 29841 17402 2506
## 8 [65,100] M Alameda 33994 8087 11855 5416 711
## 9 [0,20) F San Francisco 14355 6986 23265 13251 2940
## 10 [20,45) F San Francisco 85766 10284 52479 23458 3656
## 11 [45,65) F San Francisco 35617 6890 31478 9184 1144
## 12 [65,100] F San Francisco 27215 5172 23044 5773 554
## 13 [0,20) M San Francisco 14881 6959 24541 14480 2851
## 14 [20,45) M San Francisco 105798 11111 48379 31605 3766
## 15 [45,65) M San Francisco 43694 7352 26404 8674 1220
## 16 [65,100] M San Francisco 20072 3329 17190 3428 450
## AmerInd
## 1 839
## 2 1401
## 3 822
## 4 246
## 5 828
## 6 1263
## 7 687
## 8 156
## 9 173
## 10 526
## 11 282
## 12 121
## 13 165
## 14 782
## 15 354
## 16 76
This takes the data and applies it to a new data frame, bapop2. We see in the final output Age, Sex, County, as well as other data for Alemeda and SF county.
tmp <- as.matrix(cbind(bapop2[1:4, racen], bapop2[5:8, racen], bapop2[9:12, racen], bapop2[13:16, racen]))
bapop3 <- array(tmp, c(4,6,2,2))
dimnames(bapop3) <- list(agelabs, racen, sexlabs, ctylabs)
bapop3
## , , Female, Alameda
##
## White AfrAmer AsianPI Latino Multirace AmerInd
## [0,20) 58160 31765 40653 49738 10120 839
## [20,45) 112326 44437 72923 58553 7658 1401
## [45,65) 82205 24948 33236 18534 2922 822
## [65,100] 49762 12834 16004 7548 1014 246
##
## , , Male, Alameda
##
## White AfrAmer AsianPI Latino Multirace AmerInd
## [0,20) 61446 32277 42922 53097 10102 828
## [20,45) 115745 36976 69053 69233 6795 1263
## [45,65) 81332 20737 29841 17402 2506 687
## [65,100] 33994 8087 11855 5416 711 156
##
## , , Female, San Francisco
##
## White AfrAmer AsianPI Latino Multirace AmerInd
## [0,20) 14355 6986 23265 13251 2940 173
## [20,45) 85766 10284 52479 23458 3656 526
## [45,65) 35617 6890 31478 9184 1144 282
## [65,100] 27215 5172 23044 5773 554 121
##
## , , Male, San Francisco
##
## White AfrAmer AsianPI Latino Multirace AmerInd
## [0,20) 14881 6959 24541 14480 2851 165
## [20,45) 105798 11111 48379 31605 3766 782
## [45,65) 43694 7352 26404 8674 1220 354
## [65,100] 20072 3329 17190 3428 450 76
This creates a final array, seperated along 4 values – Age, Ethnicity, Sex, and County of Residence.