Recipe 2: Example of Descriptive Statistics

Recipes for the Design of Experiments: Recipe Outline

As of August 28, 2014, superseding the version of August 24. Always use the most recent version.

Design of Experiments

Trevor Manzanares

Rensselaer Polytechnic Institute

9/26/14

1. Setting

System under test

Choose one of the large datasets listed on the Realtime Board (e.g., babynames or nasaweather)
Make sure you have > 1000 data points.

What is the problem that you were given?

remove(list=ls())
library("babynames", lib.loc="~/R/win-library/3.1")
#get information on babynames dataset
??babynames
## starting httpd help server ... done
#name dataset
x<-babynames
#view first few lines
head(x)
##   year sex      name    n    prop
## 1 1880   F      Mary 7065 0.07238
## 2 1880   F      Anna 2604 0.02668
## 3 1880   F      Emma 2003 0.02052
## 4 1880   F Elizabeth 1939 0.01987
## 5 1880   F    Minnie 1746 0.01789
## 6 1880   F  Margaret 1578 0.01617
#observe the structure of the data, i.e., how many variables and of what type
str(x)
## Classes 'tbl_df', 'tbl' and 'data.frame':    1792091 obs. of  5 variables:
##  $ year: num  1880 1880 1880 1880 1880 1880 1880 1880 1880 1880 ...
##  $ sex : chr  "F" "F" "F" "F" ...
##  $ name: chr  "Mary" "Anna" "Emma" "Elizabeth" ...
##  $ n   : int  7065 2604 2003 1939 1746 1578 1472 1414 1320 1288 ...
##  $ prop: num  0.0724 0.0267 0.0205 0.0199 0.0179 ...
#subset data for computational purposes. We will only look at babies named Mary.
x=subset(x,name=="Mary")
x
##         year sex name     n      prop
## 1       1880   F Mary  7065 7.238e-02
## 1274    1880   M Mary    27 2.280e-04
## 2001    1881   F Mary  6919 6.999e-02
## 3239    1881   M Mary    29 2.678e-04
## 3936    1882   F Mary  8148 7.042e-02
## 5278    1882   M Mary    30 2.458e-04
## 6063    1883   F Mary  8012 6.673e-02
## 7408    1883   M Mary    32 2.845e-04
## 8147    1884   F Mary  9217 6.699e-02
## 9611    1884   M Mary    36 2.933e-04
## 10444   1885   F Mary  9128 6.430e-02
## 11911   1885   M Mary    38 3.277e-04
## 12738   1886   F Mary  9890 6.433e-02
## 14324   1886   M Mary    32 2.688e-04
## 15130   1887   F Mary  9888 6.362e-02
## 16672   1887   M Mary    47 4.300e-04
## 17503   1888   F Mary 11754 6.204e-02
## 19224   1888   M Mary    50 3.849e-04
## 20154   1889   F Mary 11648 6.156e-02
## 21894   1889   M Mary    41 3.444e-04
## 22744   1890   F Mary 12078 5.989e-02
## 24570   1890   M Mary    35 2.924e-04
## 25439   1891   F Mary 11703 5.954e-02
## 27230   1891   M Mary    39 3.569e-04
## 28099   1892   F Mary 13174 5.857e-02
## 30010   1892   M Mary    50 3.804e-04
## 31020   1893   F Mary 12784 5.676e-02
## 32903   1893   M Mary    55 4.544e-04
## 33851   1894   F Mary 13151 5.573e-02
## 35803   1894   M Mary    48 3.843e-04
## 36792   1895   F Mary 13446 5.441e-02
## 38860   1895   M Mary    47 3.711e-04
## 39841   1896   F Mary 13811 5.481e-02
## 41907   1896   M Mary    57 4.416e-04
## 42932   1897   F Mary 13413 5.402e-02
## 44941   1897   M Mary    63 5.166e-04
## 45960   1898   F Mary 14406 5.255e-02
## 48185   1898   M Mary    50 3.785e-04
## 49224   1899   F Mary 13172 5.322e-02
## 51301   1899   M Mary    50 4.340e-04
## 52266   1900   F Mary 16707 5.257e-02
## 54723   1900   M Mary    75 4.625e-04
## 55998   1901   F Mary 13136 5.167e-02
## 58171   1901   M Mary    58 5.017e-04
## 59151   1902   F Mary 14486 5.167e-02
## 61441   1902   M Mary    56 4.218e-04
## 62513   1903   F Mary 14275 5.131e-02
## 64825   1903   M Mary    63 4.871e-04
## 65902   1904   F Mary 14962 5.116e-02
## 68319   1904   M Mary    58 4.187e-04
## 69463   1905   F Mary 16067 5.185e-02
## 71936   1905   M Mary    68 4.747e-04
## 73119   1906   F Mary 16370 5.223e-02
## 75563   1906   M Mary    76 5.275e-04
## 76752   1907   F Mary 17580 5.210e-02
## 79402   1907   M Mary    70 4.414e-04
## 80700   1908   F Mary 18666 5.265e-02
## 83369   1908   M Mary    79 4.748e-04
## 84718   1909   F Mary 19258 5.232e-02
## 87513   1909   M Mary    80 4.523e-04
## 88945   1910   F Mary 22847 5.446e-02
## 91972   1910   M Mary    99 4.748e-04
## 93574   1911   F Mary 24390 5.521e-02
## 96703   1911   M Mary    98 4.060e-04
## 98441   1912   F Mary 32302 5.506e-02
## 102235  1912   M Mary   118 2.614e-04
## 104792  1913   F Mary 36641 5.595e-02
## 108875  1913   M Mary   125 2.331e-04
## 111760  1914   F Mary 45344 5.692e-02
## 116388  1914   M Mary   127 1.859e-04
## 119723  1915   F Mary 58187 5.683e-02
## 125120  1915   M Mary   159 1.805e-04
## 129082  1916   F Mary 61436 5.659e-02
## 134680  1916   M Mary   164 1.776e-04
## 138778  1917   F Mary 64280 5.720e-02
## 144538  1917   M Mary   159 1.657e-04
## 148693  1918   F Mary 67372 5.603e-02
## 154735  1918   M Mary   169 1.611e-04
## 159093  1919   F Mary 65835 5.605e-02
## 165134  1919   M Mary   155 1.527e-04
## 169462  1920   F Mary 70980 5.706e-02
## 175657  1920   M Mary   195 1.771e-04
## 180219  1921   F Mary 73981 5.781e-02
## 186539  1921   M Mary   187 1.643e-04
## 191075  1922   F Mary 72172 5.785e-02
## 197313  1922   M Mary   186 1.653e-04
## 201833  1923   F Mary 71631 5.720e-02
## 207997  1923   M Mary   203 1.793e-04
## 212478  1924   F Mary 73520 5.674e-02
## 218789  1924   M Mary   223 1.908e-04
## 223347  1925   F Mary 70600 5.590e-02
## 229487  1925   M Mary   261 2.267e-04
## 233989  1926   F Mary 67828 5.514e-02
## 239960  1926   M Mary   272 2.375e-04
## 244449  1927   F Mary 70628 5.713e-02
## 250401  1927   M Mary   282 2.428e-04
## 254857  1928   F Mary 66862 5.594e-02
## 260623  1928   M Mary   294 2.577e-04
## 265019  1929   F Mary 63507 5.487e-02
## 270596  1929   M Mary   327 2.953e-04
## 274831  1930   F Mary 64131 5.499e-02
## 280386  1930   M Mary   340 3.011e-04
## 284618  1931   F Mary 60296 5.464e-02
## 289898  1931   M Mary   326 3.049e-04
## 293908  1932   F Mary 59866 5.412e-02
## 299311  1932   M Mary   330 3.073e-04
## 303289  1933   F Mary 55490 5.306e-02
## 308445  1933   M Mary   316 3.099e-04
## 312300  1934   F Mary 56911 5.259e-02
## 317584  1934   M Mary   303 2.854e-04
## 321480  1935   F Mary 55067 5.068e-02
## 326697  1935   M Mary   276 2.581e-04
## 330512  1936   F Mary 54362 5.046e-02
## 335680  1936   M Mary   293 2.754e-04
## 339407  1937   F Mary 55639 5.050e-02
## 344653  1937   M Mary   288 2.634e-04
## 348353  1938   F Mary 56208 4.925e-02
## 353676  1938   M Mary   284 2.500e-04
## 357378  1939   F Mary 54898 4.841e-02
## 362670  1939   M Mary   272 2.401e-04
## 366297  1940   F Mary 56197 4.758e-02
## 371648  1940   M Mary   303 2.555e-04
## 375258  1941   F Mary 58026 4.658e-02
## 380672  1941   M Mary   315 2.511e-04
## 384342  1942   F Mary 63242 4.549e-02
## 390084  1942   M Mary   276 1.960e-04
## 393767  1943   F Mary 66164 4.610e-02
## 399514  1943   M Mary   254 1.747e-04
## 403173  1944   F Mary 62468 4.572e-02
## 408796  1944   M Mary   252 1.815e-04
## 412327  1945   F Mary 59286 4.404e-02
## 417996  1945   M Mary   198 1.444e-04
## 421353  1946   F Mary 67464 4.183e-02
## 427509  1946   M Mary   186 1.127e-04
## 431058  1947   F Mary 71679 3.943e-02
## 437658  1947   M Mary   184 9.907e-05
## 441431  1948   F Mary 68601 3.936e-02
## 447984  1948   M Mary   163 9.142e-05
## 451668  1949   F Mary 66844 3.809e-02
## 458235  1949   M Mary   170 9.443e-05
## 461924  1950   F Mary 65454 3.722e-02
## 468661  1950   M Mary   120 6.597e-05
## 472232  1951   F Mary 65677 3.557e-02
## 478979  1951   M Mary   160 8.377e-05
## 482691  1952   F Mary 65677 3.453e-02
## 489634  1952   M Mary   158 8.009e-05
## 493339  1953   F Mary 64338 3.336e-02
## 500428  1953   M Mary   146 7.302e-05
## 504169  1954   F Mary 67994 3.416e-02
## 511342  1954   M Mary   169 8.174e-05
## 515130  1955   F Mary 63165 3.151e-02
## 522458  1955   M Mary   149 7.135e-05
## 526244  1956   F Mary 61753 2.999e-02
## 533765  1956   M Mary   143 6.668e-05
## 537584  1957   F Mary 61086 2.912e-02
## 545166  1957   M Mary   184 8.412e-05
## 549148  1958   F Mary 55846 2.705e-02
## 556821  1958   M Mary   141 6.549e-05
## 560667  1959   F Mary 54474 2.621e-02
## 568457  1959   M Mary   170 7.847e-05
## 572434  1960   F Mary 51479 2.475e-02
## 580371  1960   M Mary   169 7.802e-05
## 584358  1961   F Mary 47660 2.296e-02
## 592496  1961   M Mary   167 7.747e-05
## 596533  1962   F Mary 43494 2.146e-02
## 604727  1962   M Mary   158 7.516e-05
## 608736  1963   F Mary 41550 2.090e-02
## 617066  1963   M Mary   136 6.585e-05
## 621015  1964   F Mary 40984 2.094e-02
## 629482  1964   M Mary   134 6.609e-05
## 633406  1965   F Mary 34272 1.876e-02
## 641579  1965   M Mary   136 7.174e-05
## 645359  1966   F Mary 28883 1.645e-02
## 653686  1966   M Mary   108 5.940e-05
## 657508  1967   F Mary 25314 1.475e-02
## 666017  1967   M Mary   127 7.135e-05
## 669908  1968   F Mary 21721 1.271e-02
## 678839  1968   M Mary   101 5.686e-05
## 682838  1969   F Mary 19853 1.126e-02
## 692293  1969   M Mary   108 5.900e-05
## 696590  1970   F Mary 19200 1.048e-02
## 706740  1970   M Mary    99 5.195e-05
## 711364  1971   F Mary 16698 9.530e-03
## 721865  1971   M Mary    86 4.729e-05
## 726653  1972   F Mary 13763 8.535e-03
## 737230  1972   M Mary    74 4.418e-05
## 742065  1973   F Mary 12322 7.929e-03
## 752968  1973   M Mary    54 3.345e-05
## 757739  1974   F Mary 11752 7.504e-03
## 769022  1974   M Mary    61 3.741e-05
## 773987  1975   F Mary 10964 7.026e-03
## 785585  1975   M Mary    66 4.066e-05
## 790919  1976   F Mary 10325 6.569e-03
## 802864  1976   M Mary    62 3.796e-05
## 808311  1977   F Mary 10661 6.482e-03
## 820842  1977   M Mary    52 3.042e-05
## 826483  1978   F Mary 10050 6.115e-03
## 839052  1978   M Mary    61 3.570e-05
## 844705  1979   F Mary 10552 6.125e-03
## 857750  1979   M Mary    65 3.629e-05
## 863731  1980   F Mary 11473 6.446e-03
## 877164  1980   M Mary    52 2.804e-05
## 883168  1981   F Mary 11037 6.174e-03
## 896442  1981   M Mary    67 3.599e-05
## 902634  1982   F Mary 10849 5.983e-03
## 915993  1982   M Mary    73 3.870e-05
## 922314  1983   F Mary  9894 5.531e-03
## 935553  1983   M Mary    57 3.061e-05
## 941709  1984   F Mary  9288 5.153e-03
## 955074  1984   M Mary    56 2.986e-05
## 961209  1985   F Mary  9238 5.006e-03
## 974815  1985   M Mary    67 3.484e-05
## 981288  1986   F Mary  8504 4.611e-03
## 995533  1986   M Mary    47 2.448e-05
## 1001923 1987   F Mary  8394 4.481e-03
## 1016293 1987   M Mary    69 3.541e-05
## 1023315 1988   F Mary  8508 4.427e-03
## 1038693 1988   M Mary    48 2.399e-05
## 1045673 1989   F Mary  8640 4.338e-03
## 1061435 1989   M Mary    74 3.532e-05
## 1069433 1990   F Mary  8662 4.218e-03
## 1086397 1990   M Mary    45 2.092e-05
## 1094146 1991   F Mary  8756 4.307e-03
## 1111841 1991   M Mary    32 1.510e-05
## 1119248 1992   F Mary  8447 4.215e-03
## 1137604 1992   M Mary    25 1.191e-05
## 1144663 1993   F Mary  8103 4.112e-03
## 1164011 1993   M Mary    18 8.719e-06
## 1170614 1994   F Mary  7741 3.973e-03
## 1192071 1994   M Mary    10 4.908e-06
## 1196612 1995   F Mary  7424 3.865e-03
## 1217026 1995   M Mary    13 6.466e-06
## 1222695 1996   F Mary  6937 3.620e-03
## 1242854 1996   M Mary    15 7.490e-06
## 1249117 1997   F Mary  6623 3.471e-03
## 1270078 1997   M Mary    13 6.511e-06
## 1276081 1998   F Mary  6423 3.315e-03
## 1297425 1998   M Mary    14 6.909e-06
## 1303971 1999   F Mary  6356 3.267e-03
## 1326768 1999   M Mary    11 5.398e-06
## 1332516 2000   F Mary  6179 3.099e-03
## 1356766 2000   M Mary    10 4.792e-06
## 1362281 2001   F Mary  5722 2.891e-03
## 1385756 2001   M Mary    13 6.290e-06
## 1392542 2002   F Mary  5449 2.761e-03
## 1416510 2002   M Mary    12 5.811e-06
## 1423106 2003   F Mary  5003 2.496e-03
## 1449087 2003   M Mary     9 4.287e-06
## 1454281 2004   F Mary  4801 2.382e-03
## 1476293 2004   M Mary    30 1.421e-05
## 1486324 2005   F Mary  4445 2.193e-03
## 1512891 2005   M Mary    10 4.706e-06
## 1518860 2006   F Mary  4080 1.954e-03
## 1546618 2006   M Mary    10 4.567e-06
## 1552938 2007   F Mary  3670 1.737e-03
## 1581349 2007   M Mary    10 4.521e-06
## 1587866 2008   F Mary  3488 1.678e-03
## 1622910 2009   F Mary  3152 1.560e-03
## 1657585 2010   F Mary  2860 1.463e-03
## 1691615 2011   F Mary  2700 1.398e-03
## 1725483 2012   F Mary  2559 1.326e-03
## 1756514 2012   M Mary     6 2.971e-06
## 1759140 2013   F Mary  2602 1.363e-03
attach(x)
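
A side note on attach(): it places a copy of the columns on the search path, so later changes to the data frame are not seen through the bare column names. The sketch below illustrates this with the first four rows of the Mary subset printed above; the object name mary is hypothetical and is used only to keep the example separate from x.

```r
# First four rows of the Mary subset, copied from the printed output above
mary <- data.frame(
  year = c(1880, 1880, 1881, 1881),
  sex  = c("F", "M", "F", "M"),
  n    = c(7065, 27, 6919, 29),
  stringsAsFactors = FALSE
)

attach(mary)                       # copies the columns onto the search path
mary$sex <- as.factor(mary$sex)    # changes the data frame only

attached_class <- class(sex)       # the attached copy is still "character"
frame_class    <- class(mary$sex)  # the data frame column is now "factor"

detach(mary)                       # remove the attached copy again
```

Calls that pass data=x read from the data frame, while calls that use bare names such as n or year read from the attached copy.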

Factors and Levels

5 variables: year (134 levels, 1880 to 2013), sex (2 levels), name (many levels in the full dataset, a single level after subsetting to Mary), n (response), prop (continuous)
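
Level counts like these can be checked directly rather than by eye. The following is a minimal sketch that uses the six rows shown by head(x) earlier; the names first_rows and level_counts are hypothetical, and on the real data the same sapply() call could be applied to x.

```r
# The six rows printed by head(x), retyped so the example is self-contained
first_rows <- data.frame(
  year = rep(1880, 6),
  sex  = rep("F", 6),
  name = c("Mary", "Anna", "Emma", "Elizabeth", "Minnie", "Margaret"),
  n    = c(7065, 2604, 2003, 1939, 1746, 1578),
  stringsAsFactors = FALSE
)

# Number of distinct values in each candidate factor column
level_counts <- sapply(first_rows[c("year", "sex", "name")],
                       function(column) length(unique(column)))
level_counts

# Number of calendar years from 1880 to 2013, inclusive
n_years <- length(1880:2013)
n_years
```

In these six rows there is one year, one sex, and six names; the inclusive range 1880 to 2013 spans 134 years.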

#view summary statistics
summary(x)
##       year          sex                name                 n        
##  Min.   :1880   Length:263         Length:263         Min.   :    6  
##  1st Qu.:1912   Class :character   Class :character   1st Qu.:   80  
##  Median :1945   Mode  :character   Mode  :character   Median : 2700  
##  Mean   :1945                                         Mean   :15694  
##  3rd Qu.:1978                                         3rd Qu.:16534  
##  Max.   :2013                                         Max.   :73981  
##       prop        
##  Min.   :0.00000  
##  1st Qu.:0.00016  
##  Median :0.00140  
##  Mean   :0.01727  
##  3rd Qu.:0.04294  
##  Max.   :0.07238

Continuous variables (if any)

n (int) is the number of babies with a particular name.
prop (num) is n divided by the total number of applicants in that year.

Response variables

n is the number of babies given each particular name.

The Data: How is it organized and what does it look like?

The data are tabulated into 5 columns, with little missing data. Variables are stored as numeric, character, and integer types.

Randomization

The data were collected by the US Social Security Administration each year from 1880 to 2013.
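
Since prop is defined as n divided by the total number of applicants, the total can be recovered approximately from any one row. This is a small worked sketch using the first printed row (Mary, F, 1880); the variable names are hypothetical, and the result is only approximate because the printed prop is rounded.

```r
# Values copied from the first row of the printed data: 1880, F, Mary
n_mary_f_1880    <- 7065
prop_mary_f_1880 <- 0.07238   # rounded as printed, so the result is approximate

# prop = n / total, so total = n / prop
total_f_1880 <- n_mary_f_1880 / prop_mary_f_1880
round(total_f_1880)
```

The implied total is roughly 97,600 for that row.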

2. (Experimental) Design

How will the experiment be organized and conducted to test the hypothesis?

Since the dataset is so large, it will need to be subset in order to make computational analysis feasible. Analysis of variance will be used to determine whether the factors sex or year have an effect on the number of babies given the name. Factor interactions and blocking will also be considered.

What is the rationale for this design?

It is possible that individual factors by themselves have an effect on the number, and it is also possible that a combination of factors has an effect on the number.

Randomize: What is the Randomization Scheme?

The data were collected by the US Social Security Administration each year from 1880 to 2013. They are observational records, so no randomization scheme was applied.

Replicate: Are there replicates and/or repeated measures?

There are replicates of each name, and there are repeated measures, as the SSA received Social Security card applications every year.

Block: Did you use blocking in the design?

Yes, one model was fitted with blocking to determine whether each of the factors had an effect by itself.
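
The additive (blocking) model and the interaction model described above differ only in the terms of their formulas. As a minimal sketch, base R's terms() can list what each formula expands to without fitting anything; the names additive_formula and interaction_formula are hypothetical.

```r
# Formulas only; no data are needed to inspect their terms
additive_formula    <- n ~ year + sex   # main effects only
interaction_formula <- n ~ year * sex   # main effects plus their interaction

additive_terms    <- attr(terms(additive_formula), "term.labels")
interaction_terms <- attr(terms(interaction_formula), "term.labels")

additive_terms      # "year" "sex"
interaction_terms   # "year" "sex" "year:sex"
```

The `*` operator is shorthand for both main effects plus the `year:sex` term, which is the extra row that appears in the interaction ANOVA table.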

3. (Statistical) Analysis

(Exploratory Data Analysis) Graphics and descriptive summary

par(mfrow=c(1,1))
boxplot(n~year)

[Figure: boxplot of n by year]

boxplot(n~sex)

[Figure: boxplot of n by sex]

Testing

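#convert to factors
#caution: this assigns the year values to name; year itself remains numeric,
#so it enters the models below as a linear covariate with 1 Df
x$name=as.factor(x$year)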
x$sex=as.factor(x$sex)

#run analysis of variance for individual factor effects
model1=(aov(n~year,data=x))
anova(model1)
## Analysis of Variance Table
## 
## Response: n
##            Df   Sum Sq  Mean Sq F value Pr(>F)  
## year        1 2.03e+09 2.03e+09    3.68  0.056 .
## Residuals 261 1.44e+11 5.52e+08                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#based on year alone, we fail to reject the H0 that year has no effect on n for all babies named Mary (p = 0.056 > 0.05).

model2=(aov(n~sex, data=x))
anova(model2)
## Analysis of Variance Table
## 
## Response: n
##            Df   Sum Sq  Mean Sq F value Pr(>F)    
## sex         1 6.14e+10 6.14e+10     189 <2e-16 ***
## Residuals 261 8.46e+10 3.24e+08                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#based on sex alone, we reject the H0 that sex has no effect on n for all babies named Mary.

#run analysis of variance using blocking
block=aov(n~year+sex, data=x)
anova(block)
## Analysis of Variance Table
## 
## Response: n
##            Df   Sum Sq  Mean Sq F value Pr(>F)    
## year        1 2.03e+09 2.03e+09    6.45  0.012 *  
## sex         1 6.22e+10 6.22e+10  197.74 <2e-16 ***
## Residuals 260 8.18e+10 3.15e+08                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#in this particular model, we reject the H0 that each factor independently has no effect on n for all babies named Mary. Both year and sex have an effect on the n for all babies named Mary.

#run analysis of variance with interaction
interaction=aov(n~year*sex, data=x)
anova(interaction)
## Analysis of Variance Table
## 
## Response: n
##            Df   Sum Sq  Mean Sq F value Pr(>F)    
## year        1 2.03e+09 2.03e+09    6.62 0.0106 *  
## sex         1 6.22e+10 6.22e+10  203.18 <2e-16 ***
## year:sex    1 2.50e+09 2.50e+09    8.15 0.0046 ** 
## Residuals 259 7.93e+10 3.06e+08                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#when considering interaction, all factors jointly have an effect on n for all babies named Mary.
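
The year row in each table above has 1 degree of freedom, which indicates that year entered the models as a numeric covariate rather than as a factor. The sketch below shows the difference on simulated data; the data frame toy and its values are invented purely for illustration and are not taken from babynames.

```r
set.seed(1)  # simulated data, for illustration only

# Five years, four observations per year, two of each sex
toy <- data.frame(
  year = rep(2001:2005, each = 4),
  sex  = factor(rep(c("F", "M"), times = 10))
)
toy$n <- 100 + 5 * (toy$year - 2001) + 50 * (toy$sex == "F") + rnorm(20)

# Numeric year: a single slope is estimated, so 1 degree of freedom
df_numeric <- anova(aov(n ~ year, data = toy))$Df[1]

# Factor year: one mean per level, so (levels - 1) = 4 degrees of freedom
df_factor <- anova(aov(n ~ factor(year), data = toy))$Df[1]

c(numeric = df_numeric, factor = df_factor)
```

With 134 distinct years, treating year as a factor would use 133 degrees of freedom instead of 1, so the two choices answer different questions: a linear trend over time versus arbitrary year-to-year differences.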

Based on the results of these analyses of variance, we reject the H0 that sex has no effect on the number of babies named Mary (p < 2e-16 in every model). For year, we fail to reject the H0 when it is the only term (p = 0.056), but we reject it once sex is included in the model (p = 0.012). Said differently, variation in the number of babies named Mary can be explained by something other than random chance. We also reject the H0 of no interaction between sex and year (p = 0.0046), so the effect of year on the response variable n differs between the sexes. Finally, we must check normality to ensure these results are valid.
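
The reject or fail-to-reject decisions can be written out explicitly by comparing each p-value from the tables above with a significance level. This is a minimal sketch that assumes alpha = 0.05, which the report does not state explicitly; the p-values are copied from the printed ANOVA tables, with the reported bound 2e-16 standing in for "<2e-16".

```r
alpha <- 0.05  # assumed significance level; not stated explicitly in the report

# p-values copied from the printed ANOVA tables above
p_values <- c(
  year_alone        = 0.056,   # model1: n ~ year
  sex_alone         = 2e-16,   # model2: n ~ sex (reported as < 2e-16)
  year_with_sex     = 0.012,   # block:  n ~ year + sex
  year_sex_interact = 0.0046   # interaction: year:sex term
)

# TRUE means the null hypothesis of no effect is rejected at level alpha
reject_h0 <- p_values < alpha
reject_h0
```

Under this assumption only the year-alone test fails to reject; the interaction term is rejected along with the other two.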

Estimation (of Parameters)

# Shapiro-Wilk test of normality (H0: the data are normal).  Adequate if p > 0.1
shapiro.test(year[1:4999])
## 
##  Shapiro-Wilk normality test
## 
## data:  year[1:4999]
## W = 0.9561, p-value = 3.82e-07

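One of the primary assumptions of the analysis of variance is that the residuals are normally distributed. The Shapiro-Wilk test above rejects normality (p = 3.82e-07), but it was applied to year, a predictor, rather than to the model residuals, so the results of the analysis of variance should be treated with caution rather than as confirmed.

Diagnostics/Model Adequacy Checking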
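
Because the normality assumption in an analysis of variance concerns the residuals, one option is to run the Shapiro-Wilk test on residuals() of the fitted model. The sketch below does this on simulated data; the data frame toy, the model object fit, and all values are hypothetical, and no result for the babynames models is implied.

```r
set.seed(42)  # simulated data, for illustration only

toy <- data.frame(
  year = rep(1:30, times = 2),
  sex  = factor(rep(c("F", "M"), each = 30))
)
toy$n <- 10 + 2 * toy$year + 5 * (toy$sex == "F") + rnorm(60)

# Fit the same model structure as the interaction model in the report
fit <- aov(n ~ year * sex, data = toy)

# Test the residuals, not a predictor or the raw response
sw <- shapiro.test(residuals(fit))
sw$p.value  # a small p-value would be evidence against normal residuals
```

On the real data the analogous call would be shapiro.test(residuals(interaction)), using the interaction model fitted earlier.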

qqnorm(residuals(interaction))

[Figure: normal Q-Q plot of the interaction model residuals]

#arguments are x.factor, trace.factor, response
interaction.plot(year,sex,n)

[Figure: interaction plot of n against year, traced by sex]

plot(fitted(interaction),residuals(interaction))

[Figure: residuals versus fitted values for the interaction model]

4. References to the literature

??babynames

http://cran.r-project.org/web/packages/babynames/index.html

5. Appendices

A summary of, or pointer to, the raw data

complete and documented R code

library("babynames", lib.loc="~/R/win-library/3.1")