Homework 1a: Descriptive Statistics and Visualizations

Question 1 - Data

I’m interested in the effects of temperature on the behavior, ecology and physiology of amphibians. For one of my dissertation chapters, I’m comparing thermal optima for performance (the temperature at which an animal maximizes a performance metric, in this case energy assimilation) and critical thermal limits (the minimum and maximum temperatures at which an animal loses critical motor function) across the geographic range of a widespread species. My collection sites also contain 3 genetically distinct clades, so I can compare physiology among clades to determine whether evolutionary history constrains this species’ response to temperature.

Some variables of interest: 1) Thermal optimum for performance: quantitative, range = 0 - 30°C 2) Environmental temperature: quantitative, range depends on the chosen metric. I have daily ground temp data from 1980-2015 for each site. From this data, I have calculated mean daily max temp, mean daily min temp, mean daily temp, mean annual days below 5°C, mean annual days above 25° C, etc. 3) Clade: categorical, 3 clades 4) Site: categorical, 12 sites 5) Individual body metrics: mass (quantitative), snout-to-vent length (quantitative), tail length (quantitative), total length (quantitative), sex (categorical)

After estimating thermal optimum for performance from thermal performance curves (energy assimilation ~ temperature), I will model thermal optima as a response variable and environmental temp * clade as explanatory variables.

Question 2 - Making vectors

  1. using seq()
seq(2,100)
##  [1]   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
## [18]  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35
## [35]  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52
## [52]  53  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69
## [69]  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86
## [86]  87  88  89  90  91  92  93  94  95  96  97  98  99 100
  1. using the ‘:’ operator
c(2:100)
##  [1]   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
## [18]  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35
## [35]  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52
## [52]  53  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69
## [69]  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86
## [86]  87  88  89  90  91  92  93  94  95  96  97  98  99 100
  1. Using a combo of cumsum() and rep()
cumsum(rep(1,100))[-1]
##  [1]   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
## [18]  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35
## [35]  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52
## [52]  53  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69
## [69]  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86
## [86]  87  88  89  90  91  92  93  94  95  96  97  98  99 100

Question 3 - Sums

set.seed(26)
dat <- rnorm(10)

(A <- sum(dat))
## [1] 0.5174843
(B <- sum(dat^2))
## [1] 7.999122
(C <- sum(dat)/mean(dat)/length(dat))
## [1] 1
(D <- sum(dat - mean(dat))^2)
## [1] 3.774823e-32
(E <- B - length(dat) * mean(dat)^2)
## [1] 7.972343

The solutions of B and E are always equal.

Question 4 - Global patterns

## Loading required package: plyr
  1. Not shown
  2. 10 countries with the lowest and highest GDP per capita, the highest and lowest birth rates, and the lowest literacy
##                        lowestGDP
## 1  Congo, Democratic Republic of
## 2                        Liberia
## 3                        Burundi
## 4                       Zimbabwe
## 5                        Eritrea
## 6       Central African Republic
## 7                          Niger
## 8                   Sierra Leone
## 9                         Malawi
## 10                          Togo
##                                      highestGDP            lowestbirth
## 1                     Saint Pierre and Miquelon              Hong Kong
## 2                                    San Marino                  Japan
## 3                                  Sint Maarten                Germany
## 4                                       Somalia                Andorra
## 5  South Georgia and the South Sandwich Islands                  Italy
## 6                                   South Sudan                  Macau
## 7                                       Tokelau               Guernsey
## 8                      Turks and Caicos Islands                Austria
## 9                                        Tuvalu Bosnia and Herzegovina
## 10                                 Vatican City              Lithuania
##                                    highestbirth    lowestlit
## 1                       Palestinian territories         Mali
## 2                              Pitcairn Islands  South Sudan
## 3                              Saint Barth�lemy Burkina Faso
## 4                                  Saint Martin        Niger
## 5                                  Sint Maarten       Guinea
## 6  South Georgia and the South Sandwich Islands         Chad
## 7                                   South Sudan     Ethiopia
## 8                         S�o Tom� and Pr�ncipe Sierra Leone
## 9                                       Tokelau        Benin
## 10                                 Vatican City      Senegal

C/D) 10 countries with the highest densities

##         Country Population Area Water   Density
## 1       Bahrain    1234596  758  0.00  1628.755
## 2       Bermuda      64566   54  0.00  1195.667
## 3      Dominica    9378818  751  0.72 12488.439
## 4     Gibraltar      29441    6  0.00  4906.833
## 5     Hong Kong    7097600 1104  4.53  6428.986
## 6         Macau     556800   30  0.00 18560.000
## 7         Malta     417608  316  0.00  1321.544
## 8        Monaco      35000    2  0.00 17500.000
## 9     Singapore    5076700  710  1.43  7150.282
## 10 Sint Maarten      37429   34  0.00  1100.853

10 countries with the lowest densities

##             Country Population    Area Water    Density
## 1         Australia   22722835 7692024  0.76 2.95407750
## 2          Botswana    1800098  582000  2.58 3.09295189
## 3  Falkland Islands       3000   12173  0.00 0.24644705
## 4         Greenland      56452 2166086    NA 0.02606175
## 5           Iceland     318452  103000  2.67 3.09176699
## 6          Mongolia    2822900 1564100  0.68 1.80480788
## 7           Namibia    2088669  824268  0.12 2.53396832
## 8  Pitcairn Islands         50      47  0.00 1.06382979
## 9          Suriname     525000  163820  4.77 3.20473691
## 10   Western Sahara     531000  266000  0.00 1.99624060
  1. Histogram of birth rates in Europe (red), Asia (purple), and Africa (green)

  2. Histogram of birth rates in Europe (red), Asia (purple), and Africa (green) with fitted density line
  3. The density of birth rates are unimodal in Europe and Asia and bimodal in Africa (easier to visualize when Africa is plotted separately). Europe has a right-skewed distribution, Asia has a symmetric distribution, and Africa has a left-skewed distribution. Europe has a narrow spread; it’s birth rates do not exceed 20. Asia and Africa have greater spreads, with birth rates ranging from ~ 7 - 46 in Asian countries and ~ 12 - 50 in African countries.

Question 5 - Skewness

  1. Write a function to calculate skewness of a vector that’s robust to NA’s
Skew <- function(x) {
  x2 <- x[!is.na(x)]
  n <- length(x2)
  mn <- sum(x2)/n
  skew <- (sum((x2 - mn)^3)/n) / (sum((x2 - mn)^2)/n)^(3/2)
  return(skew)
}
  1. Skewness of global GDP, literacy and birth rates
## [1] 1.870465
## [1] -1.417821
## [1] 0.7485703

GDP has the greatest skew, and it’s skewed to the right. Literacy has the 2nd greatest skew and is skewed to the left. Birth rate is the least skewed and is skewed to the right.

These assessments correspond to the visualization, show in the histograms above. The strong right skew of GDP makes logical sense, knowing that the majority of wealth in the world is held by relatively few countries and individuals. The strong left skew is kind of surprising. I would have expected it to be less skewed, with a larger variance of literacy among countries. Birth rates match my real-world predictions.

  1. Skewness of continental GDP, literacy and birth rates
##       Continent  skew.GDP skew.literacy skew.birthrate
## 1        Africa 2.0451059    -0.4611914     -0.6254087
## 2          Asia 1.7775283    -1.3431745      0.4778668
## 3        Europe 1.2085067    -3.1042619      1.2721972
## 4 North America 1.6008635    -1.0741350      0.8174836
## 5       Oceania 1.3231944    -0.7961659      0.0174787
## 6 South America 0.1406809     0.1837546      0.9035542
## 7          <NA>       NaN           NaN            NaN

Interpretation:

  1. GDP - There’s a wide range in the strength of the skew for GDP, but all continents have a right skew. The difference is especially apparant between Africa, which has a very strong skew and South America, which has a very weak skew. This suggests that the majority of wealth in Africa belows to very few countries, while the wealth is distributed much more evenly among countries in South America.

  2. Literacy - There is a left skew in literacy for all of the continents, except South America. This time, we see a stark difference between South America and Europe, where South America has a weak right skew and Europe has a very strong left skew. Almost all European countries fall into the top tier of literacy, while South American countries have a wide range of literacy.

  3. Birth rate - There is a right skew in birth rate for all continents, except Africa. We looked into this pattern for Africa earlier (Q4).

Question 6: The Gapminder Map (extra credit)

## Warning in plot.xy(xy.coords(x, y), type = type, ...): "log" is not a
## graphical parameter

Homework 1b: Descriptive Statistics and Visualizations

Question 7 - Write summary Stats Function

getSummaryStatsByHand <- function(x,y, plotreg =TRUE, plotdiag = TRUE)
{n <- length(x)
  x.bar <- sum(x)/n
  y.bar <- sum(y)/n
  x.sd <- sd(x)
  y.sd <- sd(y)
  r <- cor(x, y)
  b <- r * (y.sd / x.sd)
  a <- y.bar - b * x.bar
  SS.total <- sum((y - y.bar)^2)
  SS.residual <- sum((y - (a + b * x))^2)
  SS.model <- SS.total - SS.residual
  r2 <- SS.model / SS.total
  
  if(plotreg)
  # plot the regression
  
  if(plotdiag)
  # plot the diagnostics
  
  return(list(x.bar = x.bar, y.bar = y.bar, 
              x.sd = x.sd, y.sd = y.sd, 
              r = r, a = a, b = b,
              SS.total = SS.total, SS.model = SS.model, SS.residual =
              SS.residual, r2 = r2))
}

Question 8 - Anscombes Quartet

##      x.bar    y.bar     x.sd     y.sd         r        a         b
## [1,]     9 7.500909 3.316625 2.031568 0.8164205 3.000091 0.5000909
## [2,]     9 7.500909 3.316625 2.031657 0.8162365 3.000909 0.5000000
## [3,]     9 7.500000 3.316625 2.030424 0.8162867 3.002455 0.4997273
## [4,]     9 7.500909 3.316625 2.030579 0.8165214 3.001727 0.4999091
##      SS.total SS.model SS.residual        r2
## [1,] 41.27269 27.51000    13.76269 0.6665425
## [2,] 41.27629 27.50000    13.77629 0.6662420
## [3,] 41.22620 27.47001    13.75619 0.6663240
## [4,] 41.23249 27.49000    13.74249 0.6667073

  1. There appears to be a linear relationship between x and y in the 1st dataset (red), a quadratic relationship between x and y in the 2nd (blue) and 3rd (orange) datasets, and no relationship between x and y in the 4th (purple) dataset. When plotting all 4 datasets together using only linear models, the regression lines are equal. Although the variables in these datasets actually have very different relationships, the solutions for regression coefficients are equal. However, t is not appropriate to use linear models for all 4 datasets given the nature of the relationships between x and y.

  2. It’s clear that the quartet datasets have similar summary stats, but appear different from one another on a graph. After Googling it, I found that the quartet is meant to illustrate the importance of graphical interpretation/data visualization prior to analyses.

  3. Bonus - I accidentally plotted the curves before the regression lines, so the plot is included above.

Question 9 - De Mere’s wagers

  1. I wrote a function, DeMereA, to generate n number of random rolls of 6-sided dice. It returns TRUE when at least one 6 is rolled and FALSE when 6 is not rolled.

  2. I used the replicate function to simulate 1000 rolls of de Mere’s game, which rolls 4 dice at a time. I then used the length function to count the number of times the function returned “TRUE”, aka the number of victories.

DeMereA <- function(n) {
  roll <- sample(1:6, size = n, replace = TRUE)
  if(6 %in% roll){
    return(TRUE)}
  else{return(FALSE)}}

roll.1000 <- replicate(1000, DeMereA(4))
table(roll.1000)
## roll.1000
## FALSE  TRUE 
##   503   497
  1. I wrote a function, DeMereB, to update the simulation from above to include 24 dice/game. I also changed the rules so that at least 1 in 24 die must roll double 6’s for a victory.

  2. I simulated this game 1000 times.

DeMereB <- function(n) {
  die1 <- replicate(24, sample(1:6, size = n, replace = TRUE))
  die2 <- replicate(24, sample(1:6, size = n, replace = TRUE))
  if(die1 == 6 & die2 == 6){
    return(TRUE)}
  else{return(FALSE)}}

rollB.1000 <- replicate(1000, DeMereB(1))
table(rollB.1000)
## rollB.1000
## FALSE  TRUE 
##   965    35

Question 10 - The Halloween Problem

## [1] "Outcome of die roll at Mr. Abel's"
## x.a
## 5 
## 1

## [1] "How much candy did the trick or treater get at Mrs. Bernoulli's?"
## x.b
## H T 
## 3 3
## [1] 3

## [1] "What's the probability of getting all 6 pieces of candy at Mrs. Bernoulli's?"
## [1] 0.015625
## [1] "How much candy does the trick or treater get at Dr. Cauchy's?"
## x.c
##  Club Heart Spade 
##     1     1     2
## [1] 3

## [1] "Number of candies that can be expected to be obtained from Abel"
## [1] "3.5 +- 3.5"
## [1] "Number of candies that can be expected to be obtained from Bernoulli"
## [1] "3 +- 1.5"
## [1] "Number of candies that can be expected to be obtained from Cauchy"
## [1] "3 +- 0.75"

Trick or Treaters can expect the most candy from Mr. Abel, and Dr. Cauchy will provide the most consistent amount of candy.