(\(\blacksquare\) indicates the end of the answer of a specific exercise)

First of all, let’s see the qanda data:

##     Hair    Birth Handspan Siblings Shoes Fish Height
## 1  Black     July     20.0        1    10    1    168
## 2 Blonde      May     20.0        0    40    1    186
## 3  Black  October     15.8        1    37    0    183
## 4  Black December     12.5        1     8    1    165
## 5  Brown    March     15.0        1    10    1    160
## 6  Black    March     18.3        0    15    1    165

We are now focusing on the relationship between Siblings and Handspan:

table (qanda$Siblings, qanda$Hair)
##    
##     Black Blonde Brown Grey Red
##   0    79      4    15    1   1
##   1    63      5    21    0   1
##   2    20      1     7    0   0
##   3     5      0     2    0   0
##   4     1      1     1    0   0
##   5     1      0     0    0   0
##   6     1      0     0    0   0
SHtable <- table (qanda$Siblings, qanda$Hair)

I made three tables:

round (prop.table (SHtable), 2)
##    
##     Black Blonde Brown Grey  Red
##   0  0.34   0.02  0.07 0.00 0.00
##   1  0.27   0.02  0.09 0.00 0.00
##   2  0.09   0.00  0.03 0.00 0.00
##   3  0.02   0.00  0.01 0.00 0.00
##   4  0.00   0.00  0.00 0.00 0.00
##   5  0.00   0.00  0.00 0.00 0.00
##   6  0.00   0.00  0.00 0.00 0.00
margin.table(SHtable, 1)
## 
##   0   1   2   3   4   5   6 
## 100  90  28   7   3   1   1
prop.table (margin.table (SHtable, 2))
## 
##       Black      Blonde       Brown        Grey         Red 
## 0.739130435 0.047826087 0.200000000 0.004347826 0.008695652

Exercise 2.1

- What do the tables from the last three R commands display?

- What do you think the 1 and 2 in the final two commands refer to?

- Run a single command in R that produces the proportion of students in the survey with four or more siblings.

  • the first table shows the value of each cell when they are divided by the grand sum

  • the second table shows the sum of all rows,

  • the third table shows: after constructing the table showing the column sums, I sum them up, and divide each individual cell (sum of each column) in the table by their sum.

  • 1: rows
  • 2: columns
  1. the proportion of students in the survey with four or more siblings
margin.table ( table (1:nrow(qanda),
                      qanda$Siblings) [,5:7]) / nrow(qanda)
## [1] 0.02173913

\(\blacksquare\)

Exercise 2.2.

- Calculate the values in location (1,1) for exp_table and resid_table using R and the STAT0002 formulae.

- Do the values of the residuals suggest that having black hair and no siblings are independent of each other?

calculate cell [1,1] - expected value

rowSums(SHtable)[1] * colSums (SHtable)[1] / sum(SHtable)[1]
##        0 
## 73.91304

calculate cell [1,1] - residual value

miu11 <- rowSums(SHtable)[1] * colSums (SHtable)[1] / sum(SHtable)[1]

( SHtable[1,1] - miu11) / sqrt( miu11 *
                                  (1 - rowSums (SHtable)[1] / sum(SHtable)) *
                                  (1 - colSums (SHtable)[1] / sum(SHtable))
                                )
##        0 
## 1.540913

If they are independent on each other, the standardized Pearson residuals are approximate to N(0,1).

–> Most of the residuals should lie in the range (-2, 2), in this case, using the rule of thumb, the black hair and no siblings are independent.

While looking at the values of other residuals, they appear to be dependent.

\(\blacksquare\)

Exercise 2.3.

Create and save a program in R which calculates the residuals’ table for Hair and Siblings.

- Do the values in the table suggest that Hair and Siblings are independent of each other?

- Which particular pairings, if any, show there may be an association?

- Can you think of a reason why we might have the strong association between black hair and no siblings?

For this question, we can use R to speed things up when making contingency table:

niplus <- rowSums (SHtable)
nplusj <- colSums (SHtable)

n <- sum(SHtable)

exp_table <- niplus %*% t(nplusj) /n
##           Black     Blonde Brown        Grey         Red
## [1,] 73.9130435 4.78260870  20.0 0.434782609 0.869565217
## [2,] 66.5217391 4.30434783  18.0 0.391304348 0.782608696
## [3,] 20.6956522 1.33913043   5.6 0.121739130 0.243478261
## [4,]  5.1739130 0.33478261   1.4 0.030434783 0.060869565
## [5,]  2.2173913 0.14347826   0.6 0.013043478 0.026086957
## [6,]  0.7391304 0.04782609   0.2 0.004347826 0.008695652
## [7,]  0.7391304 0.04782609   0.2 0.004347826 0.008695652
resid_num <- SHtable - exp_table
p_iplus <- rowSums (prop.table (SHtable))
p_plusj <- colSums (prop.table (SHtable))
resid_denom <- sqrt (exp_table * 
                       ( (1-p_iplus) %*%
                           t( 1 - p_plusj)
                         )
                     )
resid_table <- resid_num / resid_denom
##    
##           Black      Blonde       Brown        Grey         Red
##   0  1.54091264 -0.48780511 -1.66265543  1.14266218  0.18686634
##   1 -1.08358648  0.44043338  1.01330435 -0.80353244  0.31634883
##   2 -0.31946929 -0.32046855  0.70579286 -0.37312084 -0.52882846
##   3 -0.15202758 -0.60219174  0.57577622 -0.17755903 -0.25165646
##   4 -1.61119575  2.33258881  0.58115284 -0.11521098 -0.16328985
##   5  0.59538425 -0.22460554 -0.50109051 -0.06622599 -0.09386285
##   6  0.59538425 -0.22460554 -0.50109051 -0.06622599 -0.09386285

The table doesn’t suggest these two random variables are independent.

  • Black and Grey hair has strong relationship between 0 siblings

  • Blonde hair people tend to have 4 siblings

Exercise 2.4.

Take the 3 dimensional built-in dataset HairEyeColor from R, and use the margin.table() function to create a 2 dimensional table which just considers hair and eye color (you might need to look at the help() function for HairEyeColor and margin.table()). Create and run a program in R which works out the residuals table for the observed values table you have just created (you should be able to reuse most of the code from Exercise 2.3). Is there any evidence that the two types of color are independent? Use a if statement and the cat() function to output the result of that question.

hair_eye_table <- margin.table(HairEyeColor, c("Hair", "Eye"))

obs_table <- as.table(hair_eye_table)

pro_table <- prop.table (obs_table)

now I’m creating the expected - values table

n_iplus <- rowSums (obs_table)
n_plusj <- colSums (obs_table)
n <- sum (obs_table)

exp_table <- n_iplus %*% t (n_plusj) / n
exp_table
##          Brown      Blue    Hazel     Green
## [1,]  40.13514  39.22297 16.96622 11.675676
## [2,] 106.28378 103.86824 44.92905 30.918919
## [3,]  26.38514  25.78547 11.15372  7.675676
## [4,]  47.19595  46.12331 19.95101 13.729730

now I’m creating the residual table

resid_num <- obs_table - exp_table
p_iplus <- rowSums (pro_table)
p_plusj <- colSums (pro_table)

resid_denom <- sqrt( exp_table * (1 - p_iplus) %*% t (1 - p_plusj)
)
resid_table <- resid_num / resid_denom
resid_table
##        Eye
## Hair         Brown       Blue      Hazel      Green
##   Black  6.1365197 -4.2538155 -0.5750261 -2.2878960
##   Brown  2.1642824 -3.3978828  2.0502162 -0.5082630
##   Red   -0.1008242 -2.3110523  0.9895118  2.5765686
##   Blond -8.3282483  9.9675501 -2.7379770  0.7320232

now I’m converting the residual table into a matrix for comparison (using the rule of thumb)

resid_matrix <- as.matrix(resid_table)

if (any(resid_matrix > 2)) {
  cat("Two types of color are dependent.\n")
} else {
  cat("Two types of color are independent.\n")
}
## Two types of color are dependent.

\(\blacksquare\)

Or, I can use a chi-squared test:

chi_squared_test <- chisq.test(hair_eye_table)

p_value <- chi_squared_test$p.value

if (p_value < 0.05) {
  cat("There is evidence that hair and eye color are not independent (with p-value =", p_value, ")")
} else {
  cat("There is no evidence that hair and eye color are dependent (with p-value =", p_value, ")")
}
## There is evidence that hair and eye color are not independent (with p-value = 2.325287e-25 )

\(\blacksquare\)

Exercise 2.5.

Create the function ct.resids(). Test that it works correctly by running it against Siblings and Hair.

Try running the function against other pairs of data from the student survey.

Are there any other indications of associations between variables?

Use the function

ct.resids <- function (data_1, data_2) {
  obs_table <- table (data_1, data_2)
  pro_table <- prop.table (obs_table)
  
  n_iplus <- rowSums (obs_table)
  n_plusj <- colSums (obs_table)
  n <- sum (obs_table)
  
  exp_table <- n_iplus %*% t (n_plusj) / n
  
  resid_num <- obs_table - exp_table
  p_iplus <- rowSums (pro_table)
  p_plusj <- colSums (pro_table)
  
  resid_denom <- sqrt( exp_table * (1 - p_iplus) %*% t (1 - p_plusj)
  )
  resid_table <- resid_num / resid_denom
  resid_table
}

I compare Siblings and Hair

SHtable <- ct.resids (data_1 = qanda$Siblings, data_2 = qanda$Hair)
SHtable
##       data_2
## data_1       Black      Blonde       Brown        Grey         Red
##      0  1.54091264 -0.48780511 -1.66265543  1.14266218  0.18686634
##      1 -1.08358648  0.44043338  1.01330435 -0.80353244  0.31634883
##      2 -0.31946929 -0.32046855  0.70579286 -0.37312084 -0.52882846
##      3 -0.15202758 -0.60219174  0.57577622 -0.17755903 -0.25165646
##      4 -1.61119575  2.33258881  0.58115284 -0.11521098 -0.16328985
##      5  0.59538425 -0.22460554 -0.50109051 -0.06622599 -0.09386285
##      6  0.59538425 -0.22460554 -0.50109051 -0.06622599 -0.09386285

I compare Handspan and Shoes (I round the Handspan to the nearest 5)

Handspan5 <- round( qanda$Handspan / 5) * 5


HStable <- ct.resids (data_1 = Handspan5, data_2 = round(qanda$Shoes, -1))
HStable
##       data_2
## data_1           0          10          20          30          40          50
##     0  -0.52144070 -1.05590686  2.00436206 -0.19024563 -0.13332818 -0.09386285
##     5  -0.73904391 -0.07420904  1.06530203 -0.26963732 -0.18896757 -0.13303290
##     10 -0.56418967  0.41412405  0.55733675 -1.19906809 -0.84033242  1.40938296
##     15 -0.64420981  1.10720311 -1.66215606  2.20870820  0.27641244 -0.70053830
##     20  1.05938587 -1.34682535  1.00437639 -0.50843300  0.66810124 -1.69323241
##     25  0.32114997  0.40793991 -1.03779863 -0.67545248 -0.47337146  2.86042027
##     30 -0.52144070  0.95118883 -0.50109051 -0.19024563 -0.13332818 -0.09386285

I compare Handspan and Height (Handspan round to the nearest 5, and round Height to the nearest 10)

Handspan5 <- round (qanda$Handspan / 5) * 5
Height10 <- round (qanda$Height, -1)

HHtable <- ct.resids (data_1 = Handspan5, data_2 = Height10)
HHtable
##       data_2
## data_1         150         160         170         180         190         200
##     0  -0.14939633 -0.71096108 -0.60881786  1.48459664 -0.24529460 -0.06622599
##     5  -0.21174113  0.49728346 -0.86288456  0.57264458 -0.34765886 -0.09386285
##     10 -0.94160571  2.21140282 -0.06917187 -1.45963965 -0.74153344 -0.41740495
##     15 -1.11500818  1.03367614  0.32572516 -0.38959826 -1.11096047 -0.49427264
##     20  1.89649883 -1.19061532  0.48549691  0.21344666 -0.36559621  0.84070009
##     25 -0.53042018 -2.52421266 -1.49333650  2.07390608  4.26517509 -0.23513028
##     30 -0.14939633 -0.71096108  1.64969999 -0.67652505 -0.24529460 -0.06622599

Exercise 2.6.

Start a program to analyze the data in Country_data.csv (on Moodle). Import the data. If you examine the continent column you will see that some of the data are marked as NA (ie missing). That is because the data file uses the abbreviation NA for North America and R has misinterpreted it. Replace the NAs with NAM (hint: you might find the levels() function useful as the class of the continent data is factor).

Explore the data using summary statistics and plots. In particular, plot histograms of population and log(population).

cd <- read.csv ("Country_data.csv")

cd$continent[is.na(cd$continent)] <- "NAMe"
# is.na() creates a logical vector indicating which elements in the  are NA

cd$continent <- as.factor (cd$continent)

cd$continent
##  [1] E    AF   O    SA   AF   SA   A    NAMe SA   AF   O    E    NAMe AF   SA  
## [16] NAMe A    A    E    NAMe AF   A    O    NAMe E    O    O    AF   A    E   
## [31] SA   A    E    NAMe SA   O   
## Levels: A AF E NAMe O SA
cd
##                 country continent population gdp_head age_median kcals_day
## 1               Albania         E    3142000     1526       28.5      2917
## 2                Angola        AF   16618000      404       16.7      1916
## 3             Australia         O   20395000    23929       36.6      3138
## 4                Brazil        SA  183383000     3977       27.0      3082
## 5                  Chad        AF   10019000      308       16.9      2019
## 6                 Chile        SA   16267000     5979       30.6      2946
## 7                 China         A 1303719936     1464       32.1      2974
## 8                  Cuba      NAMe   11243000     3470       35.5      3300
## 9               Ecuador        SA   13250000     1562       24.0      2316
## 10                Egypt        AF   73044000     1600       22.4      3161
## 11                 Fiji         O     825000     2308       23.7      3025
## 12              Germany         E   82464000    23564       42.1      3524
## 13              Grenada      NAMe     103000     6366       23.3      2384
## 14        Guinea-Bissau        AF    1326000      154       18.8      2274
## 15               Guyana        SA     758000      989       26.1      2754
## 16                Haiti      NAMe    9410000      381       20.3      1825
## 17                India         A 1115318016      578       23.7      2235
## 18               Israel         A    6930000    19968       28.8      3573
## 19       Macedonia, FYR         E    2038500     1902       34.2      2869
## 20               Mexico      NAMe  103947000     5983       25.5      3223
## 21           Mozambique        AF   19420000      312       17.8      2055
## 22                Nepal         A   25343000      237       20.1      2355
## 23          New Zealand         O    4134000    15172       35.6      3147
## 24            Nicaragua      NAMe    5450000     1100       20.4      2401
## 25               Poland         E   38161000     5224       36.8      3377
## 26                Samoa         O     183000     1742       19.0      2883
## 27      Solomon Islands         O     490000      976       19.4      2435
## 28         South Africa        AF   47335000     3398       23.9      2965
## 29          South Korea         A   48138000    13802       35.0      3066
## 30                Spain         E   43398000    15701       38.7      3308
## 31             Suriname        SA     499000     2359       26.1      2482
## 32 United Arab Emirates         A    3995000    33288       29.5      3066
## 33       United Kingdom         E   60238000    28354       38.9      3431
## 34        United States      NAMe  295896000    37718       36.0      3796
## 35              Uruguay        SA    3306000     6967       32.7      2805
## 36              Vanuatu         O     219000     1356       19.4      2715
summary statistics
summary(cd)[, 3:6]
##    population           gdp_head       age_median      kcals_day   
##  Min.   :1.030e+05   Min.   :  154   Min.   :16.70   Min.   :1825  
##  1st Qu.:2.866e+06   1st Qu.: 1072   1st Qu.:20.38   1st Qu.:2397  
##  Median :1.225e+07   Median : 2334   Median :26.10   Median :2932  
##  Mean   :9.918e+07   Mean   : 7614   Mean   :27.39   Mean   :2826  
##  3rd Qu.:4.754e+07   3rd Qu.: 8676   3rd Qu.:34.40   3rd Qu.:3150  
##  Max.   :1.304e+09   Max.   :37718   Max.   :42.10   Max.   :3796
histogram of populaiton
hist (cd$population, prob = TRUE)

of log population

hist (log(cd$population), prob = TRUE)
lines(density(log(cd$population)), col = "blue")