(\(\blacksquare\) indicates the end of the answer of a specific exercise)
First of all, let’s see the qanda data:
## Hair Birth Handspan Siblings Shoes Fish Height
## 1 Black July 20.0 1 10 1 168
## 2 Blonde May 20.0 0 40 1 186
## 3 Black October 15.8 1 37 0 183
## 4 Black December 12.5 1 8 1 165
## 5 Brown March 15.0 1 10 1 160
## 6 Black March 18.3 0 15 1 165
We are now focusing on the relationship between Siblings and Handspan:
table (qanda$Siblings, qanda$Hair)
##
## Black Blonde Brown Grey Red
## 0 79 4 15 1 1
## 1 63 5 21 0 1
## 2 20 1 7 0 0
## 3 5 0 2 0 0
## 4 1 1 1 0 0
## 5 1 0 0 0 0
## 6 1 0 0 0 0
SHtable <- table (qanda$Siblings, qanda$Hair)
I made three tables:
round (prop.table (SHtable), 2)
##
## Black Blonde Brown Grey Red
## 0 0.34 0.02 0.07 0.00 0.00
## 1 0.27 0.02 0.09 0.00 0.00
## 2 0.09 0.00 0.03 0.00 0.00
## 3 0.02 0.00 0.01 0.00 0.00
## 4 0.00 0.00 0.00 0.00 0.00
## 5 0.00 0.00 0.00 0.00 0.00
## 6 0.00 0.00 0.00 0.00 0.00
margin.table(SHtable, 1)
##
## 0 1 2 3 4 5 6
## 100 90 28 7 3 1 1
prop.table (margin.table (SHtable, 2))
##
## Black Blonde Brown Grey Red
## 0.739130435 0.047826087 0.200000000 0.004347826 0.008695652
the first table shows the value of each cell when they are divided by the grand sum
the second table shows the sum of all rows,
the third table shows: after constructing the table showing the column sums, I sum them up, and divide each individual cell (sum of each column) in the table by their sum.
margin.table ( table (1:nrow(qanda),
qanda$Siblings) [,5:7]) / nrow(qanda)
## [1] 0.02173913
\(\blacksquare\)
calculate cell [1,1] - expected value
rowSums(SHtable)[1] * colSums (SHtable)[1] / sum(SHtable)[1]
## 0
## 73.91304
calculate cell [1,1] - residual value
miu11 <- rowSums(SHtable)[1] * colSums (SHtable)[1] / sum(SHtable)[1]
( SHtable[1,1] - miu11) / sqrt( miu11 *
(1 - rowSums (SHtable)[1] / sum(SHtable)) *
(1 - colSums (SHtable)[1] / sum(SHtable))
)
## 0
## 1.540913
If they are independent on each other, the standardized Pearson residuals are approximate to N(0,1).
–> Most of the residuals should lie in the range (-2, 2), in this case, using the rule of thumb, the black hair and no siblings are independent.
While looking at the values of other residuals, they appear to be dependent.
\(\blacksquare\)
For this question, we can use R to speed things up when making contingency table:
niplus <- rowSums (SHtable)
nplusj <- colSums (SHtable)
n <- sum(SHtable)
exp_table <- niplus %*% t(nplusj) /n
## Black Blonde Brown Grey Red
## [1,] 73.9130435 4.78260870 20.0 0.434782609 0.869565217
## [2,] 66.5217391 4.30434783 18.0 0.391304348 0.782608696
## [3,] 20.6956522 1.33913043 5.6 0.121739130 0.243478261
## [4,] 5.1739130 0.33478261 1.4 0.030434783 0.060869565
## [5,] 2.2173913 0.14347826 0.6 0.013043478 0.026086957
## [6,] 0.7391304 0.04782609 0.2 0.004347826 0.008695652
## [7,] 0.7391304 0.04782609 0.2 0.004347826 0.008695652
resid_num <- SHtable - exp_table
p_iplus <- rowSums (prop.table (SHtable))
p_plusj <- colSums (prop.table (SHtable))
resid_denom <- sqrt (exp_table *
( (1-p_iplus) %*%
t( 1 - p_plusj)
)
)
resid_table <- resid_num / resid_denom
##
## Black Blonde Brown Grey Red
## 0 1.54091264 -0.48780511 -1.66265543 1.14266218 0.18686634
## 1 -1.08358648 0.44043338 1.01330435 -0.80353244 0.31634883
## 2 -0.31946929 -0.32046855 0.70579286 -0.37312084 -0.52882846
## 3 -0.15202758 -0.60219174 0.57577622 -0.17755903 -0.25165646
## 4 -1.61119575 2.33258881 0.58115284 -0.11521098 -0.16328985
## 5 0.59538425 -0.22460554 -0.50109051 -0.06622599 -0.09386285
## 6 0.59538425 -0.22460554 -0.50109051 -0.06622599 -0.09386285
The table doesn’t suggest these two random variables are independent.
Black and Grey hair has strong relationship between 0 siblings
Blonde hair people tend to have 4 siblings
hair_eye_table <- margin.table(HairEyeColor, c("Hair", "Eye"))
obs_table <- as.table(hair_eye_table)
pro_table <- prop.table (obs_table)
now I’m creating the expected - values table
n_iplus <- rowSums (obs_table)
n_plusj <- colSums (obs_table)
n <- sum (obs_table)
exp_table <- n_iplus %*% t (n_plusj) / n
exp_table
## Brown Blue Hazel Green
## [1,] 40.13514 39.22297 16.96622 11.675676
## [2,] 106.28378 103.86824 44.92905 30.918919
## [3,] 26.38514 25.78547 11.15372 7.675676
## [4,] 47.19595 46.12331 19.95101 13.729730
now I’m creating the residual table
resid_num <- obs_table - exp_table
p_iplus <- rowSums (pro_table)
p_plusj <- colSums (pro_table)
resid_denom <- sqrt( exp_table * (1 - p_iplus) %*% t (1 - p_plusj)
)
resid_table <- resid_num / resid_denom
resid_table
## Eye
## Hair Brown Blue Hazel Green
## Black 6.1365197 -4.2538155 -0.5750261 -2.2878960
## Brown 2.1642824 -3.3978828 2.0502162 -0.5082630
## Red -0.1008242 -2.3110523 0.9895118 2.5765686
## Blond -8.3282483 9.9675501 -2.7379770 0.7320232
now I’m converting the residual table into a matrix for comparison (using the rule of thumb)
resid_matrix <- as.matrix(resid_table)
if (any(resid_matrix > 2)) {
cat("Two types of color are dependent.\n")
} else {
cat("Two types of color are independent.\n")
}
## Two types of color are dependent.
\(\blacksquare\)
Or, I can use a chi-squared test:
chi_squared_test <- chisq.test(hair_eye_table)
p_value <- chi_squared_test$p.value
if (p_value < 0.05) {
cat("There is evidence that hair and eye color are not independent (with p-value =", p_value, ")")
} else {
cat("There is no evidence that hair and eye color are dependent (with p-value =", p_value, ")")
}
## There is evidence that hair and eye color are not independent (with p-value = 2.325287e-25 )
\(\blacksquare\)
Use the function
ct.resids <- function (data_1, data_2) {
obs_table <- table (data_1, data_2)
pro_table <- prop.table (obs_table)
n_iplus <- rowSums (obs_table)
n_plusj <- colSums (obs_table)
n <- sum (obs_table)
exp_table <- n_iplus %*% t (n_plusj) / n
resid_num <- obs_table - exp_table
p_iplus <- rowSums (pro_table)
p_plusj <- colSums (pro_table)
resid_denom <- sqrt( exp_table * (1 - p_iplus) %*% t (1 - p_plusj)
)
resid_table <- resid_num / resid_denom
resid_table
}
I compare Siblings and Hair
SHtable <- ct.resids (data_1 = qanda$Siblings, data_2 = qanda$Hair)
SHtable
## data_2
## data_1 Black Blonde Brown Grey Red
## 0 1.54091264 -0.48780511 -1.66265543 1.14266218 0.18686634
## 1 -1.08358648 0.44043338 1.01330435 -0.80353244 0.31634883
## 2 -0.31946929 -0.32046855 0.70579286 -0.37312084 -0.52882846
## 3 -0.15202758 -0.60219174 0.57577622 -0.17755903 -0.25165646
## 4 -1.61119575 2.33258881 0.58115284 -0.11521098 -0.16328985
## 5 0.59538425 -0.22460554 -0.50109051 -0.06622599 -0.09386285
## 6 0.59538425 -0.22460554 -0.50109051 -0.06622599 -0.09386285
I compare Handspan and Shoes (I round the Handspan to the nearest 5)
Handspan5 <- round( qanda$Handspan / 5) * 5
HStable <- ct.resids (data_1 = Handspan5, data_2 = round(qanda$Shoes, -1))
HStable
## data_2
## data_1 0 10 20 30 40 50
## 0 -0.52144070 -1.05590686 2.00436206 -0.19024563 -0.13332818 -0.09386285
## 5 -0.73904391 -0.07420904 1.06530203 -0.26963732 -0.18896757 -0.13303290
## 10 -0.56418967 0.41412405 0.55733675 -1.19906809 -0.84033242 1.40938296
## 15 -0.64420981 1.10720311 -1.66215606 2.20870820 0.27641244 -0.70053830
## 20 1.05938587 -1.34682535 1.00437639 -0.50843300 0.66810124 -1.69323241
## 25 0.32114997 0.40793991 -1.03779863 -0.67545248 -0.47337146 2.86042027
## 30 -0.52144070 0.95118883 -0.50109051 -0.19024563 -0.13332818 -0.09386285
I compare Handspan and Height (Handspan round to the nearest 5, and round Height to the nearest 10)
Handspan5 <- round (qanda$Handspan / 5) * 5
Height10 <- round (qanda$Height, -1)
HHtable <- ct.resids (data_1 = Handspan5, data_2 = Height10)
HHtable
## data_2
## data_1 150 160 170 180 190 200
## 0 -0.14939633 -0.71096108 -0.60881786 1.48459664 -0.24529460 -0.06622599
## 5 -0.21174113 0.49728346 -0.86288456 0.57264458 -0.34765886 -0.09386285
## 10 -0.94160571 2.21140282 -0.06917187 -1.45963965 -0.74153344 -0.41740495
## 15 -1.11500818 1.03367614 0.32572516 -0.38959826 -1.11096047 -0.49427264
## 20 1.89649883 -1.19061532 0.48549691 0.21344666 -0.36559621 0.84070009
## 25 -0.53042018 -2.52421266 -1.49333650 2.07390608 4.26517509 -0.23513028
## 30 -0.14939633 -0.71096108 1.64969999 -0.67652505 -0.24529460 -0.06622599
cd <- read.csv ("Country_data.csv")
cd$continent[is.na(cd$continent)] <- "NAMe"
# is.na() creates a logical vector indicating which elements in the are NA
cd$continent <- as.factor (cd$continent)
cd$continent
## [1] E AF O SA AF SA A NAMe SA AF O E NAMe AF SA
## [16] NAMe A A E NAMe AF A O NAMe E O O AF A E
## [31] SA A E NAMe SA O
## Levels: A AF E NAMe O SA
cd
## country continent population gdp_head age_median kcals_day
## 1 Albania E 3142000 1526 28.5 2917
## 2 Angola AF 16618000 404 16.7 1916
## 3 Australia O 20395000 23929 36.6 3138
## 4 Brazil SA 183383000 3977 27.0 3082
## 5 Chad AF 10019000 308 16.9 2019
## 6 Chile SA 16267000 5979 30.6 2946
## 7 China A 1303719936 1464 32.1 2974
## 8 Cuba NAMe 11243000 3470 35.5 3300
## 9 Ecuador SA 13250000 1562 24.0 2316
## 10 Egypt AF 73044000 1600 22.4 3161
## 11 Fiji O 825000 2308 23.7 3025
## 12 Germany E 82464000 23564 42.1 3524
## 13 Grenada NAMe 103000 6366 23.3 2384
## 14 Guinea-Bissau AF 1326000 154 18.8 2274
## 15 Guyana SA 758000 989 26.1 2754
## 16 Haiti NAMe 9410000 381 20.3 1825
## 17 India A 1115318016 578 23.7 2235
## 18 Israel A 6930000 19968 28.8 3573
## 19 Macedonia, FYR E 2038500 1902 34.2 2869
## 20 Mexico NAMe 103947000 5983 25.5 3223
## 21 Mozambique AF 19420000 312 17.8 2055
## 22 Nepal A 25343000 237 20.1 2355
## 23 New Zealand O 4134000 15172 35.6 3147
## 24 Nicaragua NAMe 5450000 1100 20.4 2401
## 25 Poland E 38161000 5224 36.8 3377
## 26 Samoa O 183000 1742 19.0 2883
## 27 Solomon Islands O 490000 976 19.4 2435
## 28 South Africa AF 47335000 3398 23.9 2965
## 29 South Korea A 48138000 13802 35.0 3066
## 30 Spain E 43398000 15701 38.7 3308
## 31 Suriname SA 499000 2359 26.1 2482
## 32 United Arab Emirates A 3995000 33288 29.5 3066
## 33 United Kingdom E 60238000 28354 38.9 3431
## 34 United States NAMe 295896000 37718 36.0 3796
## 35 Uruguay SA 3306000 6967 32.7 2805
## 36 Vanuatu O 219000 1356 19.4 2715
summary(cd)[, 3:6]
## population gdp_head age_median kcals_day
## Min. :1.030e+05 Min. : 154 Min. :16.70 Min. :1825
## 1st Qu.:2.866e+06 1st Qu.: 1072 1st Qu.:20.38 1st Qu.:2397
## Median :1.225e+07 Median : 2334 Median :26.10 Median :2932
## Mean :9.918e+07 Mean : 7614 Mean :27.39 Mean :2826
## 3rd Qu.:4.754e+07 3rd Qu.: 8676 3rd Qu.:34.40 3rd Qu.:3150
## Max. :1.304e+09 Max. :37718 Max. :42.10 Max. :3796
hist (cd$population, prob = TRUE)
of log population
hist (log(cd$population), prob = TRUE)
lines(density(log(cd$population)), col = "blue")