DataM: Inclass Exercise 0330

library(dplyr)
library(data.table)

In-class exercise 1.

$Target output$

Target output

[Solution and Answer]

Load in the dataset and check the structure.

library(HSAUR3)

Loading required package: tools

data("backpain")
head(backpain)

Use table to count.

temp <- table(backpain[-1]) %>% as_tibble()
temp

Divide temp into two parts by status and merge along the column-axis together.

temp <- merge(temp %>% filter(status == 'case') %>% dplyr::select(-status),
              temp %>% filter(status == 'control') %>% dplyr::select(-status),
              by=c('driver', 'suburban'))
temp

Compute the total count and revise the column names.

colnames(temp)[3:4] <- c('case', 'control')
temp %>% mutate(total = case + control)

In-class exercise 2.

Merge the two data sets: state.x77{datasets} and USArrests{datasets} and compute all pair-wise correlations for numerical variables. Is there anything interesting to report?

[Solution and Answer]

Get basic info of two datasets

library(datasets)
state.x77 %>% as.data.frame %>% str

'data.frame':   50 obs. of  8 variables:
 $ Population: num  3615 365 2212 2110 21198 ...
 $ Income    : num  3624 6315 4530 3378 5114 ...
 $ Illiteracy: num  2.1 1.5 1.8 1.9 1.1 0.7 1.1 0.9 1.3 2 ...
 $ Life Exp  : num  69 69.3 70.5 70.7 71.7 ...
 $ Murder    : num  15.1 11.3 7.8 10.1 10.3 6.8 3.1 6.2 10.7 13.9 ...
 $ HS Grad   : num  41.3 66.7 58.1 39.9 62.6 63.9 56 54.6 52.6 40.6 ...
 $ Frost     : num  20 152 15 65 20 166 139 103 11 60 ...
 $ Area      : num  50708 566432 113417 51945 156361 ...

USArrests %>% as.data.frame %>% str

'data.frame':   50 obs. of  4 variables:
 $ Murder  : num  13.2 10 8.1 8.8 9 7.9 3.3 5.9 15.4 17.4 ...
 $ Assault : int  236 263 294 190 276 204 110 238 335 211 ...
 $ UrbanPop: int  58 48 80 50 91 78 77 72 80 60 ...
 $ Rape    : num  21.2 44.5 31 19.5 40.6 38.7 11.1 15.8 31.9 25.8 ...

The intersected variable of two dataset is Murder, but we can know that Murder actually contains different data by using help(data_name). Therefore, both Murder should be kept. We should merge two dataset by state, which is currently their rownames.

dta1 <- state.x77 %>% as.data.table %>% mutate(State = rownames(state.x77))
dta2 <- USArrests %>% as.data.table %>% mutate(State = rownames(USArrests))
dta_M <- merge(dta1, dta2, by='State')

Check the no. of variables and the structure

ncol(dta_M) == ncol(dta1) + ncol(dta2) - 1

[1] TRUE

str(dta_M)

'data.frame':   50 obs. of  13 variables:
 $ State     : chr  "Alabama" "Alaska" "Arizona" "Arkansas" ...
 $ Population: num  3615 365 2212 2110 21198 ...
 $ Income    : num  3624 6315 4530 3378 5114 ...
 $ Illiteracy: num  2.1 1.5 1.8 1.9 1.1 0.7 1.1 0.9 1.3 2 ...
 $ Life Exp  : num  69 69.3 70.5 70.7 71.7 ...
 $ Murder.x  : num  15.1 11.3 7.8 10.1 10.3 6.8 3.1 6.2 10.7 13.9 ...
 $ HS Grad   : num  41.3 66.7 58.1 39.9 62.6 63.9 56 54.6 52.6 40.6 ...
 $ Frost     : num  20 152 15 65 20 166 139 103 11 60 ...
 $ Area      : num  50708 566432 113417 51945 156361 ...
 $ Murder.y  : num  13.2 10 8.1 8.8 9 7.9 3.3 5.9 15.4 17.4 ...
 $ Assault   : int  236 263 294 190 276 204 110 238 335 211 ...
 $ UrbanPop  : int  58 48 80 50 91 78 77 72 80 60 ...
 $ Rape      : num  21.2 44.5 31 19.5 40.6 38.7 11.1 15.8 31.9 25.8 ...

Get the correlation coefficients and visualize them.

mtx_cor <- cor(dta_M %>% select(-State)); mtx_cor

            Population      Income  Illiteracy    Life Exp    Murder.x
Population  1.00000000  0.20822756  0.10762237 -0.06805195  0.34364275
Income      0.20822756  1.00000000 -0.43707519  0.34025534 -0.23007761
Illiteracy  0.10762237 -0.43707519  1.00000000 -0.58847793  0.70297520
Life Exp   -0.06805195  0.34025534 -0.58847793  1.00000000 -0.78084575
Murder.x    0.34364275 -0.23007761  0.70297520 -0.78084575  1.00000000
HS Grad    -0.09848975  0.61993232 -0.65718861  0.58221620 -0.48797102
Frost      -0.33215245  0.22628218 -0.67194697  0.26206801 -0.53888344
Area        0.02254384  0.36331544  0.07726113 -0.10733194  0.22839021
Murder.y    0.32024487 -0.21520501  0.70677564 -0.77849850  0.93369089
Assault     0.31702281  0.04093255  0.51101299 -0.62665800  0.73976479
UrbanPop    0.51260491  0.48053302 -0.06219936  0.27146824  0.01638255
Rape        0.30523361  0.35738678  0.15459686 -0.26956828  0.57996132
               HS Grad      Frost        Area    Murder.y     Assault
Population -0.09848975 -0.3321525  0.02254384  0.32024487  0.31702281
Income      0.61993232  0.2262822  0.36331544 -0.21520501  0.04093255
Illiteracy -0.65718861 -0.6719470  0.07726113  0.70677564  0.51101299
Life Exp    0.58221620  0.2620680 -0.10733194 -0.77849850 -0.62665800
Murder.x   -0.48797102 -0.5388834  0.22839021  0.93369089  0.73976479
HS Grad     1.00000000  0.3667797  0.33354187 -0.52159126 -0.23030510
Frost       0.36677970  1.0000000  0.05922910 -0.54139702 -0.46823989
Area        0.33354187  0.0592291  1.00000000  0.14808597  0.23120879
Murder.y   -0.52159126 -0.5413970  0.14808597  1.00000000  0.80187331
Assault    -0.23030510 -0.4682399  0.23120879  0.80187331  1.00000000
UrbanPop    0.35868123 -0.2461862 -0.06154747  0.06957262  0.25887170
Rape        0.27072504 -0.2792054  0.52495510  0.56357883  0.66524123
              UrbanPop       Rape
Population  0.51260491  0.3052336
Income      0.48053302  0.3573868
Illiteracy -0.06219936  0.1545969
Life Exp    0.27146824 -0.2695683
Murder.x    0.01638255  0.5799613
HS Grad     0.35868123  0.2707250
Frost      -0.24618618 -0.2792054
Area       -0.06154747  0.5249551
Murder.y    0.06957262  0.5635788
Assault     0.25887170  0.6652412
UrbanPop    1.00000000  0.4113412
Rape        0.41134124  1.0000000

corrplot::corrplot(mtx_cor)

Among the positive correlation coefficients, the correlation between Murder.x and Murder.y is the highest one. It is not sprised to observe that the murder and non-negligent manslaughter rate in 1976 (Murder in state.x77) highly correlated with the murder rate in 1973 (Murder in USArrests).
As for negative correlation coefficients, it is interesting to see that Life Exp (life expectancy in 1969–71) negatively correlated with Murder.x, Murder.y, Assault, and Rape. This indicates that a state with higher life expectancy would have lower measurements related to the crime.
Life Exp negatively correlated with Illiteracy as well. States with common education and well health care (lead to higher life expectancy) are often states where fewer crimes take place.

In-class exercise 4.

The data set Vocab{car} gives observations on gender, education and vocabulary, from respondents to U.S. General Social Surveys, 1972-2004. Summarize the relationship between education and vocabulary over the years by gender.

[Solution and Answer]

Import the data and check its structure.

library(car)

Loading required package: carData


Attaching package: 'car'

The following object is masked from 'package:dplyr':

    recode

data('Vocab')
head(Vocab)

str(Vocab)

'data.frame':   30351 obs. of  4 variables:
 $ year      : num  1974 1974 1974 1974 1974 ...
 $ sex       : Factor w/ 2 levels "Female","Male": 2 2 1 1 1 2 2 2 1 1 ...
 $ education : num  14 16 10 10 12 16 17 10 12 11 ...
 $ vocabulary: num  9 9 9 5 8 8 9 5 3 5 ...
 - attr(*, "na.action")= 'omit' Named int  1 2 3 4 5 6 7 8 9 10 ...
  ..- attr(*, "names")= chr  "19720001" "19720002" "19720003" "19720004" ...

Group data into subsets by the gender.

library(dplyr)
library(data.table)
dta4_F <- Vocab %>% dplyr::filter(sex == 'Female') %>% as.data.table()
dta4_M <- Vocab %>% dplyr::filter(sex == 'Male') %>% as.data.table()

Focus on the female subset first.

Visualize raw data of education and vocabulary with xyplot and grouping on years.

lattice::xyplot(vocabulary ~ education, groups=year, data=dta4_F, type=c("g","r"), auto.key=list(columns=6))

It seems that the linear relationships between education and vocabulary for each year are similar.

Compute the correlation coefficients between education and vocabulary for each year. Visualize them.

yearCor_F <- dta4_F[, .(mCor = cor(education, vocabulary)), by=year]
yearCor_F

plot(yearCor_F, xlab='year', ylab='Correlation coefficent',
     main='Visualization of Correlation Coefficents of Education and Vocabulary for Each Year',
     ylim=c(0, 1), xlim=c(min(yearCor_F$year), max(yearCor_F$year)), type='h', lwd=2, xaxt='n')
axis(side=1, at=yearCor_F$year, cex.axis=0.9)
abline(h=c(0, .2, .8), lty=3, col="gray")

Compute the general correlation coefficient betweent education and vocabulary.

cor(dta4_F$education, dta4_F$vocabulary)

[1] 0.4591931

[Conclusion - 1] For years in 1974-2016, there is a positive medium correlation between education and vocabulary in females in the US.

Then focus on male subset.

Visualize raw data of education and vocabulary with xyplot and grouping on years.

lattice::xyplot(vocabulary ~ education, groups=year, data=dta4_F, type=c("g","r"), auto.key=list(columns=6))

Similar to the finding in female data, it seems that the linear relationships between education and vocabulary for each year are similar in male data as well.

Compute the correlation coefficients between education and vocabulary for each year. Visualize them.

yearCor_M <- dta4_M[, .(mCor = cor(education, vocabulary)), by=year]
yearCor_M

plot(yearCor_M, xlab='year', ylab='Correlation coefficent',
     main='Visualization of Correlation Coefficents of Education and Vocabulary for Each Year',
     ylim=c(0, 1), xlim=c(min(yearCor_M$year), max(yearCor_M$year)), type='h', lwd=2, xaxt='n')
axis(side=1, at=yearCor_M$year, cex.axis=0.9)
abline(h=c(0, .2, .8), lty=3, col="gray")

Compute the general correlation coefficient betweent education and vocabulary.

cor(dta4_M$education, dta4_M$vocabulary)

[1] 0.504252

[Conclusion - 2] For years in 1974-2016, there is a positive medium correlation between education and vocabulary in males in the US. The correlation between education and vocabulary is higher for males.

In-class exercise 5.

The MASS library has these two data sets: Animals and mammals. Merge the two files and remove duplicated observations using duplicated.

[Solution and Answer]

Load in the datasets and check their structures.

library(MASS)


Attaching package: 'MASS'

The following object is masked from 'package:dplyr':

    select

data("Animals")
head(Animals)

str(Animals)

'data.frame':   28 obs. of  2 variables:
 $ body : num  1.35 465 36.33 27.66 1.04 ...
 $ brain: num  8.1 423 119.5 115 5.5 ...

data("mammals")
head(mammals)

str(mammals)

'data.frame':   62 obs. of  2 variables:
 $ body : num  3.38 0.48 1.35 465 36.33 ...
 $ brain: num  44.5 15.5 8.1 423 119.5 ...

Merge two datasets and find duplicated rows

dta5 <- rbind(Animals, mammals)
dta5[duplicated(dta5),]

Remove duplicated observations using duplicated, which is equivalent to using unique.

str(unique(dta5))

'data.frame':   65 obs. of  2 variables:
 $ body : num  1.35 465 36.33 27.66 1.04 ...
 $ brain: num  8.1 423 119.5 115 5.5 ...

dta5 <- dta5[!duplicated(dta5),]
str(dta5)

'data.frame':   65 obs. of  2 variables:
 $ body : num  1.35 465 36.33 27.66 1.04 ...
 $ brain: num  8.1 423 119.5 115 5.5 ...

In-class exercise 6.

Convert the data set probe words from long to wide format as described.

Dr. Paul Ammon at the University of California at Berkeley studied whether a person’s immediate memory of a sentence is organized according to the sentence’s phrase structure. In the study, subject listened to tape-recorded sentences. Each sentence was followed by a “probe word” taken from one of five positions in the sentence. The subject was to respond with the word that came immediately after the probe word in the sentence.

Source: Timm, N. (1975). Multivariate Analysis, with Applications in Education and Psychology. p. 232.

Column 1: Subject ID
Column 2: Response time
Column 3: Position

wide_format

Column 1: Subject ID
Column 2: Speed of response for the proble word in the first position (transformed reaction time)
Column 3: Speed of response for the proble word in the second position
Column 4: Speed of response for the proble word in the third position
Column 5: Speed of response for the proble word in the fourth position
Column 6: Speed of response for the proble word in the fifth position

[Solution and Answer]

Load in the datasets and check their structures.

dta6 <- read.table('../data/probe_word.txt', header=TRUE)
head(dta6)

str(dta6)

'data.frame':   55 obs. of  3 variables:
 $ ID           : Factor w/ 11 levels "S01","S02","S03",..: 1 1 1 1 1 2 2 2 2 2 ...
 $ Response_Time: int  51 36 50 35 42 27 20 26 17 27 ...
 $ Position     : int  1 2 3 4 5 1 2 3 4 5 ...

Method 1 - Basic R: unstack()

Unstack the data frame without ID column.

dta6_unstack <- unstack(dta6[-1])

Rename rows and columns and then get what we want.

rownames(dta6_unstack) <- unique(dta6$ID)
colnames(dta6_unstack) <- paste0('Pos_', unique(dta6$Position))
dta6_unstack

Since some classmates have proposed this method, I try to figure out the following method:

Method 2 - Basic R: list -> matrix -> data.frame

Use tapply to combine (function c) Response_Time using Position as the index. The output is a list-like array.

l <- tapply(dta6$Response_Time, dta6$Position, c)

Use sapply to turn the list into a matrix with correct dimension, then turn it into a data frame.

dta6_unstack2 <- sapply(l, c) %>% as.data.frame()

Rename rows and columns and then get what we want.

rownames(dta6_unstack2) <- unique(dta6$ID)
colnames(dta6_unstack2) <- paste0('Pos_', unique(dta6$Position))
dta6_unstack2