DataM: Inclass Exercise 0330
In-class exercise 1.
Target output
[Solution and Answer]
- Load in the dataset and check the structure.
Loading required package: tools
- Use
tableto count.
- Divide
tempinto two parts bystatusand merge along the column-axis together.
temp <- merge(temp %>% filter(status == 'case') %>% dplyr::select(-status),
temp %>% filter(status == 'control') %>% dplyr::select(-status),
by=c('driver', 'suburban'))
temp- Compute the total count and revise the column names.
In-class exercise 2.
Merge the two data sets: state.x77{datasets} and USArrests{datasets} and compute all pair-wise correlations for numerical variables. Is there anything interesting to report?
[Solution and Answer]
- Get basic info of two datasets
'data.frame': 50 obs. of 8 variables:
$ Population: num 3615 365 2212 2110 21198 ...
$ Income : num 3624 6315 4530 3378 5114 ...
$ Illiteracy: num 2.1 1.5 1.8 1.9 1.1 0.7 1.1 0.9 1.3 2 ...
$ Life Exp : num 69 69.3 70.5 70.7 71.7 ...
$ Murder : num 15.1 11.3 7.8 10.1 10.3 6.8 3.1 6.2 10.7 13.9 ...
$ HS Grad : num 41.3 66.7 58.1 39.9 62.6 63.9 56 54.6 52.6 40.6 ...
$ Frost : num 20 152 15 65 20 166 139 103 11 60 ...
$ Area : num 50708 566432 113417 51945 156361 ...
'data.frame': 50 obs. of 4 variables:
$ Murder : num 13.2 10 8.1 8.8 9 7.9 3.3 5.9 15.4 17.4 ...
$ Assault : int 236 263 294 190 276 204 110 238 335 211 ...
$ UrbanPop: int 58 48 80 50 91 78 77 72 80 60 ...
$ Rape : num 21.2 44.5 31 19.5 40.6 38.7 11.1 15.8 31.9 25.8 ...
The intersected variable of two dataset is Murder, but we can know that Murder actually contains different data by using help(data_name). Therefore, both Murder should be kept. We should merge two dataset by state, which is currently their rownames.
dta1 <- state.x77 %>% as.data.table %>% mutate(State = rownames(state.x77))
dta2 <- USArrests %>% as.data.table %>% mutate(State = rownames(USArrests))
dta_M <- merge(dta1, dta2, by='State')- Check the no. of variables and the structure
[1] TRUE
'data.frame': 50 obs. of 13 variables:
$ State : chr "Alabama" "Alaska" "Arizona" "Arkansas" ...
$ Population: num 3615 365 2212 2110 21198 ...
$ Income : num 3624 6315 4530 3378 5114 ...
$ Illiteracy: num 2.1 1.5 1.8 1.9 1.1 0.7 1.1 0.9 1.3 2 ...
$ Life Exp : num 69 69.3 70.5 70.7 71.7 ...
$ Murder.x : num 15.1 11.3 7.8 10.1 10.3 6.8 3.1 6.2 10.7 13.9 ...
$ HS Grad : num 41.3 66.7 58.1 39.9 62.6 63.9 56 54.6 52.6 40.6 ...
$ Frost : num 20 152 15 65 20 166 139 103 11 60 ...
$ Area : num 50708 566432 113417 51945 156361 ...
$ Murder.y : num 13.2 10 8.1 8.8 9 7.9 3.3 5.9 15.4 17.4 ...
$ Assault : int 236 263 294 190 276 204 110 238 335 211 ...
$ UrbanPop : int 58 48 80 50 91 78 77 72 80 60 ...
$ Rape : num 21.2 44.5 31 19.5 40.6 38.7 11.1 15.8 31.9 25.8 ...
- Get the correlation coefficients and visualize them.
Population Income Illiteracy Life Exp Murder.x
Population 1.00000000 0.20822756 0.10762237 -0.06805195 0.34364275
Income 0.20822756 1.00000000 -0.43707519 0.34025534 -0.23007761
Illiteracy 0.10762237 -0.43707519 1.00000000 -0.58847793 0.70297520
Life Exp -0.06805195 0.34025534 -0.58847793 1.00000000 -0.78084575
Murder.x 0.34364275 -0.23007761 0.70297520 -0.78084575 1.00000000
HS Grad -0.09848975 0.61993232 -0.65718861 0.58221620 -0.48797102
Frost -0.33215245 0.22628218 -0.67194697 0.26206801 -0.53888344
Area 0.02254384 0.36331544 0.07726113 -0.10733194 0.22839021
Murder.y 0.32024487 -0.21520501 0.70677564 -0.77849850 0.93369089
Assault 0.31702281 0.04093255 0.51101299 -0.62665800 0.73976479
UrbanPop 0.51260491 0.48053302 -0.06219936 0.27146824 0.01638255
Rape 0.30523361 0.35738678 0.15459686 -0.26956828 0.57996132
HS Grad Frost Area Murder.y Assault
Population -0.09848975 -0.3321525 0.02254384 0.32024487 0.31702281
Income 0.61993232 0.2262822 0.36331544 -0.21520501 0.04093255
Illiteracy -0.65718861 -0.6719470 0.07726113 0.70677564 0.51101299
Life Exp 0.58221620 0.2620680 -0.10733194 -0.77849850 -0.62665800
Murder.x -0.48797102 -0.5388834 0.22839021 0.93369089 0.73976479
HS Grad 1.00000000 0.3667797 0.33354187 -0.52159126 -0.23030510
Frost 0.36677970 1.0000000 0.05922910 -0.54139702 -0.46823989
Area 0.33354187 0.0592291 1.00000000 0.14808597 0.23120879
Murder.y -0.52159126 -0.5413970 0.14808597 1.00000000 0.80187331
Assault -0.23030510 -0.4682399 0.23120879 0.80187331 1.00000000
UrbanPop 0.35868123 -0.2461862 -0.06154747 0.06957262 0.25887170
Rape 0.27072504 -0.2792054 0.52495510 0.56357883 0.66524123
UrbanPop Rape
Population 0.51260491 0.3052336
Income 0.48053302 0.3573868
Illiteracy -0.06219936 0.1545969
Life Exp 0.27146824 -0.2695683
Murder.x 0.01638255 0.5799613
HS Grad 0.35868123 0.2707250
Frost -0.24618618 -0.2792054
Area -0.06154747 0.5249551
Murder.y 0.06957262 0.5635788
Assault 0.25887170 0.6652412
UrbanPop 1.00000000 0.4113412
Rape 0.41134124 1.0000000
Among the positive correlation coefficients, the correlation between
Murder.xandMurder.yis the highest one. It is not sprised to observe that the murder and non-negligent manslaughter rate in 1976 (Murderinstate.x77) highly correlated with the murder rate in 1973 (MurderinUSArrests).As for negative correlation coefficients, it is interesting to see that
Life Exp(life expectancy in 1969–71) negatively correlated withMurder.x,Murder.y,Assault, andRape. This indicates that a state with higher life expectancy would have lower measurements related to the crime.Life Expnegatively correlated withIlliteracyas well. States with common education and well health care (lead to higher life expectancy) are often states where fewer crimes take place.
In-class exercise 4.
The data set Vocab{car} gives observations on gender, education and vocabulary, from respondents to U.S. General Social Surveys, 1972-2004. Summarize the relationship between education and vocabulary over the years by gender.
[Solution and Answer]
Import the data and check its structure.
Loading required package: carData
Attaching package: 'car'
The following object is masked from 'package:dplyr':
recode
'data.frame': 30351 obs. of 4 variables:
$ year : num 1974 1974 1974 1974 1974 ...
$ sex : Factor w/ 2 levels "Female","Male": 2 2 1 1 1 2 2 2 1 1 ...
$ education : num 14 16 10 10 12 16 17 10 12 11 ...
$ vocabulary: num 9 9 9 5 8 8 9 5 3 5 ...
- attr(*, "na.action")= 'omit' Named int 1 2 3 4 5 6 7 8 9 10 ...
..- attr(*, "names")= chr "19720001" "19720002" "19720003" "19720004" ...
- Group data into subsets by the gender.
Focus on the female subset first.
- Visualize raw data of
educationandvocabularywith xyplot and grouping on years.
It seems that the linear relationships between education and vocabulary for each year are similar.
- Compute the correlation coefficients between
educationandvocabularyfor each year. Visualize them.
plot(yearCor_F, xlab='year', ylab='Correlation coefficent',
main='Visualization of Correlation Coefficents of Education and Vocabulary for Each Year',
ylim=c(0, 1), xlim=c(min(yearCor_F$year), max(yearCor_F$year)), type='h', lwd=2, xaxt='n')
axis(side=1, at=yearCor_F$year, cex.axis=0.9)
abline(h=c(0, .2, .8), lty=3, col="gray")- Compute the general correlation coefficient betweent
educationandvocabulary.
[1] 0.4591931
[Conclusion - 1] For years in 1974-2016, there is a positive medium correlation between education and vocabulary in females in the US.
Then focus on male subset.
- Visualize raw data of
educationandvocabularywith xyplot and grouping on years.
Similar to the finding in female data, it seems that the linear relationships between education and vocabulary for each year are similar in male data as well.
- Compute the correlation coefficients between
educationandvocabularyfor each year. Visualize them.
plot(yearCor_M, xlab='year', ylab='Correlation coefficent',
main='Visualization of Correlation Coefficents of Education and Vocabulary for Each Year',
ylim=c(0, 1), xlim=c(min(yearCor_M$year), max(yearCor_M$year)), type='h', lwd=2, xaxt='n')
axis(side=1, at=yearCor_M$year, cex.axis=0.9)
abline(h=c(0, .2, .8), lty=3, col="gray")- Compute the general correlation coefficient betweent
educationandvocabulary.
[1] 0.504252
[Conclusion - 2] For years in 1974-2016, there is a positive medium correlation between education and vocabulary in males in the US. The correlation between education and vocabulary is higher for males.
In-class exercise 5.
The MASS library has these two data sets: Animals and mammals. Merge the two files and remove duplicated observations using duplicated.
[Solution and Answer]
- Load in the datasets and check their structures.
Attaching package: 'MASS'
The following object is masked from 'package:dplyr':
select
'data.frame': 28 obs. of 2 variables:
$ body : num 1.35 465 36.33 27.66 1.04 ...
$ brain: num 8.1 423 119.5 115 5.5 ...
'data.frame': 62 obs. of 2 variables:
$ body : num 3.38 0.48 1.35 465 36.33 ...
$ brain: num 44.5 15.5 8.1 423 119.5 ...
- Merge two datasets and find duplicated rows
- Remove duplicated observations using
duplicated, which is equivalent to usingunique.
'data.frame': 65 obs. of 2 variables:
$ body : num 1.35 465 36.33 27.66 1.04 ...
$ brain: num 8.1 423 119.5 115 5.5 ...
'data.frame': 65 obs. of 2 variables:
$ body : num 1.35 465 36.33 27.66 1.04 ...
$ brain: num 8.1 423 119.5 115 5.5 ...
In-class exercise 6.
Convert the data set probe words from long to wide format as described.
Dr. Paul Ammon at the University of California at Berkeley studied whether a person’s immediate memory of a sentence is organized according to the sentence’s phrase structure. In the study, subject listened to tape-recorded sentences. Each sentence was followed by a “probe word” taken from one of five positions in the sentence. The subject was to respond with the word that came immediately after the probe word in the sentence.
Source: Timm, N. (1975). Multivariate Analysis, with Applications in Education and Psychology. p. 232.
- Column 1: Subject ID
- Column 2: Response time
- Column 3: Position
wide_format
- Column 1: Subject ID
- Column 2: Speed of response for the proble word in the first position (transformed reaction time)
- Column 3: Speed of response for the proble word in the second position
- Column 4: Speed of response for the proble word in the third position
- Column 5: Speed of response for the proble word in the fourth position
- Column 6: Speed of response for the proble word in the fifth position
[Solution and Answer]
Load in the datasets and check their structures.
'data.frame': 55 obs. of 3 variables:
$ ID : Factor w/ 11 levels "S01","S02","S03",..: 1 1 1 1 1 2 2 2 2 2 ...
$ Response_Time: int 51 36 50 35 42 27 20 26 17 27 ...
$ Position : int 1 2 3 4 5 1 2 3 4 5 ...
Method 1 - Basic R: unstack()
- Unstack the data frame without ID column.
- Rename rows and columns and then get what we want.
rownames(dta6_unstack) <- unique(dta6$ID)
colnames(dta6_unstack) <- paste0('Pos_', unique(dta6$Position))
dta6_unstackSince some classmates have proposed this method, I try to figure out the following method:
Method 2 - Basic R: list -> matrix -> data.frame
- Use
tapplyto combine (functionc)Response_TimeusingPositionas the index. The output is a list-like array.
- Use
sapplyto turn the list into a matrix with correct dimension, then turn it into a data frame.
- Rename rows and columns and then get what we want.
rownames(dta6_unstack2) <- unique(dta6$ID)
colnames(dta6_unstack2) <- paste0('Pos_', unique(dta6$Position))
dta6_unstack2