0330 In-class-exercise 1-3
Exercise 1
Summarize the backpain{HSAUR3} into the following format:
loading data
check data structure
## 'data.frame': 434 obs. of 4 variables:
## $ ID : Factor w/ 217 levels "1","2","3","4",..: 1 1 2 2 3 3 4 4 5 5 ...
## $ status : Factor w/ 2 levels "case","control": 1 2 1 2 1 2 1 2 1 2 ...
## $ driver : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 1 1 2 2 ...
## $ suburban: Factor w/ 2 levels "no","yes": 2 1 2 2 1 2 1 1 1 2 ...
## ID status driver suburban
## 1 1 case yes yes
## 2 1 control yes no
## 3 2 case yes yes
## 4 2 control yes yes
## 5 3 case yes no
## 6 3 control yes yes
grouped by status and summarize the numbers in the groups
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
dta %>%
group_by(driver, suburban) %>%
summarize(case=sum(status=="case"),
control=sum(status=="control"),
tot=n())## # A tibble: 4 x 5
## # Groups: driver [2]
## driver suburban case control tot
## <fct> <fct> <int> <int> <int>
## 1 no no 26 47 73
## 2 no yes 6 7 13
## 3 yes no 64 63 127
## 4 yes yes 121 100 221
Exercise 2
Merge the two data sets: state.x77{datasets} and USArrests{datasets} and compute all pair-wise correlations for numerical variables. Is there anything interesting to report?
loading data
check data detail
## Population Income Illiteracy Life.Exp Murder HS.Grad Frost Area
## Alabama 3615 3624 2.1 69.05 15.1 41.3 20 50708
## Alaska 365 6315 1.5 69.31 11.3 66.7 152 566432
## Arizona 2212 4530 1.8 70.55 7.8 58.1 15 113417
## Arkansas 2110 3378 1.9 70.66 10.1 39.9 65 51945
## California 21198 5114 1.1 71.71 10.3 62.6 20 156361
## Colorado 2541 4884 0.7 72.06 6.8 63.9 166 103766
## Murder Assault UrbanPop Rape
## Alabama 13.2 236 58 21.2
## Alaska 10.0 263 48 44.5
## Arizona 8.1 294 80 31.0
## Arkansas 8.8 190 50 19.5
## California 9.0 276 91 40.6
## Colorado 7.9 204 78 38.7
rownames as new column
marge dta1 and dta2 by state, remove missing value
check merged dataset
## 'data.frame': 50 obs. of 13 variables:
## $ state : chr "Alabama" "Alaska" "Arizona" "Arkansas" ...
## $ Population: num 3615 365 2212 2110 21198 ...
## $ Income : num 3624 6315 4530 3378 5114 ...
## $ Illiteracy: num 2.1 1.5 1.8 1.9 1.1 0.7 1.1 0.9 1.3 2 ...
## $ Life.Exp : num 69 69.3 70.5 70.7 71.7 ...
## $ Murder.x : num 15.1 11.3 7.8 10.1 10.3 6.8 3.1 6.2 10.7 13.9 ...
## $ HS.Grad : num 41.3 66.7 58.1 39.9 62.6 63.9 56 54.6 52.6 40.6 ...
## $ Frost : num 20 152 15 65 20 166 139 103 11 60 ...
## $ Area : num 50708 566432 113417 51945 156361 ...
## $ Murder.y : num 13.2 10 8.1 8.8 9 7.9 3.3 5.9 15.4 17.4 ...
## $ Assault : int 236 263 294 190 276 204 110 238 335 211 ...
## $ UrbanPop : int 58 48 80 50 91 78 77 72 80 60 ...
## $ Rape : num 21.2 44.5 31 19.5 40.6 38.7 11.1 15.8 31.9 25.8 ...
## state Population Income Illiteracy Life.Exp Murder.x HS.Grad Frost
## 1 Alabama 3615 3624 2.1 69.05 15.1 41.3 20
## 2 Alaska 365 6315 1.5 69.31 11.3 66.7 152
## 3 Arizona 2212 4530 1.8 70.55 7.8 58.1 15
## 4 Arkansas 2110 3378 1.9 70.66 10.1 39.9 65
## 5 California 21198 5114 1.1 71.71 10.3 62.6 20
## 6 Colorado 2541 4884 0.7 72.06 6.8 63.9 166
## Area Murder.y Assault UrbanPop Rape
## 1 50708 13.2 236 58 21.2
## 2 566432 10.0 263 48 44.5
## 3 113417 8.1 294 80 31.0
## 4 51945 8.8 190 50 19.5
## 5 156361 9.0 276 91 40.6
## 6 103766 7.9 204 78 38.7
corrleation note: cor(dtam) : ‘x’ must be numeric
## Population Income Illiteracy Life.Exp Murder.x
## Population 1.00000000 0.20822756 0.10762237 -0.06805195 0.34364275
## Income 0.20822756 1.00000000 -0.43707519 0.34025534 -0.23007761
## Illiteracy 0.10762237 -0.43707519 1.00000000 -0.58847793 0.70297520
## Life.Exp -0.06805195 0.34025534 -0.58847793 1.00000000 -0.78084575
## Murder.x 0.34364275 -0.23007761 0.70297520 -0.78084575 1.00000000
## HS.Grad -0.09848975 0.61993232 -0.65718861 0.58221620 -0.48797102
## Frost -0.33215245 0.22628218 -0.67194697 0.26206801 -0.53888344
## Area 0.02254384 0.36331544 0.07726113 -0.10733194 0.22839021
## Murder.y 0.32024487 -0.21520501 0.70677564 -0.77849850 0.93369089
## Assault 0.31702281 0.04093255 0.51101299 -0.62665800 0.73976479
## UrbanPop 0.51260491 0.48053302 -0.06219936 0.27146824 0.01638255
## Rape 0.30523361 0.35738678 0.15459686 -0.26956828 0.57996132
## HS.Grad Frost Area Murder.y Assault
## Population -0.09848975 -0.3321525 0.02254384 0.32024487 0.31702281
## Income 0.61993232 0.2262822 0.36331544 -0.21520501 0.04093255
## Illiteracy -0.65718861 -0.6719470 0.07726113 0.70677564 0.51101299
## Life.Exp 0.58221620 0.2620680 -0.10733194 -0.77849850 -0.62665800
## Murder.x -0.48797102 -0.5388834 0.22839021 0.93369089 0.73976479
## HS.Grad 1.00000000 0.3667797 0.33354187 -0.52159126 -0.23030510
## Frost 0.36677970 1.0000000 0.05922910 -0.54139702 -0.46823989
## Area 0.33354187 0.0592291 1.00000000 0.14808597 0.23120879
## Murder.y -0.52159126 -0.5413970 0.14808597 1.00000000 0.80187331
## Assault -0.23030510 -0.4682399 0.23120879 0.80187331 1.00000000
## UrbanPop 0.35868123 -0.2461862 -0.06154747 0.06957262 0.25887170
## Rape 0.27072504 -0.2792054 0.52495510 0.56357883 0.66524123
## UrbanPop Rape
## Population 0.51260491 0.3052336
## Income 0.48053302 0.3573868
## Illiteracy -0.06219936 0.1545969
## Life.Exp 0.27146824 -0.2695683
## Murder.x 0.01638255 0.5799613
## HS.Grad 0.35868123 0.2707250
## Frost -0.24618618 -0.2792054
## Area -0.06154747 0.5249551
## Murder.y 0.06957262 0.5635788
## Assault 0.25887170 0.6652412
## UrbanPop 1.00000000 0.4113412
## Rape 0.41134124 1.0000000
1.“Life exp” is negative correlated with “Murder.x”, “Murder.y”, “Illiteracy”, “Assault” and “Rape”.
2. Interestly, “Illiteracy” positively correlated with “Murder.x” and “Murder.y”.
Exercise 3
Supply comments to each code chunk in the following survey rmarkdown file and preview it as an R notebook or knit to html.
The data set concerns species and weight of animals caught in plots in a study area in Arizona over time.
Each row holds information for a single animal, and the columns represent:
- record_id: Unique id for the observation
- month: month of observation
- day: day of observation
- year: year of observation
- plot_id: ID of a particular plot
- species_id: 2-letter code
- sex: sex of animal (“M”, “F”)
- hindfoot_length: length of the hindfoot in mm
- weight: weight of the animal in grams
- genus: genus of animal
- species: species of animal
- taxa: e.g. Rodent, Reptile, Bird, Rabbit
- plot_type: type of plot
using package{packman} to load tidyverse package
loading the dataset via URL by using read.csv and name the dataset as “dta”
## Parsed with column specification:
## cols(
## record_id = col_double(),
## month = col_double(),
## day = col_double(),
## year = col_double(),
## plot_id = col_double(),
## species_id = col_character(),
## sex = col_character(),
## hindfoot_length = col_double(),
## weight = col_double(),
## genus = col_character(),
## species = col_character(),
## taxa = col_character(),
## plot_type = col_character()
## )
check data structure, includes the dimension and the names of variables
## Observations: 34,786
## Variables: 13
## $ record_id <dbl> 1, 72, 224, 266, 349, 363, 435, 506, 588, 661, 748,...
## $ month <dbl> 7, 8, 9, 10, 11, 11, 12, 1, 2, 3, 4, 5, 6, 8, 9, 10...
## $ day <dbl> 16, 19, 13, 16, 12, 12, 10, 8, 18, 11, 8, 6, 9, 5, ...
## $ year <dbl> 1977, 1977, 1977, 1977, 1977, 1977, 1977, 1978, 197...
## $ plot_id <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ species_id <chr> "NL", "NL", "NL", "NL", "NL", "NL", "NL", "NL", "NL...
## $ sex <chr> "M", "M", NA, NA, NA, NA, NA, NA, "M", NA, NA, "M",...
## $ hindfoot_length <dbl> 32, 31, NA, NA, NA, NA, NA, NA, NA, NA, NA, 32, NA,...
## $ weight <dbl> NA, NA, NA, NA, NA, NA, NA, NA, 218, NA, NA, 204, 2...
## $ genus <chr> "Neotoma", "Neotoma", "Neotoma", "Neotoma", "Neotom...
## $ species <chr> "albigula", "albigula", "albigula", "albigula", "al...
## $ taxa <chr> "Rodent", "Rodent", "Rodent", "Rodent", "Rodent", "...
## $ plot_type <chr> "Control", "Control", "Control", "Control", "Contro...
information on dimension of data. 34786 rows and 13 columns
## [1] 34786 13
using select function in dplyr package to display first 6 rows information on variables: “plot_id”, “species_id”, “weight” in dta
## # A tibble: 6 x 3
## plot_id species_id weight
## <dbl> <chr> <dbl>
## 1 2 NL NA
## 2 2 NL NA
## 3 2 NL NA
## 4 2 NL NA
## 5 2 NL NA
## 6 2 NL NA
using select function in dplyr package to display first 6 rows information on variables except for: “record_id”, “species_id”
## # A tibble: 6 x 11
## month day year plot_id sex hindfoot_length weight genus species taxa
## <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <chr> <chr> <chr>
## 1 7 16 1977 2 M 32 NA Neot~ albigu~ Rode~
## 2 8 19 1977 2 M 31 NA Neot~ albigu~ Rode~
## 3 9 13 1977 2 <NA> NA NA Neot~ albigu~ Rode~
## 4 10 16 1977 2 <NA> NA NA Neot~ albigu~ Rode~
## 5 11 12 1977 2 <NA> NA NA Neot~ albigu~ Rode~
## 6 11 12 1977 2 <NA> NA NA Neot~ albigu~ Rode~
## # ... with 1 more variable: plot_type <chr>
using filter function in dplyr package to display first 6 rows information on assigned year =1995 in dta
## # A tibble: 6 x 13
## record_id month day year plot_id species_id sex hindfoot_length weight
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl>
## 1 22314 6 7 1995 2 NL M 34 NA
## 2 22728 9 23 1995 2 NL F 32 165
## 3 22899 10 28 1995 2 NL F 32 171
## 4 23032 12 2 1995 2 NL F 33 NA
## 5 22003 1 11 1995 2 DM M 37 41
## 6 22042 2 4 1995 2 DM F 36 45
## # ... with 4 more variables: genus <chr>, species <chr>, taxa <chr>,
## # plot_type <chr>
using filter and select function in dplyr package to display first 6 rows information according to the (1) and (2) condition in dta
1. function {filter} picked up those weight less than or equal to 5 in dta
2. function {select} picked up variables: “species_id”, “sex” and “weight”
## # A tibble: 6 x 3
## species_id sex weight
## <chr> <chr> <dbl>
## 1 PF M 5
## 2 PF F 5
## 3 PF F 5
## 4 PF F 4
## 5 PF F 5
## 6 PF F 4
same as the previous explanation
## # A tibble: 6 x 3
## species_id sex weight
## <chr> <chr> <dbl>
## 1 PF M 5
## 2 PF F 5
## 3 PF F 5
## 4 PF F 4
## 5 PF F 5
## 6 PF F 4
Function {mutate} to create new variables to convert unit of weight(g): (1)weight_kg; (2)weight_lb then display first 6 rows of information on dta
## # A tibble: 6 x 15
## record_id month day year plot_id species_id sex hindfoot_length weight
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl>
## 1 1 7 16 1977 2 NL M 32 NA
## 2 72 8 19 1977 2 NL M 31 NA
## 3 224 9 13 1977 2 NL <NA> NA NA
## 4 266 10 16 1977 2 NL <NA> NA NA
## 5 349 11 12 1977 2 NL <NA> NA NA
## 6 363 11 12 1977 2 NL <NA> NA NA
## # ... with 6 more variables: genus <chr>, species <chr>, taxa <chr>,
## # plot_type <chr>, weight_kg <dbl>, weight_lb <dbl>
using function{filter} to pick value of “weight” is not a missing value then using function{group_by} to group dta by variables: “sex” and “species_id” then sum the mean of weight of each group then function{arrange} to sort descending order by mean_weight
## # A tibble: 6 x 3
## # Groups: sex [3]
## sex species_id mean_weight
## <chr> <chr> <dbl>
## 1 <NA> NL 168.
## 2 M NL 166.
## 3 F NL 154.
## 4 M SS 130
## 5 <NA> SH 130
## 6 M DS 122.
function{group_by} to group dta by “sex” then count observations in each group of sex
## # A tibble: 3 x 2
## sex n
## <chr> <int>
## 1 F 15690
## 2 M 17348
## 3 <NA> 1748
same as previous explanation
## # A tibble: 3 x 2
## sex n
## <chr> <int>
## 1 F 15690
## 2 M 17348
## 3 <NA> 1748
function{group_by} to group dta by “sex” then summarize the total observations in each group
## # A tibble: 3 x 2
## sex count
## <chr> <int>
## 1 F 15690
## 2 M 17348
## 3 <NA> 1748
function{group_by} to group dta by “sex” then summarize the total observations of total number of total non-missing value in each group
## # A tibble: 3 x 2
## sex count
## <chr> <int>
## 1 F 15690
## 2 M 17348
## 3 <NA> 1748
pick weight without missing value then group data by “genus” and “plot_id” then create new variable:mean_weight, calculate the mean of weight in each group and save the variables: “genus”, “plot_id” and mean_weight in the dta_gw
check data structure. Same as str() in base R
## Observations: 196
## Variables: 3
## Groups: genus [10]
## $ genus <chr> "Baiomys", "Baiomys", "Baiomys", "Baiomys", "Baiomys", ...
## $ plot_id <dbl> 1, 2, 3, 5, 18, 19, 20, 21, 1, 2, 3, 4, 5, 6, 7, 8, 9, ...
## $ mean_weight <dbl> 7.000000, 6.000000, 8.611111, 7.750000, 9.500000, 9.533...
using function{spread} to adds a new column for each value of genus, #long form to wide form and save as new dataset:dta_w
check data structure
## Observations: 24
## Variables: 11
## $ plot_id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, ...
## $ Baiomys <dbl> 7.000000, 6.000000, 8.611111, NA, 7.750000, NA, NA,...
## $ Chaetodipus <dbl> 22.19939, 25.11014, 24.63636, 23.02381, 17.98276, 2...
## $ Dipodomys <dbl> 60.23214, 55.68259, 52.04688, 57.52454, 51.11356, 5...
## $ Neotoma <dbl> 156.2222, 169.1436, 158.2414, 164.1667, 190.0370, 1...
## $ Onychomys <dbl> 27.67550, 26.87302, 26.03241, 28.09375, 27.01695, 2...
## $ Perognathus <dbl> 9.625000, 6.947368, 7.507812, 7.824427, 8.658537, 7...
## $ Peromyscus <dbl> 22.22222, 22.26966, 21.37037, 22.60000, 21.23171, 2...
## $ Reithrodontomys <dbl> 11.375000, 10.680556, 10.516588, 10.263158, 11.1545...
## $ Sigmodon <dbl> NA, 70.85714, 65.61404, 82.00000, 82.66667, 68.7777...
## $ Spermophilus <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
using function{spread} to adds a new column for each value of genus and mean_weight. Fill the missing value=0 then display the first 6 rows of data
## # A tibble: 6 x 11
## plot_id Baiomys Chaetodipus Dipodomys Neotoma Onychomys Perognathus Peromyscus
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 7 22.2 60.2 156. 27.7 9.62 22.2
## 2 2 6 25.1 55.7 169. 26.9 6.95 22.3
## 3 3 8.61 24.6 52.0 158. 26.0 7.51 21.4
## 4 4 0 23.0 57.5 164. 28.1 7.82 22.6
## 5 5 7.75 18.0 51.1 190. 27.0 8.66 21.2
## 6 6 0 24.9 58.6 180. 25.9 7.81 21.8
## # ... with 3 more variables: Reithrodontomys <dbl>, Sigmodon <dbl>,
## # Spermophilus <dbl>
using function{gather} to collect different class and a single column with values of mean_weight, place them in the genus column and drop out “plot_id” column. Save as dta_1
check dta_1 data structure
## Observations: 240
## Variables: 3
## $ plot_id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, ...
## $ genus <chr> "Baiomys", "Baiomys", "Baiomys", "Baiomys", "Baiomys", ...
## $ mean_weight <dbl> 7.000000, 6.000000, 8.611111, NA, 7.750000, NA, NA, NA,...
1.Select columns from Baiomys to Spermophilus
2.collect different class and a single column with values of mean_weight, place them in the genus column
3. Display the first 6 rows data
## # A tibble: 6 x 3
## plot_id genus mean_weight
## <dbl> <chr> <dbl>
## 1 1 Baiomys 7
## 2 2 Baiomys 6
## 3 3 Baiomys 8.61
## 4 4 Baiomys NA
## 5 5 Baiomys 7.75
## 6 6 Baiomys NA
pick data without missing value in wieght, hindfoot_length, sex then save data to dta_complete
First to count the number of species_id in dta_complete. Then pick data with number of species_id more than 50 then save data to species_counts
keep data with those species_id are consisntent with the species_id in dataset:species_count then save as dta_complete