3.30Data Wrangling homework4~6

Homework4

Reverse the order of input to the series of dplyr::*_join examples using data from the Nobel laureates in literature and explain the resulting output.

library(dplyr)

They share “Year” on the column.

nbl_c <- read.table("C:/Users/user/Desktop/nobel_countries.txt", h = T)
nbl_w <- read.table("C:/Users/user/Desktop/nobel_winners.txt", h = T)
head(nbl_c)

##   Country Year
## 1  France 2014
## 2      UK 1950
## 3      UK 2017
## 4      US 2016
## 5  Canada 2013
## 6   China 2012

head(nbl_w)

##                Name Gender Year
## 1   Patrick Modiano   Male 2014
## 2 Bertrand  Russell   Male 1950
## 3    Kazuo Ishiguro   Male 2017
## 4        Bob  Dylan   Male 2016
## 5      Alice  Munro Female 2013
## 6            Mo Yan   Male 2012

inner join The column “Year” is the column both datasets share.

We merge two datasets by “Year” and select only rows that match with two datasets is an “inner join”

we keep all columns from “nbl_w” and “nbl_c.”

dplyr::inner_join(nbl_c,nbl_w)%>% arrange(Year)

## Joining, by = "Year"

##   Country Year              Name Gender
## 1      UK 1950 Bertrand  Russell   Male
## 2   China 2012            Mo Yan   Male
## 3  Canada 2013      Alice  Munro Female
## 4  France 2014   Patrick Modiano   Male
## 5      US 2016        Bob  Dylan   Male
## 6      UK 2017    Kazuo Ishiguro   Male

semi join Return all rows from “nbl_c” where there are matching values in nbl_w.

With semi join, we keeping just columns from nbl_c.

Comparing with inner join, we omit the column “Name” and “Gender”

semi_join(nbl_c,nbl_w)%>% arrange(Year)

## Joining, by = "Year"

##   Country Year
## 1      UK 1950
## 2   China 2012
## 3  Canada 2013
## 4  France 2014
## 5      US 2016
## 6      UK 2017

left_join(nbl_c, nbl_w) The order of input is nbl_c then nbl_w, reverse order from teacher’s example.

Include everything on the left (what was the nbl_c data frame ) and all rows that match from the right (nbl_w) data frame.

Keep all rows from nbl_c , and all columns from nbl_c and nbl_w.(Year, Gender, Name. Country). All combination of the matches are returned.

left_join(nbl_c,nbl_w)%>% arrange(Year)

## Joining, by = "Year"

##   Country Year              Name Gender
## 1      UK 1950 Bertrand  Russell   Male
## 2  Sweden 2011              <NA>   <NA>
## 3   China 2012            Mo Yan   Male
## 4  Canada 2013      Alice  Munro Female
## 5  France 2014   Patrick Modiano   Male
## 6  Russia 2015              <NA>   <NA>
## 7      US 2016        Bob  Dylan   Male
## 8      UK 2017    Kazuo Ishiguro   Male

anti_join Finding rows that didn’t match is an anti join.

Return all rows from nbl_c where there are not matching values in nbl_w, keeping just columns from nbl_c.(Year, Country).

The result is the opposite of inner join

anti_join(nbl_c,nbl_w)

## Joining, by = "Year"

##   Country Year
## 1  Russia 2015
## 2  Sweden 2011

full_join

Return all rows and all columns from both nbl_c and nbl_w.

The result will be the same to basic R (merge() with all = TRUE)

full_join(nbl_c, nbl_w)

## Joining, by = "Year"

##   Country Year              Name Gender
## 1  France 2014   Patrick Modiano   Male
## 2      UK 1950 Bertrand  Russell   Male
## 3      UK 2017    Kazuo Ishiguro   Male
## 4      US 2016        Bob  Dylan   Male
## 5  Canada 2013      Alice  Munro Female
## 6   China 2012            Mo Yan   Male
## 7  Russia 2015              <NA>   <NA>
## 8  Sweden 2011              <NA>   <NA>
## 9    <NA> 1938        Pearl Buck Female

Homework5

Augment the data object in the ‘SAT’ lecture note with state.division{datasets}. For each of the 9 divisions, find the slope estimate for regressing average SAT scores onto average teacher’s salary. How many of them are of negative signs?

library(datasets)
library(lattice)

fL <- "http://www.amstat.org/publications/jse/datasets/sat.dat.txt"
dta <- read.table(fL, row.names=1)
head(dta)

##               V2   V3     V4 V5  V6  V7   V8
## Alabama    4.405 17.2 31.144  8 491 538 1029
## Alaska     8.963 17.6 47.951 47 445 489  934
## Arizona    4.778 19.3 32.175 27 448 496  944
## Arkansas   4.459 17.1 28.934  6 482 523 1005
## California 4.992 24.0 41.078 45 417 485  902
## Colorado   5.443 18.4 34.571 29 462 518  980

Re-code variable names for clarity

names(dta) <- c("Spending", "PTR", "Salary", "PE", "Verbal", "Math", "SAT")

Augment data with the built-in state.division{datasets}

There are 9 state divisions with the numbers of schools in each division.

dta$divisions <- state.division
with(dta, table(divisions))

## divisions
##        New England    Middle Atlantic     South Atlantic East South Central 
##                  6                  3                  8                  4 
## West South Central East North Central West North Central           Mountain 
##                  4                  5                  7                  8 
##            Pacific 
##                  5

lattice::xyplot(SAT ~ Salary, groups=divisions, data=dta, type=c('r', 'g'), auto.key=list(columns=3))

dta %>%
  group_by(divisions) %>%
  dplyr::summarize(r=cor(SAT, Salary))

## # A tibble: 9 x 2
##   divisions                r
##   <fct>                <dbl>
## 1 New England        -0.0830
## 2 Middle Atlantic     0.662 
## 3 South Atlantic      0.489 
## 4 East South Central -0.372 
## 5 West South Central -0.884 
## 6 East North Central  0.524 
## 7 West North Central -0.206 
## 8 Mountain           -0.729 
## 9 Pacific             0.0649

conclusion There are five divisions that average SAT scores onto average teacher’s salary are negative sins which includes New England, East South Central, West South Central, West North Central and mountain.

Homework6

The following R script is used to manage the data file at the initial stage of investigation. Provide comments on what each line of the script is meant to achieve.

run the code(eval=TRUE) but not display the code(echo=FALSE) in the output file
a string setting the prompt used for lines which continue more than one line.
The digits option controls how many digits are printed (3)
The width option says how many characters R should put on a line.

# echo=FALSE,eval=TRUE
options(continue="  ")
options(digits=3)
options(width=72) # narrow output

read the data from an online file
load “dplyr”
select some variables and name the dataframe “newds”

ds = read.csv("http://www.amherst.edu/~nhorton/r2/datasets/help.csv")
library(dplyr)
newds = select(ds, cesd, female, i1, i2, id, treat, f1a, f1b, f1c, f1d, f1e, 
  f1f, f1g, f1h, f1i, f1j, f1k, f1l, f1m, f1n, f1o, f1p, f1q, f1r, f1s, f1t)

display the names of the dataframe newds
display the structural of the first 10 variables

names(newds)

##  [1] "cesd"   "female" "i1"     "i2"     "id"     "treat"  "f1a"   
##  [8] "f1b"    "f1c"    "f1d"    "f1e"    "f1f"    "f1g"    "f1h"   
## [15] "f1i"    "f1j"    "f1k"    "f1l"    "f1m"    "f1n"    "f1o"   
## [22] "f1p"    "f1q"    "f1r"    "f1s"    "f1t"

str(newds[,1:10]) # structure of the first 10 variables

## 'data.frame':    453 obs. of  10 variables:
##  $ cesd  : int  49 30 39 15 39 6 52 32 50 46 ...
##  $ female: int  0 0 0 1 0 1 1 0 1 0 ...
##  $ i1    : int  13 56 0 5 10 4 13 12 71 20 ...
##  $ i2    : int  26 62 0 5 13 4 20 24 129 27 ...
##  $ id    : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ treat : int  1 1 0 0 0 1 0 1 0 1 ...
##  $ f1a   : int  3 3 3 0 3 1 3 1 3 2 ...
##  $ f1b   : int  2 2 2 0 0 0 1 1 2 3 ...
##  $ f1c   : int  3 0 3 1 3 1 3 2 3 3 ...
##  $ f1d   : int  0 3 0 3 3 3 1 3 1 0 ...

summary of the first 10 variables

summary(newds[,1:10]) # summary of the first 10 variables

##       cesd          female            i1              i2       
##  Min.   : 1.0   Min.   :0.000   Min.   :  0.0   Min.   :  0.0  
##  1st Qu.:25.0   1st Qu.:0.000   1st Qu.:  3.0   1st Qu.:  3.0  
##  Median :34.0   Median :0.000   Median : 13.0   Median : 15.0  
##  Mean   :32.8   Mean   :0.236   Mean   : 17.9   Mean   : 22.6  
##  3rd Qu.:41.0   3rd Qu.:0.000   3rd Qu.: 26.0   3rd Qu.: 32.0  
##  Max.   :60.0   Max.   :1.000   Max.   :142.0   Max.   :184.0  
##        id          treat            f1a            f1b      
##  Min.   :  1   Min.   :0.000   Min.   :0.00   Min.   :0.00  
##  1st Qu.:119   1st Qu.:0.000   1st Qu.:1.00   1st Qu.:0.00  
##  Median :233   Median :0.000   Median :2.00   Median :1.00  
##  Mean   :233   Mean   :0.497   Mean   :1.63   Mean   :1.39  
##  3rd Qu.:348   3rd Qu.:1.000   3rd Qu.:3.00   3rd Qu.:2.00  
##  Max.   :470   Max.   :1.000   Max.   :3.00   Max.   :3.00  
##       f1c            f1d      
##  Min.   :0.00   Min.   :0.00  
##  1st Qu.:1.00   1st Qu.:0.00  
##  Median :2.00   Median :1.00  
##  Mean   :1.92   Mean   :1.56  
##  3rd Qu.:3.00   3rd Qu.:3.00  
##  Max.   :3.00   Max.   :3.00

Display the first 3 rows of newds.

head(newds, n=3)

##   cesd female i1 i2 id treat f1a f1b f1c f1d f1e f1f f1g f1h f1i f1j
## 1   49      0 13 26  1     1   3   2   3   0   2   3   3   0   2   3
## 2   30      0 56 62  2     1   3   2   0   3   3   2   0   0   3   0
## 3   39      0  0  0  3     0   3   2   3   0   2   2   1   3   2   3
##   f1k f1l f1m f1n f1o f1p f1q f1r f1s f1t
## 1   3   0   1   2   2   2   2   3   3   2
## 2   3   0   0   3   0   0   0   2   0   0
## 3   1   0   1   3   2   0   0   3   2   0

Give comment to the dataframe “newds” and save as an R object “ds” to the file “savedfile”

comment(newds) = "HELP baseline dataset"
comment(newds)

## [1] "HELP baseline dataset"

save(ds, file="savedfile")

save “ds” as a csv file

write.csv(ds, file="ds.csv")

load the package “foreign”
The foreign package contains functions that will allow you to import data files from some of the most commonly used statistical software packages such as SAS, Stata and SPSS.
Exports data frames “newds” to other statistical packages “SAS” by writing the data as free-format text and writing a separate file of instructions for the other package to read the data.

library(foreign)
write.foreign(newds, "file.dat", "file.sas", package="SAS")

display 10 rows of the variable “cesd” in the dataframe “newds”

with(newds, cesd[1:10])

##  [1] 49 30 39 15 39  6 52 32 50 46

with(newds, head(cesd, 10))

##  [1] 49 30 39 15 39  6 52 32 50 46

display the data which exceeds 56 of the variable “cesd” in the dataframe “newds”

with(newds, cesd[cesd > 56])

## [1] 57 58 57 60 58 58 57

load the package “dplyr”
pick rows which the value exceeds 56 from the column “cesd” and then pick the columns “id” and “cesd”

library(dplyr)
filter(newds, cesd > 56) %>% select(id, cesd)

##    id cesd
## 1  71   57
## 2 127   58
## 3 200   57
## 4 228   60
## 5 273   58
## 6 351   58
## 7  13   57

Sort (or order) newds$cesd into ascending order and display the first 4 elements.

2.show Which cells have the minumum / maximum value of newds$cesd

with(newds, sort(cesd)[1:4])

## [1] 1 3 3 4

with(newds, which.min(cesd))

## [1] 199

load the package “mosaic”
Count the missing values of newds$f1g
“favstats” is short for “favorite statistics” which comes from “mosaic” package: it will give you the some of the most popular summary statistics for numerical variables.

library(mosaic)
tally(~ is.na(f1g), data=newds)

## is.na(f1g)
##  TRUE FALSE 
##     1   452

favstats(~ f1g, data=newds)

##  min Q1 median Q3 max mean  sd   n missing
##    0  1      2  3   3 1.73 1.1 452       1

select the columns
reverse code “f1d, f1h, f1l and f1p” by subtracting all the cells of those identified columns from the maximum number of your scale
combine the columns
rename it cesditems
count the sum of missing values on the row of cesditems.
duplicate cesditems and name is nceditems.
Fill in the missing values with 0
returns the sum of all the values of the row in ncesidtems.
mean impuataion for missing data

# reverse code f1d, f1h, f1l and f1p
cesditems = with(newds, cbind(f1a, f1b, f1c, (3 - f1d), f1e, f1f, f1g, 
   (3 - f1h), f1i, f1j, f1k, (3 - f1l), f1m, f1n, f1o, (3 - f1p), 
   f1q, f1r, f1s, f1t))
nmisscesd = apply(is.na(cesditems), 1, sum)
ncesditems = cesditems
ncesditems[is.na(cesditems)] = 0
newcesd = apply(ncesditems, 1, sum)
imputemeancesd = 20/(20-nmisscesd)*newcesd

select columns and turn it to a dataframe

data.frame(newcesd, newds$cesd, nmisscesd, imputemeancesd)[nmisscesd>0,]

##     newcesd newds.cesd nmisscesd imputemeancesd
## 4        15         15         1           15.8
## 17       19         19         1           20.0
## 87       44         44         1           46.3
## 101      17         17         1           17.9
## 154      29         29         1           30.5
## 177      44         44         1           46.3
## 229      39         39         1           41.1

add new variable “drinkstat” with three levels which indicate different drinking status.

library(dplyr)
library(memisc)
library(lattice)
library(MASS)
newds = mutate(newds, drinkstat= 
  cases(
    "abstinent" = i1==0,
    "moderate" = (i1>0 & i1<=1 & i2<=3 & female==1) |
               (i1>0 & i1<=2 & i2<=4 & female==0),
    "highrisk" = ((i1>1 | i2>3) & female==1) |
               ((i1>2 | i2>4) & female==0)))

load the package “mosaic”. Don’t show the code.
remove the package “memisc”
remove the package “MASS”

## ----echo=FALSE----------------------------------------------------------
library(mosaic)

## ----echo=FALSE----------------------------------------------------------
detach(package:memisc)
detach(package:MASS)

1.pick colums i1,i2, female, drinkstat from the dataframe “newds” and name the new dataframe “tmpds”

3.display row365-row370

library(dplyr)
tmpds = select(newds, i1, i2, female, drinkstat)
tmpds[365:370,]

##     i1 i2 female drinkstat
## 365  6 24      0  highrisk
## 366  6  6      0  highrisk
## 367  0  0      0 abstinent
## 368  0  0      1 abstinent
## 369  8  8      0  highrisk
## 370 32 32      0  highrisk

pick rows that matches the condition that the drinkstat is moderate and is a female

library(dplyr)
filter(tmpds, drinkstat=="moderate" & female==1)

##   i1 i2 female drinkstat
## 1  1  1      1  moderate
## 2  1  3      1  moderate
## 3  1  2      1  moderate
## 4  1  1      1  moderate
## 5  1  1      1  moderate
## 6  1  1      1  moderate
## 7  1  1      1  moderate

load the package “gmodels”
The CrossTable( ) function in the gmodels package produces crosstabulations modeled after PROC FREQ in SAS or CROSSTABS in SPSS.
Display the frequency distribution table of the severity of drinking.

library(gmodels)
with(tmpds, CrossTable(drinkstat))

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  453 
## 
##  
##           | abstinent |  moderate |  highrisk | 
##           |-----------|-----------|-----------|
##           |        68 |        28 |       357 | 
##           |     0.150 |     0.062 |     0.788 | 
##           |-----------|-----------|-----------|
## 
## 
## 
##

Display the frequency distribution table in two variables, the severity of drinking and gender.

with(tmpds, CrossTable(drinkstat, female, 
                       prop.t=FALSE, prop.c=FALSE, prop.chisq=FALSE))

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  453 
## 
##  
##              | female 
##    drinkstat |         0 |         1 | Row Total | 
## -------------|-----------|-----------|-----------|
##    abstinent |        42 |        26 |        68 | 
##              |     0.618 |     0.382 |     0.150 | 
## -------------|-----------|-----------|-----------|
##     moderate |        21 |         7 |        28 | 
##              |     0.750 |     0.250 |     0.062 | 
## -------------|-----------|-----------|-----------|
##     highrisk |       283 |        74 |       357 | 
##              |     0.793 |     0.207 |     0.788 | 
## -------------|-----------|-----------|-----------|
## Column Total |       346 |       107 |       453 | 
## -------------|-----------|-----------|-----------|
## 
##

add a new variable “gender” with the existing variable “female”
create a frequency distribution table

newds = transform(newds, 
  gender=factor(female, c(0,1), c("Male","Female")))
tally(~ female + gender, margin=FALSE, data=newds)

##       gender
## female Male Female
##      0  346      0
##      1    0    107

reorder the row“cesd, i1” by ascending order
show the columns “cesd” “i1” “id” by 5 rows.

library(dplyr)
newds = arrange(ds, cesd, i1)
newds[1:5, c("cesd", "i1", "id")]

##   cesd i1  id
## 1    1  3 233
## 2    3  1 139
## 3    3 13 418
## 4    4  4 251
## 5    4  9  95

pick rows under the conditions that is a female and Compute the mean of CESD score in female subset.

library(dplyr)
females = filter(ds, female==1)
with(females, mean(cesd))

## [1] 36.9

# an alternative approach
mean(ds$cesd[ds$female==1])

## [1] 36.9

The mosaic package makes several summary statistic functions (like mean and sd) formula aware.
compute mean CESD score of male and female

with(ds, tapply(cesd, female, mean))

##    0    1 
## 31.6 36.9

library(mosaic)
mean(cesd ~ female, data=ds)

##    0    1 
## 31.6 36.9

3.30Data Wrangling homework4~6

Chen Meng-ting(Eunice Chen)