Homework4
Reverse the order of input to the series of dplyr::*_join examples using data from the Nobel laureates in literature and explain the resulting output.
library(dplyr)
They share “Year” on the column.
nbl_c <- read.table("C:/Users/user/Desktop/nobel_countries.txt", h = T)
nbl_w <- read.table("C:/Users/user/Desktop/nobel_winners.txt", h = T)
head(nbl_c)
## Country Year
## 1 France 2014
## 2 UK 1950
## 3 UK 2017
## 4 US 2016
## 5 Canada 2013
## 6 China 2012
head(nbl_w)
## Name Gender Year
## 1 Patrick Modiano Male 2014
## 2 Bertrand Russell Male 1950
## 3 Kazuo Ishiguro Male 2017
## 4 Bob Dylan Male 2016
## 5 Alice Munro Female 2013
## 6 Mo Yan Male 2012
inner join The column “Year” is the column both datasets share.
We merge two datasets by “Year” and select only rows that match with two datasets is an “inner join”
we keep all columns from “nbl_w” and “nbl_c.”
dplyr::inner_join(nbl_c,nbl_w)%>% arrange(Year)
## Joining, by = "Year"
## Country Year Name Gender
## 1 UK 1950 Bertrand Russell Male
## 2 China 2012 Mo Yan Male
## 3 Canada 2013 Alice Munro Female
## 4 France 2014 Patrick Modiano Male
## 5 US 2016 Bob Dylan Male
## 6 UK 2017 Kazuo Ishiguro Male
semi join Return all rows from “nbl_c” where there are matching values in nbl_w.
With semi join, we keeping just columns from nbl_c.
Comparing with inner join, we omit the column “Name” and “Gender”
semi_join(nbl_c,nbl_w)%>% arrange(Year)
## Joining, by = "Year"
## Country Year
## 1 UK 1950
## 2 China 2012
## 3 Canada 2013
## 4 France 2014
## 5 US 2016
## 6 UK 2017
left_join(nbl_c, nbl_w) The order of input is nbl_c then nbl_w, reverse order from teacher’s example.
Include everything on the left (what was the nbl_c data frame ) and all rows that match from the right (nbl_w) data frame.
Keep all rows from nbl_c , and all columns from nbl_c and nbl_w.(Year, Gender, Name. Country). All combination of the matches are returned.
left_join(nbl_c,nbl_w)%>% arrange(Year)
## Joining, by = "Year"
## Country Year Name Gender
## 1 UK 1950 Bertrand Russell Male
## 2 Sweden 2011 <NA> <NA>
## 3 China 2012 Mo Yan Male
## 4 Canada 2013 Alice Munro Female
## 5 France 2014 Patrick Modiano Male
## 6 Russia 2015 <NA> <NA>
## 7 US 2016 Bob Dylan Male
## 8 UK 2017 Kazuo Ishiguro Male
anti_join Finding rows that didn’t match is an anti join.
Return all rows from nbl_c where there are not matching values in nbl_w, keeping just columns from nbl_c.(Year, Country).
The result is the opposite of inner join
anti_join(nbl_c,nbl_w)
## Joining, by = "Year"
## Country Year
## 1 Russia 2015
## 2 Sweden 2011
full_join
Return all rows and all columns from both nbl_c and nbl_w.
The result will be the same to basic R (merge() with all = TRUE)
full_join(nbl_c, nbl_w)
## Joining, by = "Year"
## Country Year Name Gender
## 1 France 2014 Patrick Modiano Male
## 2 UK 1950 Bertrand Russell Male
## 3 UK 2017 Kazuo Ishiguro Male
## 4 US 2016 Bob Dylan Male
## 5 Canada 2013 Alice Munro Female
## 6 China 2012 Mo Yan Male
## 7 Russia 2015 <NA> <NA>
## 8 Sweden 2011 <NA> <NA>
## 9 <NA> 1938 Pearl Buck Female
Homework5
Augment the data object in the ‘SAT’ lecture note with state.division{datasets}. For each of the 9 divisions, find the slope estimate for regressing average SAT scores onto average teacher’s salary. How many of them are of negative signs?
library(datasets)
library(lattice)
fL <- "http://www.amstat.org/publications/jse/datasets/sat.dat.txt"
dta <- read.table(fL, row.names=1)
head(dta)
## V2 V3 V4 V5 V6 V7 V8
## Alabama 4.405 17.2 31.144 8 491 538 1029
## Alaska 8.963 17.6 47.951 47 445 489 934
## Arizona 4.778 19.3 32.175 27 448 496 944
## Arkansas 4.459 17.1 28.934 6 482 523 1005
## California 4.992 24.0 41.078 45 417 485 902
## Colorado 5.443 18.4 34.571 29 462 518 980
Re-code variable names for clarity
names(dta) <- c("Spending", "PTR", "Salary", "PE", "Verbal", "Math", "SAT")
Augment data with the built-in state.division{datasets}
There are 9 state divisions with the numbers of schools in each division.
dta$divisions <- state.division
with(dta, table(divisions))
## divisions
## New England Middle Atlantic South Atlantic East South Central
## 6 3 8 4
## West South Central East North Central West North Central Mountain
## 4 5 7 8
## Pacific
## 5
lattice::xyplot(SAT ~ Salary, groups=divisions, data=dta, type=c('r', 'g'), auto.key=list(columns=3))
dta %>%
group_by(divisions) %>%
dplyr::summarize(r=cor(SAT, Salary))
## # A tibble: 9 x 2
## divisions r
## <fct> <dbl>
## 1 New England -0.0830
## 2 Middle Atlantic 0.662
## 3 South Atlantic 0.489
## 4 East South Central -0.372
## 5 West South Central -0.884
## 6 East North Central 0.524
## 7 West North Central -0.206
## 8 Mountain -0.729
## 9 Pacific 0.0649
conclusion There are five divisions that average SAT scores onto average teacher’s salary are negative sins which includes New England, East South Central, West South Central, West North Central and mountain.
Homework6
The following R script is used to manage the data file at the initial stage of investigation. Provide comments on what each line of the script is meant to achieve.
run the code(eval=TRUE) but not display the code(echo=FALSE) in the output file
a string setting the prompt used for lines which continue more than one line.
The digits option controls how many digits are printed (3)
The width option says how many characters R should put on a line.
# echo=FALSE,eval=TRUE
options(continue=" ")
options(digits=3)
options(width=72) # narrow output
read the data from an online file
load “dplyr”
select some variables and name the dataframe “newds”
ds = read.csv("http://www.amherst.edu/~nhorton/r2/datasets/help.csv")
library(dplyr)
newds = select(ds, cesd, female, i1, i2, id, treat, f1a, f1b, f1c, f1d, f1e,
f1f, f1g, f1h, f1i, f1j, f1k, f1l, f1m, f1n, f1o, f1p, f1q, f1r, f1s, f1t)
display the names of the dataframe newds
display the structural of the first 10 variables
names(newds)
## [1] "cesd" "female" "i1" "i2" "id" "treat" "f1a"
## [8] "f1b" "f1c" "f1d" "f1e" "f1f" "f1g" "f1h"
## [15] "f1i" "f1j" "f1k" "f1l" "f1m" "f1n" "f1o"
## [22] "f1p" "f1q" "f1r" "f1s" "f1t"
str(newds[,1:10]) # structure of the first 10 variables
## 'data.frame': 453 obs. of 10 variables:
## $ cesd : int 49 30 39 15 39 6 52 32 50 46 ...
## $ female: int 0 0 0 1 0 1 1 0 1 0 ...
## $ i1 : int 13 56 0 5 10 4 13 12 71 20 ...
## $ i2 : int 26 62 0 5 13 4 20 24 129 27 ...
## $ id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ treat : int 1 1 0 0 0 1 0 1 0 1 ...
## $ f1a : int 3 3 3 0 3 1 3 1 3 2 ...
## $ f1b : int 2 2 2 0 0 0 1 1 2 3 ...
## $ f1c : int 3 0 3 1 3 1 3 2 3 3 ...
## $ f1d : int 0 3 0 3 3 3 1 3 1 0 ...
summary of the first 10 variables
summary(newds[,1:10]) # summary of the first 10 variables
## cesd female i1 i2
## Min. : 1.0 Min. :0.000 Min. : 0.0 Min. : 0.0
## 1st Qu.:25.0 1st Qu.:0.000 1st Qu.: 3.0 1st Qu.: 3.0
## Median :34.0 Median :0.000 Median : 13.0 Median : 15.0
## Mean :32.8 Mean :0.236 Mean : 17.9 Mean : 22.6
## 3rd Qu.:41.0 3rd Qu.:0.000 3rd Qu.: 26.0 3rd Qu.: 32.0
## Max. :60.0 Max. :1.000 Max. :142.0 Max. :184.0
## id treat f1a f1b
## Min. : 1 Min. :0.000 Min. :0.00 Min. :0.00
## 1st Qu.:119 1st Qu.:0.000 1st Qu.:1.00 1st Qu.:0.00
## Median :233 Median :0.000 Median :2.00 Median :1.00
## Mean :233 Mean :0.497 Mean :1.63 Mean :1.39
## 3rd Qu.:348 3rd Qu.:1.000 3rd Qu.:3.00 3rd Qu.:2.00
## Max. :470 Max. :1.000 Max. :3.00 Max. :3.00
## f1c f1d
## Min. :0.00 Min. :0.00
## 1st Qu.:1.00 1st Qu.:0.00
## Median :2.00 Median :1.00
## Mean :1.92 Mean :1.56
## 3rd Qu.:3.00 3rd Qu.:3.00
## Max. :3.00 Max. :3.00
Display the first 3 rows of newds.
head(newds, n=3)
## cesd female i1 i2 id treat f1a f1b f1c f1d f1e f1f f1g f1h f1i f1j
## 1 49 0 13 26 1 1 3 2 3 0 2 3 3 0 2 3
## 2 30 0 56 62 2 1 3 2 0 3 3 2 0 0 3 0
## 3 39 0 0 0 3 0 3 2 3 0 2 2 1 3 2 3
## f1k f1l f1m f1n f1o f1p f1q f1r f1s f1t
## 1 3 0 1 2 2 2 2 3 3 2
## 2 3 0 0 3 0 0 0 2 0 0
## 3 1 0 1 3 2 0 0 3 2 0
Give comment to the dataframe “newds” and save as an R object “ds” to the file “savedfile”
comment(newds) = "HELP baseline dataset"
comment(newds)
## [1] "HELP baseline dataset"
save(ds, file="savedfile")
save “ds” as a csv file
write.csv(ds, file="ds.csv")
load the package “foreign”
The foreign package contains functions that will allow you to import data files from some of the most commonly used statistical software packages such as SAS, Stata and SPSS.
Exports data frames “newds” to other statistical packages “SAS” by writing the data as free-format text and writing a separate file of instructions for the other package to read the data.
library(foreign)
write.foreign(newds, "file.dat", "file.sas", package="SAS")
display 10 rows of the variable “cesd” in the dataframe “newds”
with(newds, cesd[1:10])
## [1] 49 30 39 15 39 6 52 32 50 46
with(newds, head(cesd, 10))
## [1] 49 30 39 15 39 6 52 32 50 46
display the data which exceeds 56 of the variable “cesd” in the dataframe “newds”
with(newds, cesd[cesd > 56])
## [1] 57 58 57 60 58 58 57
load the package “dplyr”
pick rows which the value exceeds 56 from the column “cesd” and then pick the columns “id” and “cesd”
library(dplyr)
filter(newds, cesd > 56) %>% select(id, cesd)
## id cesd
## 1 71 57
## 2 127 58
## 3 200 57
## 4 228 60
## 5 273 58
## 6 351 58
## 7 13 57
2.show Which cells have the minumum / maximum value of newds$cesd
with(newds, sort(cesd)[1:4])
## [1] 1 3 3 4
with(newds, which.min(cesd))
## [1] 199
load the package “mosaic”
Count the missing values of newds$f1g
“favstats” is short for “favorite statistics” which comes from “mosaic” package: it will give you the some of the most popular summary statistics for numerical variables.
library(mosaic)
tally(~ is.na(f1g), data=newds)
## is.na(f1g)
## TRUE FALSE
## 1 452
favstats(~ f1g, data=newds)
## min Q1 median Q3 max mean sd n missing
## 0 1 2 3 3 1.73 1.1 452 1
select the columns
reverse code “f1d, f1h, f1l and f1p” by subtracting all the cells of those identified columns from the maximum number of your scale
combine the columns
rename it cesditems
count the sum of missing values on the row of cesditems.
duplicate cesditems and name is nceditems.
Fill in the missing values with 0
returns the sum of all the values of the row in ncesidtems.
mean impuataion for missing data
# reverse code f1d, f1h, f1l and f1p
cesditems = with(newds, cbind(f1a, f1b, f1c, (3 - f1d), f1e, f1f, f1g,
(3 - f1h), f1i, f1j, f1k, (3 - f1l), f1m, f1n, f1o, (3 - f1p),
f1q, f1r, f1s, f1t))
nmisscesd = apply(is.na(cesditems), 1, sum)
ncesditems = cesditems
ncesditems[is.na(cesditems)] = 0
newcesd = apply(ncesditems, 1, sum)
imputemeancesd = 20/(20-nmisscesd)*newcesd
select columns and turn it to a dataframe
data.frame(newcesd, newds$cesd, nmisscesd, imputemeancesd)[nmisscesd>0,]
## newcesd newds.cesd nmisscesd imputemeancesd
## 4 15 15 1 15.8
## 17 19 19 1 20.0
## 87 44 44 1 46.3
## 101 17 17 1 17.9
## 154 29 29 1 30.5
## 177 44 44 1 46.3
## 229 39 39 1 41.1
library(dplyr)
library(memisc)
library(lattice)
library(MASS)
newds = mutate(newds, drinkstat=
cases(
"abstinent" = i1==0,
"moderate" = (i1>0 & i1<=1 & i2<=3 & female==1) |
(i1>0 & i1<=2 & i2<=4 & female==0),
"highrisk" = ((i1>1 | i2>3) & female==1) |
((i1>2 | i2>4) & female==0)))
load the package “mosaic”. Don’t show the code.
remove the package “memisc”
remove the package “MASS”
## ----echo=FALSE----------------------------------------------------------
library(mosaic)
## ----echo=FALSE----------------------------------------------------------
detach(package:memisc)
detach(package:MASS)
1.pick colums i1,i2, female, drinkstat from the dataframe “newds” and name the new dataframe “tmpds”
3.display row365-row370
library(dplyr)
tmpds = select(newds, i1, i2, female, drinkstat)
tmpds[365:370,]
## i1 i2 female drinkstat
## 365 6 24 0 highrisk
## 366 6 6 0 highrisk
## 367 0 0 0 abstinent
## 368 0 0 1 abstinent
## 369 8 8 0 highrisk
## 370 32 32 0 highrisk
pick rows that matches the condition that the drinkstat is moderate and is a female
library(dplyr)
filter(tmpds, drinkstat=="moderate" & female==1)
## i1 i2 female drinkstat
## 1 1 1 1 moderate
## 2 1 3 1 moderate
## 3 1 2 1 moderate
## 4 1 1 1 moderate
## 5 1 1 1 moderate
## 6 1 1 1 moderate
## 7 1 1 1 moderate
library(gmodels)
with(tmpds, CrossTable(drinkstat))
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 453
##
##
## | abstinent | moderate | highrisk |
## |-----------|-----------|-----------|
## | 68 | 28 | 357 |
## | 0.150 | 0.062 | 0.788 |
## |-----------|-----------|-----------|
##
##
##
##
Display the frequency distribution table in two variables, the severity of drinking and gender.
with(tmpds, CrossTable(drinkstat, female,
prop.t=FALSE, prop.c=FALSE, prop.chisq=FALSE))
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## |-------------------------|
##
##
## Total Observations in Table: 453
##
##
## | female
## drinkstat | 0 | 1 | Row Total |
## -------------|-----------|-----------|-----------|
## abstinent | 42 | 26 | 68 |
## | 0.618 | 0.382 | 0.150 |
## -------------|-----------|-----------|-----------|
## moderate | 21 | 7 | 28 |
## | 0.750 | 0.250 | 0.062 |
## -------------|-----------|-----------|-----------|
## highrisk | 283 | 74 | 357 |
## | 0.793 | 0.207 | 0.788 |
## -------------|-----------|-----------|-----------|
## Column Total | 346 | 107 | 453 |
## -------------|-----------|-----------|-----------|
##
##
newds = transform(newds,
gender=factor(female, c(0,1), c("Male","Female")))
tally(~ female + gender, margin=FALSE, data=newds)
## gender
## female Male Female
## 0 346 0
## 1 0 107
reorder the row“cesd, i1” by ascending order
show the columns “cesd” “i1” “id” by 5 rows.
library(dplyr)
newds = arrange(ds, cesd, i1)
newds[1:5, c("cesd", "i1", "id")]
## cesd i1 id
## 1 1 3 233
## 2 3 1 139
## 3 3 13 418
## 4 4 4 251
## 5 4 9 95
pick rows under the conditions that is a female and Compute the mean of CESD score in female subset.
library(dplyr)
females = filter(ds, female==1)
with(females, mean(cesd))
## [1] 36.9
# an alternative approach
mean(ds$cesd[ds$female==1])
## [1] 36.9
The mosaic package makes several summary statistic functions (like mean and sd) formula aware.
compute mean CESD score of male and female
with(ds, tapply(cesd, female, mean))
## 0 1
## 31.6 36.9
library(mosaic)
mean(cesd ~ female, data=ds)
## 0 1
## 31.6 36.9