Rename the starter file under the analysis directory as hw_01_yourname.Rmd and use it for your solutions.
1. Modify the “author” field in the YAML header.
2. Stage and Commit R Markdown and HTML files (no PDF files).
3. Push both .Rmd and HTML files to GitHub.
- Make sure you have knitted to HTML prior to staging, committing, and pushing your final submission.
4. Commit each time you answer a part of a question, e.g., 1.1.
5. Push to GitHub after each major question, e.g., Scrabble and Civil War Battles.
- Committing and pushing are graded elements for this homework.
6. When complete, submit a response in Canvas.
Loading Libraries
library(readr)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.3 ✓ dplyr 1.0.6
## ✓ tibble 3.1.2 ✓ stringr 1.4.0
## ✓ tidyr 1.1.3 ✓ forcats 0.5.1
## ✓ purrr 0.3.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(tidyr)
library(ggthemes)
library(forcats)
For this exercise, we are using the Collins Scrabble Words, which is most commonly used outside of the United States. The dictionary most often used in the United States is the Tournament Word List.
WARNING: Do not try str_view() or str_view_all() on these data. It will stall your computer.
scrabble <- read_tsv(file = "../data/words.txt")
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## word = col_character()
## )
scrabble %>%
# remove rows where word is NA
filter(!is.na(word)) -> scrabble_01
head(scrabble_01)
## # A tibble: 6 x 1
## word
## <chr>
## 1 AA
## 2 AAH
## 3 AAHED
## 4 AAHING
## 5 AAHS
## 6 AAL
scrabble_01 %>%
# count the length of word, and count "X" in each word
mutate(Xcount = str_count(word, "X"),
count = str_length(word)) -> scrabble_02
scrabble_02 %>%
arrange(desc(Xcount)) %>%
filter(Xcount == max(scrabble_02$Xcount)) %>%
arrange(desc(count)) -> six_longest
head(six_longest)
## # A tibble: 6 x 3
## word Xcount count
## <chr> <int> <int>
## 1 COEXECUTRIXES 2 13
## 2 EXTRATEXTUAL 2 12
## 3 COEXECUTRIX 2 11
## 4 EXECUTRIXES 2 11
## 5 SAXITOXINS 2 10
## 6 XANTHOXYLS 2 10
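The two counting helpers used above can be checked on a few words by hand. This is a minimal sketch; `toy_words` is a hypothetical vector made up for illustration and is not read from the Scrabble file.

```r
library(stringr)

# Hypothetical toy words, not taken from words.txt
toy_words <- c("AX", "EXTRATEXTUAL", "ZEBRA")

# str_count() returns the number of pattern matches in each string
x_count <- str_count(toy_words, "X")

# str_length() returns the number of characters in each string
n_letters <- str_length(toy_words)

data.frame(word = toy_words, Xcount = x_count, count = n_letters)
```

Sorting by `Xcount` first and by `count` second, as in the pipeline above, then ranks the words with the most X's by length.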
scrabble_02 %>%
# flag whether the word length is even or odd
mutate(odd_even = if_else(str_length(word) %% 2 == 0, "even", "odd")) %>%
# first half; for odd-length words the middle letter is left out
mutate(head_h = if_else(odd_even == "even",
str_sub(word, 1, count/2), str_sub(word, 1, floor(count/2)))) %>%
# second half; for odd-length words it starts just after the middle letter
mutate(tail_h = if_else(odd_even == "even",
str_sub(word, (count/2)+1, count), str_sub(word, ceiling(count/2)+1, count))) %>%
# TRUE when the two halves are identical, FALSE otherwise
mutate(Halves = head_h == tail_h) -> scrabble_03
# count the words whose halves match
scrabble_03 %>%
filter(Halves == TRUE) %>%
nrow()
## [1] 254
# scrabble_02 %>% arrange(desc(count)) shows that the longest word has 15 letters,
# so each half has at most 7 letters
# ( ) = capture group
# .? = zero or one of any character (the optional middle letter)
# . = any single character
# \\1 = backreference to the text matched by the first group
scrabble_03 %>%
filter(str_detect(word, "^(.|..|...|....|.....|......|.......).?\\1$")) -> Two_Halves
nrow(Two_Halves)
## [1] 254
Two_Halves %>%
arrange(desc(count)) %>%
head(1)
## # A tibble: 1 x 7
## word Xcount count odd_even head_h tail_h Halves
## <chr> <int> <int> <chr> <chr> <chr> <lgl>
## 1 CHIQUICHIQUI 0 12 even CHIQUI CHIQUI TRUE
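The backreference approach can be tried on a handful of words to see which shapes it accepts. The sketch below uses a shorter but equivalent-in-spirit pattern, `^(.+).?\\1$`, and a hypothetical `toy_words` vector; both are assumptions for illustration rather than part of the original solution.

```r
library(stringr)

# Hypothetical toy words chosen to exercise each case
toy_words <- c("MAMA", "TAT", "LEVEL", "HOTSHOT", "APPLE")

# (.+)  captures the first half (one or more characters)
# .?    allows an optional middle character for odd-length words
# \\1   requires the rest of the word to repeat the captured first half
same_halves <- str_detect(toy_words, "^(.+).?\\1$")

toy_words[same_halves]
```

"LEVEL" is a palindrome, but its halves "LE" and "EL" differ, so it is correctly rejected; the pattern tests for repeated halves, not mirrored ones.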
The data in “civil_war_theater.csv” contains information on American Civil War battles, taken from Wikipedia.
Variables include:
- Battle: The name of the battle.
- Date: The date(s) of the battle, in different formats depending upon the length of the battle.
- State: The state where the battle took place. Annotations (e.g. describing that the state was a territory at the time) are in parentheses.
- CWSAC: A rating of the military significance of the battle by the Civil War Sites Advisory Commission. A = Decisive, B = Major, C = Formative, D = Limited.
- Outcome: Usually "Confederate victory", "Union victory", or "Inconclusive", followed by notes.
- Theater: An attempt to identify which theater of war is most associated with the battle.
CivilWar <- read_csv(file = "../data/civil_war_theater.csv")
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## Battle = col_character(),
## Date = col_character(),
## State = col_character(),
## CWSAC = col_character(),
## Theater = col_character(),
## Outcome = col_character()
## )
head(CivilWar)
## # A tibble: 6 x 6
## Battle Date State CWSAC Theater Outcome
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Battle of Fort… July 11-… District … B Eastern Union victory: Failed Conf…
## 2 Battle of Hanc… January … Maryland D Eastern Inconclusive: Unsuccessful…
## 3 Battle of Sout… Septembe… Maryland B Eastern Union victory: McClellan d…
## 4 Battle of Anti… Septembe… Maryland A Eastern Tactically inconclusive; s…
## 5 Battle of Will… July 6-1… Maryland C Eastern Inconclusive: Meade and Le…
## 6 Battle of Boon… July 8, … Maryland D Eastern Inconclusive: Indecisive a…
The next several questions will help you take the dates from all the different formats and create a consistent set of start date and end date variables in the data frame. We will start by counting how many years and months appear in the date of each battle.
Create a character variable as follows. This can be used as a pattern in a regular expression.
year_regex <- stringr::str_c(1861:1865, collapse = "|")
year_regex
## [1] "1861|1862|1863|1864|1865"
Use year_regex to count the number of years in each battle, add this to the data frame, and save the data frame.
CivilWar %>%
mutate(count_Year = str_count(Date, year_regex)) %>%
select(Battle,count_Year, everything()) -> CivilWar_02
head(CivilWar_02)
## # A tibble: 6 x 7
## Battle count_Year Date State CWSAC Theater Outcome
## <chr> <int> <chr> <chr> <chr> <chr> <chr>
## 1 Battle of Fo… 1 July 11… Distric… B Eastern Union victory: Faile…
## 2 Battle of Ha… 1 January… Maryland D Eastern Inconclusive: Unsucc…
## 3 Battle of So… 1 Septemb… Maryland B Eastern Union victory: McCle…
## 4 Battle of An… 1 Septemb… Maryland A Eastern Tactically inconclus…
## 5 Battle of Wi… 1 July 6-… Maryland C Eastern Inconclusive: Meade …
## 6 Battle of Bo… 1 July 8,… Maryland D Eastern Inconclusive: Indeci…
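To see why the collapsed pattern counts years, it helps to run it on a few date strings directly. This is a small sketch; `toy_dates` holds hypothetical strings written in the styles that appear in the Date column, not rows copied from the data.

```r
library(stringr)

# The same alternation pattern built in the text above
year_regex <- str_c(1861:1865, collapse = "|")

# Hypothetical date strings in the three styles seen in the data
toy_dates <- c("July 8, 1863",
               "July 11-12, 1864",
               "December 31, 1862-January 2, 1863")

# Each alternative that matches is counted once per occurrence
n_years <- str_count(toy_dates, year_regex)
n_years
```

Only the last string mentions two years, so it is the only one with a count of 2.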
Consider R’s built-in vector of month names: month.name.
month.name
## [1] "January" "February" "March" "April" "May" "June"
## [7] "July" "August" "September" "October" "November" "December"
# collapse the month names into a single regex alternation
month_name <- str_c(month.name, collapse = "|")
print(month_name)
## [1] "January|February|March|April|May|June|July|August|September|October|November|December"
Use month.name to count the number of month names in the Date variable in each battle.
Add this to the data frame. (You might need to do something similar to what we did in part 2.)
CivilWar_02 %>%
mutate(count_Month = str_count(Date, month_name)) %>%
select(Battle,count_Year,count_Month, everything()) -> CivilWar_03
head(CivilWar_03)
## # A tibble: 6 x 8
## Battle count_Year count_Month Date State CWSAC Theater Outcome
## <chr> <int> <int> <chr> <chr> <chr> <chr> <chr>
## 1 Battle of… 1 1 July 1… Distr… B Eastern Union victory:…
## 2 Battle of… 1 1 Januar… Maryl… D Eastern Inconclusive: …
## 3 Battle of… 1 1 Septem… Maryl… B Eastern Union victory:…
## 4 Battle of… 1 1 Septem… Maryl… A Eastern Tactically inc…
## 5 Battle of… 1 1 July 6… Maryl… C Eastern Inconclusive: …
## 6 Battle of… 1 1 July 8… Maryl… D Eastern Inconclusive: …
Add a logical variable that is TRUE if Date spans multiple days and FALSE otherwise. Spanning multiple months and/or years also counts as TRUE.
CivilWar_03 %>%
# str_detect() returns TRUE when Date contains a hyphen, i.e. a date range
mutate(Multiple_days = str_detect(Date, "-")) %>%
select(Battle:count_Month, Multiple_days, everything()) -> CivilWar_04
head(CivilWar_04)
## # A tibble: 6 x 9
## Battle count_Year count_Month Multiple_days Date State CWSAC Theater Outcome
## <chr> <int> <int> <lgl> <chr> <chr> <chr> <chr> <chr>
## 1 Battle… 1 1 TRUE July… Dist… B Eastern Union …
## 2 Battle… 1 1 TRUE Janu… Mary… D Eastern Inconc…
## 3 Battle… 1 1 FALSE Sept… Mary… B Eastern Union …
## 4 Battle… 1 1 FALSE Sept… Mary… A Eastern Tactic…
## 5 Battle… 1 1 TRUE July… Mary… C Eastern Inconc…
## 6 Battle… 1 1 FALSE July… Mary… D Eastern Inconc…
# df of battles spanning just one day
CivilWar_04 %>%
filter(Multiple_days == FALSE) -> Battles_one_Day
head(Battles_one_Day)
## # A tibble: 6 x 9
## Battle count_Year count_Month Multiple_days Date State CWSAC Theater Outcome
## <chr> <int> <int> <lgl> <chr> <chr> <chr> <chr> <chr>
## 1 Battle… 1 1 FALSE Sept… Mary… B Eastern Union …
## 2 Battle… 1 1 FALSE Sept… Mary… A Eastern Tactic…
## 3 Battle… 1 1 FALSE July… Mary… D Eastern Inconc…
## 4 Battle… 1 1 FALSE July… Mary… B Eastern Confed…
## 5 Battle… 1 1 FALSE Augu… Mary… D Eastern Inconc…
## 6 Battle… 1 1 FALSE June… Penn… C Eastern Inconc…
# df of battles spanning multiple days in just one month
CivilWar_04 %>%
filter(Multiple_days == TRUE & count_Month == 1) -> Battles_one_Month
head(Battles_one_Month)
## # A tibble: 6 x 9
## Battle count_Year count_Month Multiple_days Date State CWSAC Theater Outcome
## <chr> <int> <int> <lgl> <chr> <chr> <chr> <chr> <chr>
## 1 Battle… 1 1 TRUE July… Dist… B Eastern Union …
## 2 Battle… 1 1 TRUE Janu… Mary… D Eastern Inconc…
## 3 Battle… 1 1 TRUE July… Mary… C Eastern Inconc…
## 4 Battle… 1 1 TRUE July… Penn… A Eastern Union …
## 5 Battle… 1 1 TRUE May … Virg… D Eastern Inconc…
## 6 Battle… 1 1 TRUE Marc… Virg… B Eastern Inconc…
# df of battles spanning multiple months but not multiple years
CivilWar_04 %>%
filter(Multiple_days == TRUE & count_Year == 1 & count_Month != 1) -> Battles_multiple_Months
head(Battles_multiple_Months)
## # A tibble: 6 x 9
## Battle count_Year count_Month Multiple_days Date State CWSAC Theater Outcome
## <chr> <int> <int> <lgl> <chr> <chr> <chr> <chr> <chr>
## 1 Battle… 1 2 TRUE May … Virg… D Eastern Inconc…
## 2 Siege … 1 2 TRUE Apri… Virg… B Eastern Inconc…
## 3 Battle… 1 2 TRUE May … Virg… B Eastern Inconc…
## 4 Battle… 1 2 TRUE Apri… Virg… C Eastern Inconc…
## 5 Battle… 1 2 TRUE Apri… Virg… C Eastern Inconc…
## 6 Battle… 1 2 TRUE Apri… Virg… A Eastern Confed…
# df of battles spanning multiple years
CivilWar_04 %>%
filter(count_Year != 1) -> Battles_multiple_Years
head(Battles_multiple_Years)
## # A tibble: 1 x 9
## Battle count_Year count_Month Multiple_days Date State CWSAC Theater Outcome
## <chr> <int> <int> <lgl> <chr> <chr> <chr> <chr> <chr>
## 1 Battle… 2 2 TRUE Dece… Tenn… A Western Union …
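The four filters above partition the battles by the shape of their date string. As an alternative illustration, the same rules can be written as a single `case_when()` label. This is a sketch under assumptions: `toy`, `n_year`, `n_month`, `multi_day`, and `span_type` are hypothetical names, and the dates are invented examples.

```r
library(dplyr)
library(stringr)

year_regex  <- str_c(1861:1865, collapse = "|")
month_regex <- str_c(month.name, collapse = "|")

# Hypothetical dates, one per category
toy <- tibble(Date = c("July 8, 1863",
                       "July 11-12, 1864",
                       "May 31-June 1, 1862",
                       "December 31, 1862-January 2, 1863"))

toy_typed <- toy %>%
  mutate(n_year    = str_count(Date, year_regex),
         n_month   = str_count(Date, month_regex),
         multi_day = str_detect(Date, "-"),
         # conditions are checked in order; the first TRUE one wins
         span_type = case_when(!multi_day  ~ "one day",
                               n_year > 1  ~ "multiple years",
                               n_month > 1 ~ "multiple months",
                               TRUE        ~ "one month"))

toy_typed$span_type
```

A single label column makes it easy to confirm with `count(span_type)` that every battle falls into exactly one group before the data frame is split.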
- Start should contain the start date of each battle.
- End should contain the end date of each battle.
- separate() is useful for splitting Date into its parts.
- Start and End should be Date class objects.
- Remove the Date variable from each data frame.
Battles_one_Day %>%
mutate(Start = mdy(Date), End = mdy(Date)) %>%
# drop the original Date column
select(-Date) -> Battles_one_Day_01
head(Battles_one_Day_01)
## # A tibble: 6 x 10
## Battle count_Year count_Month Multiple_days State CWSAC Theater Outcome
## <chr> <int> <int> <lgl> <chr> <chr> <chr> <chr>
## 1 Battle o… 1 1 FALSE Mary… B Eastern Union vict…
## 2 Battle o… 1 1 FALSE Mary… A Eastern Tactically…
## 3 Battle o… 1 1 FALSE Mary… D Eastern Inconclusi…
## 4 Battle o… 1 1 FALSE Mary… B Eastern Confederat…
## 5 Battle o… 1 1 FALSE Mary… D Eastern Inconclusi…
## 6 Battle o… 1 1 FALSE Penn… C Eastern Inconclusi…
## # … with 2 more variables: Start <date>, End <date>
Battles_one_Month %>%
separate(Date, c("tmp_1", "tmp_2", "tmp_3"), sep = " ") %>%
separate(tmp_2, c("startdate", "enddate"), sep = "-" ) %>%
# creating Start
mutate(Starttmp = paste(tmp_1, startdate, tmp_3),
Start = mdy(Starttmp)) %>%
# creating End
mutate(Endtmp = paste(tmp_1, enddate, tmp_3),
End = mdy(Endtmp)) %>%
# drop the temporary columns
select(-tmp_1, -startdate, -enddate, -tmp_3, -Starttmp, -Endtmp) -> Battles_one_Month_01
head(Battles_one_Month_01)
## # A tibble: 6 x 10
## Battle count_Year count_Month Multiple_days State CWSAC Theater Outcome
## <chr> <int> <int> <lgl> <chr> <chr> <chr> <chr>
## 1 Battle … 1 1 TRUE Distr… B Eastern Union vict…
## 2 Battle … 1 1 TRUE Maryl… D Eastern Inconclusi…
## 3 Battle … 1 1 TRUE Maryl… C Eastern Inconclusi…
## 4 Battle … 1 1 TRUE Penns… A Eastern Union vict…
## 5 Battle … 1 1 TRUE Virgi… D Eastern Inconclusi…
## 6 Battle … 1 1 TRUE Virgi… B Eastern Inconclusi…
## # … with 2 more variables: Start <date>, End <date>
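The two-step `separate()` used for same-month ranges is easier to follow on one or two strings. The sketch below repeats the idea with hypothetical column names (`month`, `days`, `year`, `first_day`, `last_day`) and invented dates; it is an illustration, not the original pipeline.

```r
library(dplyr)
library(tidyr)
library(lubridate)

# Hypothetical same-month date ranges
toy <- tibble(Date = c("July 11-12, 1864", "January 5-6, 1862"))

toy_parsed <- toy %>%
  # "July 11-12, 1864" -> "July" | "11-12," | "1864"
  separate(Date, c("month", "days", "year"), sep = " ") %>%
  # "11-12," -> "11" | "12,"
  separate(days, c("first_day", "last_day"), sep = "-") %>%
  # mdy() tolerates the stray comma left on last_day
  mutate(Start = mdy(paste(month, first_day, year)),
         End   = mdy(paste(month, last_day, year))) %>%
  select(Start, End)

toy_parsed
```

Because `mdy()` is lenient about separators, the pieces can be pasted back together without cleaning the trailing comma first.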
Battles_multiple_Months %>%
separate(Date, c("tmp_1", "tmp_2"), sep = "-") %>%
separate(tmp_2, c("tmp_3", "tmp_4"), sep = ",") %>%
# creating Start
mutate(Starttmp = paste(tmp_1, tmp_4),
Start = mdy(Starttmp)) %>%
# creating End
mutate(Endtmp = paste(tmp_3, tmp_4),
End = mdy(Endtmp)) %>%
# drop the temporary columns
select(-tmp_1, -tmp_3, -tmp_4, -Starttmp, -Endtmp) -> Battles_multiple_Months_01
head(Battles_multiple_Months_01)
## # A tibble: 6 x 10
## Battle count_Year count_Month Multiple_days State CWSAC Theater Outcome
## <chr> <int> <int> <lgl> <chr> <chr> <chr> <chr>
## 1 Battle o… 1 2 TRUE Virg… D Eastern Inconclusi…
## 2 Siege of… 1 2 TRUE Virg… B Eastern Inconclusi…
## 3 Battle o… 1 2 TRUE Virg… B Eastern Inconclusi…
## 4 Battle o… 1 2 TRUE Virg… C Eastern Inconclusi…
## 5 Battle o… 1 2 TRUE Virg… C Eastern Inconclusi…
## 6 Battle o… 1 2 TRUE Virg… A Eastern Confederat…
## # … with 2 more variables: Start <date>, End <date>
Battles_multiple_Years %>%
separate(Date, c("tmp_1", "tmp_2"), sep = "-") %>%
# creating Start and End
mutate(Start = mdy(tmp_1),
End = mdy(tmp_2)) %>%
# drop the temporary columns
select(-tmp_1, -tmp_2) -> Battles_multiple_Years_01
head(Battles_multiple_Years_01)
## # A tibble: 1 x 10
## Battle count_Year count_Month Multiple_days State CWSAC Theater Outcome
## <chr> <int> <int> <lgl> <chr> <chr> <chr> <chr>
## 1 Battle of… 2 2 TRUE Tenn… A Western Union vic…
## # … with 2 more variables: Start <date>, End <date>
Battles_one_Day_01 %>%
# combine the four data frames; joining on every column stacks their rows
full_join(Battles_one_Month_01,
by = c("Battle", "count_Year", "count_Month", "Multiple_days", "State", "CWSAC", "Theater", "Outcome", "Start", "End")) %>%
full_join(Battles_multiple_Months_01,
by = c("Battle", "count_Year", "count_Month", "Multiple_days", "State", "CWSAC", "Theater", "Outcome", "Start", "End")) %>%
full_join(Battles_multiple_Years_01,
by = c("Battle", "count_Year", "count_Month", "Multiple_days", "State", "CWSAC", "Theater", "Outcome", "Start", "End")) -> CivilWar_05 # all battle dates are now tidied
head(CivilWar_05)
## # A tibble: 6 x 10
## Battle count_Year count_Month Multiple_days State CWSAC Theater Outcome
## <chr> <int> <int> <lgl> <chr> <chr> <chr> <chr>
## 1 Battle o… 1 1 FALSE Mary… B Eastern Union vict…
## 2 Battle o… 1 1 FALSE Mary… A Eastern Tactically…
## 3 Battle o… 1 1 FALSE Mary… D Eastern Inconclusi…
## 4 Battle o… 1 1 FALSE Mary… B Eastern Confederat…
## 5 Battle o… 1 1 FALSE Mary… D Eastern Inconclusi…
## 6 Battle o… 1 1 FALSE Penn… C Eastern Inconclusi…
## # … with 2 more variables: Start <date>, End <date>
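When data frames share the same columns and hold different rows, stacking them row-wise is another way to combine them. The sketch below uses `bind_rows()` on two hypothetical toy frames (`one_day`, `multi_day`); it is offered as an alternative illustration and assumes, as in the split above, that no battle appears in more than one frame.

```r
library(dplyr)

# Hypothetical toy frames with identical columns
one_day <- tibble(Battle = "Toy battle A",
                  Start  = as.Date("1863-07-08"),
                  End    = as.Date("1863-07-08"))

multi_day <- tibble(Battle = "Toy battle B",
                    Start  = as.Date("1864-07-11"),
                    End    = as.Date("1864-07-12"))

# bind_rows() matches columns by name and appends the rows in order
all_battles <- bind_rows(one_day, multi_day)
all_battles
```

This avoids repeating the long `by =` vector for every join, at the cost of not checking for duplicated rows.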
The Siege of Port Hudson is the longest battle, lasting 50 days.
CivilWar_05 %>%
mutate(Span = (End - Start)+1) %>%
arrange(desc(Span)) -> CivilWar_06
head(CivilWar_06, 1)
## # A tibble: 1 x 11
## Battle count_Year count_Month Multiple_days State CWSAC Theater Outcome
## <chr> <int> <int> <lgl> <chr> <chr> <chr> <chr>
## 1 Siege o… 1 2 TRUE Loui… A Western Union victo…
## # … with 3 more variables: Start <date>, End <date>, Span <drtn>
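The `+ 1` in the Span calculation makes the count inclusive of both the first and last day. A minimal base R sketch with hypothetical dates shows the arithmetic:

```r
# Hypothetical start and end dates
start <- as.Date(c("1864-07-11", "1863-07-08"))
end   <- as.Date(c("1864-07-12", "1863-07-08"))

# Subtracting Dates gives a difftime in days
raw_diff <- end - start

# Add 1 so that a battle starting and ending on the same day counts as 1 day
span_days <- as.numeric(raw_diff) + 1
span_days
```

Without the `+ 1`, every single-day battle would have a span of 0 days.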
Create an appropriate plot.
Interpret the plot in one sentence to answer the question.
Extra Credit: Test for a linear relationship using lm() and interpret the results in one sentence based on the \(p\)-value and adjusted R-squared.
Interpretation: Battles rated Decisive tend to last the longest, whereas battles rated Limited tend to be the shortest.
CivilWar_06 %>%
# recode the CWSAC letter ratings to their descriptive labels
mutate(CWSAC = recode(CWSAC, "A" = "Decisive",
"B" = "Major",
"C" = "Formative",
"D" = "Limited"),
# convert Span from difftime to numeric so that scale_y_log10() works
Span = as.numeric(Span)) -> CivilWar_07
# starting to plot
CivilWar_07 %>%
# log10(0) is undefined, so add 1 to Span before applying the log scale
ggplot(mapping = aes(x = CWSAC, y = Span+1, fill = CWSAC))+
geom_boxplot()+
theme_bw()+
scale_y_log10()+
xlab("Civil War Sites Advisory Commission")+
ylab("Span days")
- Extra Credit: Test for a linear relationship using lm() and interpret the results in one sentence based on the \(p\)-value and adjusted R-squared.
CivilWar07_SLR <- lm(Span ~ CWSAC, CivilWar_07)
summary(CivilWar07_SLR)
##
## Call:
## lm(formula = Span ~ CWSAC, data = CivilWar_07)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.778 -1.808 -0.808 -0.680 43.222
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.7778 0.8625 7.858 4.06e-14 ***
## CWSACFormative -4.3920 1.0038 -4.375 1.57e-05 ***
## CWSACLimited -4.9907 1.0266 -4.861 1.71e-06 ***
## CWSACMajor -3.9701 1.0324 -3.845 0.000141 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.786 on 380 degrees of freedom
## Multiple R-squared: 0.0622, Adjusted R-squared: 0.0548
## F-statistic: 8.401 on 3 and 380 DF, p-value: 2.026e-05
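In the output above, the overall F-test p-value (2.026e-05) and the adjusted R-squared (0.0548) are the two numbers the question asks about: the rating is statistically associated with duration, but it accounts for only about 5% of the variation. Both values can also be pulled out of the summary object instead of being read off the printout. The sketch below uses R's built-in `PlantGrowth` data as a stand-in, which is an assumption for illustration and is unrelated to the battle data.

```r
# Stand-in model: numeric response regressed on a categorical predictor
fit <- lm(weight ~ group, data = PlantGrowth)
fit_summary <- summary(fit)

# Adjusted R-squared is stored directly on the summary object
adj_r2 <- fit_summary$adj.r.squared

# The overall p-value is not stored; compute it from the F statistic
f_stat  <- fit_summary$fstatistic
p_value <- pf(f_stat[["value"]], f_stat[["numdf"]], f_stat[["dendf"]],
              lower.tail = FALSE)

c(adj_r2 = adj_r2, p_value = p_value)
```

Extracting the values this way lets them be reported inline in the R Markdown text, so the interpretation stays in sync if the data change.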
Use coord_flip() to make horizontal boxplots.
CivilWar_07 %>%
# count the battles in each State and store the count in a new variable
group_by(State) %>%
mutate(num_Battles_States = n()) %>%
# keep only states with more than two battles
filter(num_Battles_States > 2) %>%
# drop the grouping before further edits
ungroup() -> CivilWar_08
CivilWar_08 %>%
# strip the parenthetical annotations from State names with regex
# reference: https://www.itread01.com/content/1548936390.html
mutate(State = str_replace_all(State,
"^North Dakota \\(Dakota Territory\\s\\sat the time\\)$",
"North Dakota"),
State = str_replace_all(State,
"^West Virginia \\(Virginia at the time\\)$",
"West Virginia"),
State = str_replace_all(State,
"^Oklahoma \\(Indian Territory at the time\\)$",
"Oklahoma")) %>%
# convert State to a factor so that forcats functions can be used later
mutate(State = as.factor(State)) -> CivilWar_09
# check that each State has been renamed successfully
unique(CivilWar_09$State)
## [1] Louisiana Mississippi Missouri North Carolina Virginia
## [6] Georgia South Carolina Alabama Tennessee Maryland
## [11] West Virginia Arkansas Kentucky Florida North Dakota
## [16] Oklahoma Texas Kansas
## 18 Levels: Alabama Arkansas Florida Georgia Kansas Kentucky ... West Virginia
CivilWar_09 %>%
# use fct_reorder to reorder the states by the start date.
ggplot(mapping = aes(x = fct_reorder(State, Start), y = Start, fill = Theater))+
# use `coord_flip()` to make horizontal boxplots.
coord_flip()+
geom_boxplot()+
labs(x = "State", y = "Start Date")
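`fct_reorder()` changes only the order of the factor levels, which is what controls the order of the boxes on the axis. By default it sorts levels by the median of the second argument within each level. A minimal sketch with hypothetical states and start dates:

```r
library(forcats)

# Hypothetical battles: a state and a start date for each
state <- factor(c("Virginia", "Texas", "Virginia", "Texas", "Georgia"))
start <- as.Date(c("1861-07-21", "1863-01-01", "1862-06-01",
                   "1865-05-12", "1864-05-01"))

# Default level order is alphabetical
levels(state)

# Reorder levels by the median start date within each state
state_by_start <- fct_reorder(state, start)
levels(state_by_start)
```

Virginia has the earliest median start date in this toy data and Georgia the latest, so the levels run Virginia, Texas, Georgia.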