Instructions

Rename the starter file under the analysis directory as hw_01_yourname.Rmd and use it for your solutions.
1. Modify the “author” field in the YAML header.
2. Stage and Commit R Markdown and HTML files (no PDF files).
3. Push both .Rmd and HTML files to GitHub.
- Make sure you have knitted to HTML prior to staging, committing, and pushing your final submission.
4. Commit each time you answer part of a question, e.g., 1.1.
5. Push to GitHub after each major question, e.g., Scrabble and Civil War Battles
- Committing and Pushing are graded elements for this homework.
6. When complete, submit a response in Canvas

Loading Libraries

library(readr)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.3     ✓ dplyr   1.0.6
## ✓ tibble  3.1.2     ✓ stringr 1.4.0
## ✓ tidyr   1.1.3     ✓ forcats 0.5.1
## ✓ purrr   0.3.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(tidyr)
library(ggthemes)
library(forcats)

1 Scrabble Words

For this exercise, we are using the Collins Scrabble Words, which is most commonly used outside of the United States. The dictionary most often used in the United States is the Tournament Word List.

WARNING: Do not try str_view() or str_view_all() on these data. It will stall your computer.

  1. Use a readr function to load the 2015 list of Collins Scrabble Words into R from your data folder or from https://data-science-master.github.io/lectures/data/words.txt
    • (note: “NA” is an official Scrabble word).
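# "NA" is a valid Scrabble word, so tell readr not to treat any string as missing
scrabble <- read_tsv(file = "../data/words.txt", na = character())
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   word = col_character()
## )
scrabble %>%
  # safety check only: no word should be missing after the read above
  filter(!is.na(word)) -> scrabble_01 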
head(scrabble_01)
## # A tibble: 6 x 1
##   word  
##   <chr> 
## 1 AA    
## 2 AAH   
## 3 AAHED 
## 4 AAHING
## 5 AAHS  
## 6 AAL
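
The handling of the word "NA" can be checked on a small stand-in file. The sketch below is illustrative only: the four-line word list is made up, and it assumes that passing na = character() to read_tsv() switches off missing-value detection, as the readr documentation describes.

```r
library(readr)

# Write a tiny, made-up word list to a temporary file.
tmp <- tempfile(fileext = ".txt")
writeLines(c("word", "MY", "NA", "NAB"), tmp)

# Default behaviour: the string "NA" is read as a missing value.
default_read <- read_tsv(tmp, col_types = "c")

# With na = character(), no string is treated as missing, so "NA" survives.
literal_read <- read_tsv(tmp, col_types = "c", na = character())

sum(is.na(default_read$word))   # number of words lost to NA conversion
"NA" %in% literal_read$word     # is the word "NA" still present?
```

If the first check reports a missing value and the second returns TRUE, the word "NA" is only kept when NA detection is turned off.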
  2. What are the six longest words that have the most “X”’s in them?
scrabble_01 %>%
  # count the length of word, and count "X" in each word
  mutate(Xcount = str_count(word, "X"),
         count = str_length(word)) -> scrabble_02

scrabble_02 %>%
  arrange(desc(Xcount)) %>%
  filter(Xcount == max(scrabble_02$Xcount)) %>%
  arrange(desc(count)) -> six_longest

head(six_longest)
## # A tibble: 6 x 3
##   word          Xcount count
##   <chr>          <int> <int>
## 1 COEXECUTRIXES      2    13
## 2 EXTRATEXTUAL       2    12
## 3 COEXECUTRIX        2    11
## 4 EXECUTRIXES        2    11
## 5 SAXITOXINS         2    10
## 6 XANTHOXYLS         2    10
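
The same count-then-filter idea can be tried on a handful of words. This is a minimal sketch on a made-up vector, not the Collins list; the object names toy and most_x are hypothetical.

```r
library(dplyr)
library(stringr)

# Made-up sample; the real list is far longer.
toy <- tibble(word = c("AXE", "EXTRATEXTUAL", "BOX", "SAXITOXINS", "CAT"))

toy %>%
  # number of "X" characters and total length of each word
  mutate(Xcount = str_count(word, "X"),
         count  = str_length(word)) %>%
  # keep only the words tied for the most "X"s
  filter(Xcount == max(Xcount)) %>%
  # longest first
  arrange(desc(count)) -> most_x

most_x
```

Comparing against max(Xcount) inside filter() avoids reaching back into the data frame with `$`.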
  3. How many words have an identical first and second half of the word? If a word has an odd number of letters, exclude the middle character.
scrabble_02 %>%
  # flag whether the word length is odd or even
  mutate(odd_even = if_else(count %% 2 == 0, "even", "odd")) %>%
  
  # first half; for odd lengths the middle character is excluded
  mutate(head_h = str_sub(word, 1, floor(count/2))) %>%

  # second half; for odd lengths it starts just after the middle character
  mutate(tail_h = str_sub(word, ceiling(count/2)+1, count)) %>%
  # TRUE when the two halves are identical, FALSE otherwise
  mutate(Halves = head_h == tail_h) -> scrabble_03

# count the rows where Halves is TRUE
scrabble_03 %>%
  filter(Halves == TRUE) %>%
  nrow()
## [1] 254
  # The longest word has 15 letters (check with scrabble_02 %>% arrange(desc(count))),
  # so each half has at most 7 letters.

  # Regex building blocks:
  # ( )  = capture group
  # .    = any single character
  # .?   = zero or one character (the optional middle letter)
  # \\1  = back-reference to the text matched by the first group

scrabble_03 %>% 
  filter(str_detect(word, "^(.|..|...|....|.....|......|.......).?\\1$")) -> Two_Halves

nrow(Two_Halves)
## [1] 254
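
The two approaches above can be cross-checked on a few words. The sketch below uses a made-up vector and writes the alternation as the equivalent quantifier .{1,7}; the object names are hypothetical.

```r
library(stringr)

# Made-up test words: the first four have identical halves, the last two do not.
words <- c("MURMUR", "BERIBERI", "HOTSHOT", "PAPA", "TESTS", "BANANA")

# Back-reference approach: group, optional middle character, same group again.
by_regex <- str_detect(words, "^(.{1,7}).?\\1$")

# Substring approach: compare the first and last floor(n / 2) characters.
n <- str_length(words)
half <- n %/% 2
by_substr <- str_sub(words, 1, half) == str_sub(words, n - half + 1, n)

by_regex
by_substr
```

Both vectors should agree element by element, which is why the two counts in the solution match.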
  4. Use the results from 3 to find the longest word with an identical first and second half of the word.
Two_Halves %>%
  arrange(desc(count)) %>%
  head(1)
## # A tibble: 1 x 7
##   word         Xcount count odd_even head_h tail_h Halves
##   <chr>         <int> <int> <chr>    <chr>  <chr>  <lgl> 
## 1 CHIQUICHIQUI      0    12 even     CHIQUI CHIQUI TRUE

2 Civil War Battles

The data in “civil_war_theater.csv” contains information on American Civil War battles, taken from Wikipedia.

Variables include:

  1. Use a readr function and relative paths to load the data into R.
CivilWar <- read_csv(file = "../data/civil_war_theater.csv")
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   Battle = col_character(),
##   Date = col_character(),
##   State = col_character(),
##   CWSAC = col_character(),
##   Theater = col_character(),
##   Outcome = col_character()
## )
head(CivilWar)
## # A tibble: 6 x 6
##   Battle          Date      State      CWSAC Theater Outcome                    
##   <chr>           <chr>     <chr>      <chr> <chr>   <chr>                      
## 1 Battle of Fort… July 11-… District … B     Eastern Union victory: Failed Conf…
## 2 Battle of Hanc… January … Maryland   D     Eastern Inconclusive: Unsuccessful…
## 3 Battle of Sout… Septembe… Maryland   B     Eastern Union victory: McClellan d…
## 4 Battle of Anti… Septembe… Maryland   A     Eastern Tactically inconclusive; s…
## 5 Battle of Will… July 6-1… Maryland   C     Eastern Inconclusive: Meade and Le…
## 6 Battle of Boon… July 8, … Maryland   D     Eastern Inconclusive: Indecisive a…

The next several questions will help you take the dates from all the different formats and create a consistent set of start date and end date variables in the data frame. We will start by counting how many years and months appear in the date of each battle.

  2. Add a variable to the data frame with the number of years for each battle.
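# year_regex is not defined in the code shown; it is assumed here to match any four-digit year
year_regex <- "\\d{4}"

CivilWar %>%
  mutate(count_Year = str_count(Date, year_regex)) %>%
  select(Battle, count_Year, everything()) -> CivilWar_02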

head(CivilWar_02)
## # A tibble: 6 x 7
##   Battle        count_Year Date     State    CWSAC Theater Outcome              
##   <chr>              <int> <chr>    <chr>    <chr> <chr>   <chr>                
## 1 Battle of Fo…          1 July 11… Distric… B     Eastern Union victory: Faile…
## 2 Battle of Ha…          1 January… Maryland D     Eastern Inconclusive: Unsucc…
## 3 Battle of So…          1 Septemb… Maryland B     Eastern Union victory: McCle…
## 4 Battle of An…          1 Septemb… Maryland A     Eastern Tactically inconclus…
## 5 Battle of Wi…          1 July 6-… Maryland C     Eastern Inconclusive: Meade …
## 6 Battle of Bo…          1 July 8,… Maryland D     Eastern Inconclusive: Indeci…
  3. Add a variable to the data frame with the number of months for each battle.
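# month_name is not defined in the code shown; it is assumed here to be the
# twelve English month names joined with "|" (month.name is a base R constant)
month_name <- str_c(month.name, collapse = "|")

CivilWar_02 %>%
  mutate(count_Month = str_count(Date, month_name)) %>%
  select(Battle, count_Year, count_Month, everything()) -> CivilWar_03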

head(CivilWar_03)  
## # A tibble: 6 x 8
##   Battle     count_Year count_Month Date    State  CWSAC Theater Outcome        
##   <chr>           <int>       <int> <chr>   <chr>  <chr> <chr>   <chr>          
## 1 Battle of…          1           1 July 1… Distr… B     Eastern Union victory:…
## 2 Battle of…          1           1 Januar… Maryl… D     Eastern Inconclusive: …
## 3 Battle of…          1           1 Septem… Maryl… B     Eastern Union victory:…
## 4 Battle of…          1           1 Septem… Maryl… A     Eastern Tactically inc…
## 5 Battle of…          1           1 July 6… Maryl… C     Eastern Inconclusive: …
## 6 Battle of…          1           1 July 8… Maryl… D     Eastern Inconclusive: …
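
How the year and month counts separate the date formats can be seen on a few strings. This is a sketch under assumptions: the four dates are illustrative examples written in the formats the data appear to use, and year_pattern and month_pattern are hypothetical names for a four-digit-year regex and an alternation of month names.

```r
library(stringr)

# Illustrative date strings, one per format.
dates <- c("September 17, 1862",
           "July 11-12, 1864",
           "May 31-June 1, 1862",
           "December 31, 1862-January 2, 1863")

year_pattern  <- "\\d{4}"                           # any four-digit year
month_pattern <- str_c(month.name, collapse = "|")  # "January|February|...|December"

n_years  <- str_count(dates, year_pattern)
n_months <- str_count(dates, month_pattern)

n_years
n_months
```

A date with two years must span a year boundary, and a date with two months but one year must span a month boundary, which is what the later filters rely on.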
  4. Add a variable to the data frame that is TRUE if Date spans multiple days and is FALSE otherwise. Spanning multiple months and/or years also counts as TRUE.
CivilWar_03 %>%
  # a hyphen in Date marks a date range, i.e. a battle spanning more than one day
  mutate(Multiple_days = str_detect(Date, "-")) %>%
  select(Battle:count_Month, Multiple_days, everything()) -> CivilWar_04

head(CivilWar_04)
## # A tibble: 6 x 9
##   Battle  count_Year count_Month Multiple_days Date  State CWSAC Theater Outcome
##   <chr>        <int>       <int> <lgl>         <chr> <chr> <chr> <chr>   <chr>  
## 1 Battle…          1           1 TRUE          July… Dist… B     Eastern Union …
## 2 Battle…          1           1 TRUE          Janu… Mary… D     Eastern Inconc…
## 3 Battle…          1           1 FALSE         Sept… Mary… B     Eastern Union …
## 4 Battle…          1           1 FALSE         Sept… Mary… A     Eastern Tactic…
## 5 Battle…          1           1 TRUE          July… Mary… C     Eastern Inconc…
## 6 Battle…          1           1 FALSE         July… Mary… D     Eastern Inconc…
  5. Make four new data frames by filtering the data based on the length of the battles:
    • a data frame with the data for only those battles spanning just one day,
    • a data frame with the data for only those battles spanning multiple days in just one month,
    • a data frame with the data for only those battles spanning multiple months but not multiple years, and,
    • a data frame with the data for only those battles spanning multiple years.
# df of battles spanning just one day
CivilWar_04 %>%
  filter(Multiple_days == FALSE) -> Battles_one_Day

head(Battles_one_Day)
## # A tibble: 6 x 9
##   Battle  count_Year count_Month Multiple_days Date  State CWSAC Theater Outcome
##   <chr>        <int>       <int> <lgl>         <chr> <chr> <chr> <chr>   <chr>  
## 1 Battle…          1           1 FALSE         Sept… Mary… B     Eastern Union …
## 2 Battle…          1           1 FALSE         Sept… Mary… A     Eastern Tactic…
## 3 Battle…          1           1 FALSE         July… Mary… D     Eastern Inconc…
## 4 Battle…          1           1 FALSE         July… Mary… B     Eastern Confed…
## 5 Battle…          1           1 FALSE         Augu… Mary… D     Eastern Inconc…
## 6 Battle…          1           1 FALSE         June… Penn… C     Eastern Inconc…
# df of battles spanning multiple days in just one month
CivilWar_04 %>%
   filter(Multiple_days == TRUE & count_Month == 1) -> Battles_one_Month

head(Battles_one_Month)
## # A tibble: 6 x 9
##   Battle  count_Year count_Month Multiple_days Date  State CWSAC Theater Outcome
##   <chr>        <int>       <int> <lgl>         <chr> <chr> <chr> <chr>   <chr>  
## 1 Battle…          1           1 TRUE          July… Dist… B     Eastern Union …
## 2 Battle…          1           1 TRUE          Janu… Mary… D     Eastern Inconc…
## 3 Battle…          1           1 TRUE          July… Mary… C     Eastern Inconc…
## 4 Battle…          1           1 TRUE          July… Penn… A     Eastern Union …
## 5 Battle…          1           1 TRUE          May … Virg… D     Eastern Inconc…
## 6 Battle…          1           1 TRUE          Marc… Virg… B     Eastern Inconc…
# df of battles spanning multiple months but not multiple years
CivilWar_04 %>%
  filter(Multiple_days == TRUE & count_Year == 1 & count_Month != 1) -> Battles_multiple_Months

head(Battles_multiple_Months)
## # A tibble: 6 x 9
##   Battle  count_Year count_Month Multiple_days Date  State CWSAC Theater Outcome
##   <chr>        <int>       <int> <lgl>         <chr> <chr> <chr> <chr>   <chr>  
## 1 Battle…          1           2 TRUE          May … Virg… D     Eastern Inconc…
## 2 Siege …          1           2 TRUE          Apri… Virg… B     Eastern Inconc…
## 3 Battle…          1           2 TRUE          May … Virg… B     Eastern Inconc…
## 4 Battle…          1           2 TRUE          Apri… Virg… C     Eastern Inconc…
## 5 Battle…          1           2 TRUE          Apri… Virg… C     Eastern Inconc…
## 6 Battle…          1           2 TRUE          Apri… Virg… A     Eastern Confed…
# df of battles spanning multiple years
CivilWar_04 %>%
  filter(count_Year != 1) -> Battles_multiple_Years

head(Battles_multiple_Years)
## # A tibble: 1 x 9
##   Battle  count_Year count_Month Multiple_days Date  State CWSAC Theater Outcome
##   <chr>        <int>       <int> <lgl>         <chr> <chr> <chr> <chr>   <chr>  
## 1 Battle…          2           2 TRUE          Dece… Tenn… A     Western Union …
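
The four filters above partition the battles by span type. As a cross-check, the same rules can be written as a single case_when(); the sketch below runs on made-up dates, and span_type is a hypothetical column name.

```r
library(dplyr)
library(stringr)

# Illustrative date strings, one per span type.
toy <- tibble(Date = c("September 17, 1862",
                       "July 11-12, 1864",
                       "May 31-June 1, 1862",
                       "December 31, 1862-January 2, 1863")) %>%
  mutate(count_Year    = str_count(Date, "\\d{4}"),
         count_Month   = str_count(Date, str_c(month.name, collapse = "|")),
         Multiple_days = str_detect(Date, "-"),
         # conditions are checked in order, so each battle gets exactly one label
         span_type = case_when(
           !Multiple_days  ~ "one day",
           count_Year > 1  ~ "multiple years",
           count_Month > 1 ~ "multiple months",
           TRUE            ~ "multiple days, one month"
         ))

toy$span_type
```

Counting the labels with count(span_type) on the real data would give the sizes of the four data frames and confirm that no battle falls into two of them.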
  6. For each of the four new data frames, add Start and End date variables.
Battles_one_Day %>%
  # a one-day battle starts and ends on the same date
  mutate(Start = mdy(Date), End = mdy(Date)) %>%
  # drop the raw Date string now that it has been parsed
  select(-Date) -> Battles_one_Day_01

head(Battles_one_Day_01)
## # A tibble: 6 x 10
##   Battle    count_Year count_Month Multiple_days State CWSAC Theater Outcome    
##   <chr>          <int>       <int> <lgl>         <chr> <chr> <chr>   <chr>      
## 1 Battle o…          1           1 FALSE         Mary… B     Eastern Union vict…
## 2 Battle o…          1           1 FALSE         Mary… A     Eastern Tactically…
## 3 Battle o…          1           1 FALSE         Mary… D     Eastern Inconclusi…
## 4 Battle o…          1           1 FALSE         Mary… B     Eastern Confederat…
## 5 Battle o…          1           1 FALSE         Mary… D     Eastern Inconclusi…
## 6 Battle o…          1           1 FALSE         Penn… C     Eastern Inconclusi…
## # … with 2 more variables: Start <date>, End <date>
Battles_one_Month %>%
  # split "Month d1-d2, Year" into month (tmp_1), day range (tmp_2), and year (tmp_3)
  separate(Date, c("tmp_1", "tmp_2", "tmp_3"), sep = " ") %>%
  # split the day range at the hyphen
  separate(tmp_2, c("startdate", "enddate"), sep = "-" ) %>%
  # build Start from month, first day, and year
  mutate(Starttmp = paste(tmp_1, startdate, tmp_3),
         Start = mdy(Starttmp)) %>% 
  # build End from month, last day, and year
  mutate(Endtmp = paste(tmp_1, enddate, tmp_3),
         End = mdy(Endtmp)) %>%
  # drop the temporary helper columns
  select(-tmp_1, -startdate, -enddate, -tmp_3, -Starttmp, -Endtmp) -> Battles_one_Month_01

head(Battles_one_Month_01)
## # A tibble: 6 x 10
##   Battle   count_Year count_Month Multiple_days State  CWSAC Theater Outcome    
##   <chr>         <int>       <int> <lgl>         <chr>  <chr> <chr>   <chr>      
## 1 Battle …          1           1 TRUE          Distr… B     Eastern Union vict…
## 2 Battle …          1           1 TRUE          Maryl… D     Eastern Inconclusi…
## 3 Battle …          1           1 TRUE          Maryl… C     Eastern Inconclusi…
## 4 Battle …          1           1 TRUE          Penns… A     Eastern Union vict…
## 5 Battle …          1           1 TRUE          Virgi… D     Eastern Inconclusi…
## 6 Battle …          1           1 TRUE          Virgi… B     Eastern Inconclusi…
## # … with 2 more variables: Start <date>, End <date>
Battles_multiple_Months %>%
  # split "Month1 d1-Month2 d2, Year" at the hyphen, then split off the year at the comma
  separate(Date, c("tmp_1", "tmp_2"), sep = "-") %>%
  separate(tmp_2, c("tmp_3", "tmp_4"), sep = ",") %>%
  # build Start from the first month and day plus the shared year
  mutate(Starttmp = paste(tmp_1, tmp_4),
         Start = mdy(Starttmp)) %>% 
  # build End from the second month and day plus the shared year
  mutate(Endtmp = paste(tmp_3, tmp_4),
         End = mdy(Endtmp)) %>%
  # drop the temporary helper columns
  select(-tmp_1, -tmp_3, -tmp_4, -Starttmp, -Endtmp) -> Battles_multiple_Months_01

head(Battles_multiple_Months_01) 
## # A tibble: 6 x 10
##   Battle    count_Year count_Month Multiple_days State CWSAC Theater Outcome    
##   <chr>          <int>       <int> <lgl>         <chr> <chr> <chr>   <chr>      
## 1 Battle o…          1           2 TRUE          Virg… D     Eastern Inconclusi…
## 2 Siege of…          1           2 TRUE          Virg… B     Eastern Inconclusi…
## 3 Battle o…          1           2 TRUE          Virg… B     Eastern Inconclusi…
## 4 Battle o…          1           2 TRUE          Virg… C     Eastern Inconclusi…
## 5 Battle o…          1           2 TRUE          Virg… C     Eastern Inconclusi…
## 6 Battle o…          1           2 TRUE          Virg… A     Eastern Confederat…
## # … with 2 more variables: Start <date>, End <date>
Battles_multiple_Years %>%
  # both sides of the hyphen are complete dates, so split and parse each one
  separate(Date, c("tmp_1", "tmp_2"), sep = "-") %>%
  mutate(Start = mdy(tmp_1),
         End = mdy(tmp_2)) %>%
  # drop the temporary helper columns
  select(-tmp_1, -tmp_2) -> Battles_multiple_Years_01

head(Battles_multiple_Years_01)
## # A tibble: 1 x 10
##   Battle     count_Year count_Month Multiple_days State CWSAC Theater Outcome   
##   <chr>           <int>       <int> <lgl>         <chr> <chr> <chr>   <chr>     
## 1 Battle of…          2           2 TRUE          Tenn… A     Western Union vic…
## # … with 2 more variables: Start <date>, End <date>
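
The four pipelines above could also be folded into one function that completes each side of the hyphen before parsing. The sketch below is one possible way to do that, not the approach the assignment asks for; parse_battle_dates() is a hypothetical helper, and the input strings are illustrative examples in the four formats.

```r
library(stringr)
library(lubridate)

# Hypothetical helper: turn a date or date-range string into Start and End dates.
parse_battle_dates <- function(x) {
  month_rx <- str_c(month.name, collapse = "|")

  # Split at the first hyphen; strings without a hyphen get an empty right side.
  parts <- str_split_fixed(x, "-", 2)
  left  <- str_trim(parts[, 1])
  right <- str_trim(parts[, 2])

  # No hyphen: a one-day battle starts and ends on the same date.
  right <- ifelse(right == "", left, right)
  # Left side without a year (e.g. "July 11"): borrow the year from the right side.
  left <- ifelse(str_detect(left, "\\d{4}"), left,
                 str_c(left, ", ", str_extract(right, "\\d{4}")))
  # Right side without a month (e.g. "12, 1864"): borrow the month from the left side.
  right <- ifelse(str_detect(right, month_rx), right,
                  str_c(str_extract(left, month_rx), " ", right))

  data.frame(Start = mdy(left), End = mdy(right))
}

dates <- c("September 17, 1862",
           "July 11-12, 1864",
           "May 31-June 1, 1862",
           "December 31, 1862-January 2, 1863")

out <- parse_battle_dates(dates)
out$Span <- as.numeric(out$End - out$Start) + 1  # inclusive number of days
out
```

Because every row goes through the same function, no splitting and re-binding of data frames would be needed with this variant.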
  7. Create a new data frame with all the battles and the Start and End dates by binding the rows of the four data frames as updated in part 6.
# stack the four data frames; they share the same columns, so bind_rows() matches them by name
bind_rows(Battles_one_Day_01,
          Battles_one_Month_01,
          Battles_multiple_Months_01,
          Battles_multiple_Years_01) -> CivilWar_05 # all battles with tidied dates

head(CivilWar_05)
## # A tibble: 6 x 10
##   Battle    count_Year count_Month Multiple_days State CWSAC Theater Outcome    
##   <chr>          <int>       <int> <lgl>         <chr> <chr> <chr>   <chr>      
## 1 Battle o…          1           1 FALSE         Mary… B     Eastern Union vict…
## 2 Battle o…          1           1 FALSE         Mary… A     Eastern Tactically…
## 3 Battle o…          1           1 FALSE         Mary… D     Eastern Inconclusi…
## 4 Battle o…          1           1 FALSE         Mary… B     Eastern Confederat…
## 5 Battle o…          1           1 FALSE         Mary… D     Eastern Inconclusi…
## 6 Battle o…          1           1 FALSE         Penn… C     Eastern Inconclusi…
## # … with 2 more variables: Start <date>, End <date>
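
Binding rows, as the question puts it, can be sketched with dplyr::bind_rows() on two small stand-in data frames. The battle names and dates below are made up for illustration.

```r
library(dplyr)

# Two made-up data frames with identical columns.
one_day <- tibble(Battle = "A",
                  Start  = as.Date("1862-09-17"),
                  End    = as.Date("1862-09-17"))

multi_day <- tibble(Battle = c("B", "C"),
                    Start  = as.Date(c("1864-07-11", "1862-05-31")),
                    End    = as.Date(c("1864-07-12", "1862-06-01")))

# bind_rows() stacks the inputs in order and matches columns by name.
all_battles <- bind_rows(one_day, multi_day)
all_battles
```

Unlike a join, bind_rows() needs no `by` argument and never merges rows, so the result always has as many rows as the inputs combined.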
  8. Calculate the number of days each battle spanned.

The Siege of Port Hudson is the longest battle, spanning 50 days.

CivilWar_05 %>%
  mutate(Span = (End - Start)+1) %>%
  arrange(desc(Span)) -> CivilWar_06

head(CivilWar_06, 1)
## # A tibble: 1 x 11
##   Battle   count_Year count_Month Multiple_days State CWSAC Theater Outcome     
##   <chr>         <int>       <int> <lgl>         <chr> <chr> <chr>   <chr>       
## 1 Siege o…          1           2 TRUE          Loui… A     Western Union victo…
## # … with 3 more variables: Start <date>, End <date>, Span <drtn>
  9. Is there an association between the CWSAC significance of a battle and its duration?
CivilWar_06 %>%
  # recode the CWSAC letter grades into descriptive class names
  mutate(CWSAC = recode(CWSAC, "A" = "Decisive",
                               "B" = "Major",
                               "C" = "Formative",
                               "D" =  "Limited"),
  # convert Span from a difftime to numeric so that scale_y_log10() works
         Span = as.numeric(Span)) -> CivilWar_07

# plot span by CWSAC class
CivilWar_07 %>%
  # Span was computed as (End - Start) + 1, so it is at least 1 day
  # and log10 is defined without adding a further offset
  ggplot(mapping = aes(x = CWSAC, y = Span, fill = CWSAC))+
  geom_boxplot()+
  theme_bw()+
  scale_y_log10()+
  xlab("Civil War Sites Advisory Commission")+
  ylab("Span days")

- Extra Credit: Test for a linear relationship using lm() and interpret the results in one sentence based on the \(p\)-value and adjusted R-squared.

CivilWar07_SLR <- lm(Span ~ CWSAC, CivilWar_07)
summary(CivilWar07_SLR) 
## 
## Call:
## lm(formula = Span ~ CWSAC, data = CivilWar_07)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -5.778 -1.808 -0.808 -0.680 43.222 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      6.7778     0.8625   7.858 4.06e-14 ***
## CWSACFormative  -4.3920     1.0038  -4.375 1.57e-05 ***
## CWSACLimited    -4.9907     1.0266  -4.861 1.71e-06 ***
## CWSACMajor      -3.9701     1.0324  -3.845 0.000141 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.786 on 380 degrees of freedom
## Multiple R-squared:  0.0622, Adjusted R-squared:  0.0548 
## F-statistic: 8.401 on 3 and 380 DF,  p-value: 2.026e-05
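
Read against the output above, the p-value of 2.026e-05 points to a statistically significant association, while the adjusted R-squared of 0.0548 indicates that CWSAC class accounts for only about 5% of the variation in span. The sketch below shows, on made-up numbers, how lm() reports a categorical predictor: the intercept is the mean of the alphabetically first class and each other coefficient is a difference from it. The toy data frame and its values are assumptions for illustration only.

```r
# Made-up spans for three classes; "Decisive" sorts first, so it is the reference level.
toy <- data.frame(
  class = c("Decisive", "Decisive", "Limited", "Limited", "Major", "Major"),
  span  = c(5, 7, 1, 3, 2, 4)
)

fit <- lm(span ~ class, data = toy)

# Intercept = mean span of Decisive; other terms = class mean minus that mean.
coefs <- coef(fit)
coefs

# The two quantities the question asks about.
adj_r2 <- summary(fit)$adj.r.squared
fstat  <- summary(fit)$fstatistic
p_val  <- unname(pf(fstat[1], fstat[2], fstat[3], lower.tail = FALSE))
```

The overall p-value is not stored directly in the summary object, so it is recomputed here from the F statistic and its degrees of freedom.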
  10. Extra Credit: Did the theaters of war shift during the American Civil War?
CivilWar_07 %>%
  # count the battles in each state and store the count in a new variable
  group_by(State) %>%
  mutate(num_Battles_States = n()) %>%
  ungroup() %>%
  # keep only the states with more than two battles
  filter(num_Battles_States > 2)  -> CivilWar_08

CivilWar_08 %>%
  # strip the parenthetical "at the time" notes from three state names
  # reference: https://www.itread01.com/content/1548936390.html
  mutate(State = str_replace_all(State, 
                                 "^North Dakota \\(Dakota Territory\\s\\sat the time\\)$",
                                 "North Dakota"),
         State = str_replace_all(State,
                                 "^West Virginia \\(Virginia at the time\\)$",
                                 "West Virginia"),
         State = str_replace_all(State,
                                 "^Oklahoma \\(Indian Territory at the time\\)$",
                                 "Oklahoma")) %>%
  # convert State to a factor so that forcats functions can reorder it later
  mutate(State = as.factor(State)) -> CivilWar_09
                          
# check that each State has been renamed successfully
unique(CivilWar_09$State)
##  [1] Louisiana      Mississippi    Missouri       North Carolina Virginia      
##  [6] Georgia        South Carolina Alabama        Tennessee      Maryland      
## [11] West Virginia  Arkansas       Kentucky       Florida        North Dakota  
## [16] Oklahoma       Texas          Kansas        
## 18 Levels: Alabama Arkansas Florida Georgia Kansas Kentucky ... West Virginia
CivilWar_09 %>%
  # use fct_reorder to reorder the states by the start date. 
  ggplot(mapping = aes(x = fct_reorder(State, Start), y = Start, fill = Theater))+
  # use `coord_flip()` to make horizontal boxplots.
  coord_flip()+
  geom_boxplot()+
  labs(x = "State", y = "Start Date")
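
The reordering step in the plot can be looked at in isolation. The sketch below uses made-up numeric start days instead of real dates; by default fct_reorder() sorts the factor levels by the median of the second argument within each level.

```r
library(forcats)

# Made-up data: two battles per state, start given as a day number.
state     <- factor(c("Virginia", "Virginia", "Texas", "Texas", "Georgia", "Georgia"))
start_day <- c(10, 30, 1, 5, 50, 70)

# Medians per state: Virginia 20, Texas 3, Georgia 60.
reordered <- fct_reorder(state, start_day)

levels(state)      # alphabetical: "Georgia" "Texas" "Virginia"
levels(reordered)  # by median start: "Texas" "Virginia" "Georgia"
```

In the plot, this ordering places states whose battles began earlier at one end of the axis, which makes any shift between theaters over time easier to see.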