Server or Local R

You can either log in to RStudio here or you can find RStudio on your computer. Just go to the spotlight (in the upper right) and type in RStudio. If you work on the server, packages may already be installed for you. If you work locally, code may run faster. In my perfect world, you would be comfortable doing either.

readr or not to readr

The readr package some of us used last time allows us to read in .csv files faster using the read_csv() function but if the readr installation isn’t working, we can just use read.csv() which is in base R (no package needed). For instance, you can read in polling data using either:

# for all lab and data options
library(dplyr)

### readr version
library(readr)
polls_url <- "https://projects.fivethirtyeight.com/polls-page/president_primary_polls.csv"
polls <- read_csv(polls_url)

# or #

## base R version

polls_url <- "https://projects.fivethirtyeight.com/polls-page/president_primary_polls.csv"
polls <- read.csv(polls_url)

Data Choices

Since some of you may have had your fill of election data. I’m giving you three options for a join lab: Elections, Baseball and Spelling. You can start with any one you choose. Those goal is to understand joins in R. If you’re feeling uncertain after one of these sections, please attempt a second one. In all cases, you’ll be learning about “joins” and after you try a few different kinds of joins, you my want to glance back at the following Venn diagrams.

1. Elections… Continued

Let’s start by loading the same election data as yesterday

polls_url <- "https://projects.fivethirtyeight.com/polls-page/president_primary_polls.csv"

polls <- read.csv(polls_url)

dem_primary_polls <- polls %>% filter(stage=="primary", party=="DEM")



dem_primary_polls <- dem_primary_polls %>% 
  mutate(start_date=as.Date(start_date, "%m/%d/%y"),
         end_date=as.Date(start_date, "%m/%d/%y"),
         create_date=as.Date(created_at, "%m/%d/%y")
         ) 

But now let’s also load a data set with letter grades matched with numbers

grades_url <- "https://raw.githubusercontent.com/jfcross4/data/master/fte_grades.csv"
grades <- read_csv(grades_url)
View(grades)

We can join these two data sets and have numerical grades (for the quality of the pollster) for each line of data in our polling data set

dem_primary_polls <- left_join(dem_primary_polls, grades, by="fte_grade")

Now, we could filter out low quality pollsters and/or assign more weight to polls with better pollsters.

dem_primary_polls %>% filter(fte_num_grade<= 10)

dem_primary_polls %>% filter(state=="Iowa", !is.na(fte_num_grade))  %>%
  mutate(days_before_iowa = as.numeric(as.Date("2020/02/03")-end_date),
         iowa_weight = (0.98^days_before_iowa) ,
         iowa_grade_weight = (0.98^days_before_iowa)*(3/(2+fte_num_grade)) ) %>% 
  group_by(answer) %>% arrange(desc(end_date)) %>% 
  summarize(simple_polling_avg = mean(pct),
            weighted_polling_avg = sum(pct*iowa_weight)/sum(iowa_weight),
            weighted_grade_polling_avg = sum(pct*iowa_grade_weight)/sum(iowa_grade_weight),
            num_polls = n(),
            most_recent_poll = first(end_date)) %>%
  filter(num_polls > 10, simple_polling_avg >=2) %>%
  arrange(desc(weighted_polling_avg)) %>%
  filter(most_recent_poll > "2020/01/01")

Try using polling grades to improve your code for Iowa Caucus predictions.

2. Baseball

The Lahman package has baseball data dating back to 1871. The following code shows you some of the data sets it provides.

library(Lahman)

View(Batting)

View(People)

View(Fielding)

?Lahman

Notice that players are referred to by playerID’s. This is actually quite handy. Using names for players can cause trouble since there might be two Chris Young’s or Vladimir Guerrero’s and one data set might list a player as Joe Smith while another lists him as Joseph Smith and matching up these data sets will be a headache.

Let’s try finding the the home run leaders among players who have played at least 50 games at SS in the year in question. First we’ll create a data set of player-seasons with 80 games at short.

ss_seasons <- Fielding %>% filter(POS=="SS", G>=80) %>% select(playerID, yearID)
View(ss_seasons)

Now, we can join this to the batting data. We have to join it so that every line in the shortstop data is matched only with lines in the Batting table with both the same player and the same season.

ss_Batting <- left_join(ss_seasons, Batting, by=c("playerID", "yearID"))

ss_Batting %>% top_n(20, HR) %>% arrange(desc(HR)) %>% select(playerID, yearID, HR)

Of course, we might want actually player names at this point, and perhaps I’m interested in how old each player was during each season. So, I can join this data with the “People” data set. Note that this time I only want to join the two tables on playerID. Further note that we’re using the select() function within a join function.

ss_Batting <- left_join(ss_Batting, People %>% select(playerID, nameFirst, nameLast, birthYear), by=c("playerID"))
ss_Batting %>% mutate(age = yearID - birthYear) %>%
  top_n(20, HR) %>% 
  arrange(desc(HR)) %>% select(nameFirst, nameLast, age, yearID, HR)

Finally, let’s suppose I’m interest in the age at which the most 40 HR seasons take place (across positions). I could find out by doing the following

left_join(Batting %>% 
            filter(HR>=40), People %>% select(playerID, nameFirst, nameLast, birthYear),by=c("playerID")) %>%
        mutate(age = yearID - birthYear) %>% 
        group_by(age) %>% count() %>%
        arrange(desc(n))

Based on the previous results, at roughly what age do you think that players reach their peak in home runs?

Can you find out at what age players reach their peak in stolen bases? Or (a challenge for baseball fans), at what age they reach their peak in batting average?

3. Spelling

Let’s start by loading two data sets. One is a list of 20,137 words along with how many syllables they are. The other is a list of 4,505 common misspellings from Wikipedia along with the correct spelling of each word. Take a look at each of these data sets using View().

words <- read.csv("https://raw.githubusercontent.com/jfcross4/data/master/words.csv")

misspellings <- read.csv("https://raw.githubusercontent.com/jfcross4/data/master/misspellings.csv")

Let’s look at how many words have each number of syllables.

words %>% group_by(syllables) %>% count()

and at words with the most different misspellings and the misspellings which correspond to the most correct words.

misspellings %>% group_by(correct) %>% count() %>% filter(n>1) %>% arrange(desc(n))

misspellings %>% group_by(misspelling) %>% count() %>% filter(n>1) %>% arrange(desc(n))

If you’re wondering about the 24 different ways to misspell “publisher” you can find it here.

misspellings %>% filter(correct=="publisher")

or which words, someone might by trying to spell when they type “buring”:

misspellings %>% filter(misspelling=="buring")

We can also count characters.

misspellings <- misspellings %>% mutate(n_characters_correct = nchar(as.character(correct)))

But, what I’m most interested in is how often words of different numbers of syllables are misspelled. To answer this, I need to join the misspellings data with the words data. I can join them by matching up the correct spellings with the word names. Try each of the following and pay attention to the differences.

word_join1 <- left_join(misspellings, words, by=c("correct"="word"))

word_join2 <- left_join(words, misspellings, by=c("word" = "correct"))

word_join3 <- inner_join(words, misspellings, by=c("word" = "correct"))

word_join4 <- full_join(words, misspellings, by=c("word" = "correct"))

Depending on which question you’re trying to answer a different join choice may be appropriate and we can use group by to answer our question.

word_join1 %>% group_by(syllables) %>% count()

word_join2 %>% group_by(syllables) %>% count()