In this project I used the table of living former presidents at the following link: https://en.wikipedia.org/wiki/List_of_Presidents_of_the_United_States#Living_former_presidents. The table lists each living former US president along with his term of office and date of birth.
- Part 1: Create a .CSV file that includes all of the information above:
- Part 2: Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your “data”:
- Part 3: Perform analysis on the “data”:
Part 1:
- Loading all the packages needed for this data analysis.
library(rvest)
## Warning: package 'rvest' was built under R version 3.2.2
## Loading required package: xml2
## Warning: package 'xml2' was built under R version 3.2.2
library(stringr)
library(curl)
library(tidyr)
## Warning: package 'tidyr' was built under R version 3.2.2
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.2.2
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(knitr)
- I used the rvest package to scrape the table of living former presidents from Wikipedia. This step was very hard for me, since I am new to web scraping. After finally getting R to load the table, I noticed that the data needed formatting, since it did not look like the table online: the dash in the term of office came through as garbled characters, and each date of birth was preceded by a duplicate date in parentheses. I therefore used stringr to clean the data and created the table “LF_President”, which looks exactly like the table found online.
url <- read_html("https://en.wikipedia.org/wiki/List_of_Presidents_of_the_United_States", Encoding = "UTF-8")
table1 <- url %>% html_nodes("table") %>% .[[2]] %>% html_table()
table1
## President Term of office Date of birth
## 1 George H. W. Bush 1989â<U+0080><U+0093>1993 (1924-06-12) June 12, 1924 (age 91)
## 2 Jimmy Carter 1977â<U+0080><U+0093>1981 (1924-10-01) October 1, 1924 (age 91)
## 3 George W. Bush 2001â<U+0080><U+0093>2009 (1946-07-06) July 6, 1946 (age 69)
## 4 Bill Clinton 1993â<U+0080><U+0093>2001 (1946-08-19) August 19, 1946 (age 69)
LF_President <- table1
LF_President$"Term of office" <- str_replace_all(LF_President$"Term of office", pattern = "(\\d{4}).*?(\\d{4})", replacement = "\\1\\-\\2 ")
LF_President$"Date of birth" <- LF_President$"Date of birth" %>% str_replace_all(pattern = "\\(\\d{4}\\-\\d{2}\\-\\d{2}\\)", replacement = "") %>% str_replace_all(pattern = "age.\\s", replacement = "age ")
LF_President
## President Term of office Date of birth
## 1 George H. W. Bush 1989-1993 June 12, 1924 (age 91)
## 2 Jimmy Carter 1977-1981 October 1, 1924 (age 91)
## 3 George W. Bush 2001-2009 July 6, 1946 (age 69)
## 4 Bill Clinton 1993-2001 August 19, 1946 (age 69)
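The two clean-up patterns can be tried on single values without scraping anything. The sketch below is a base R equivalent of the stringr calls above, not the code used in this project. It assumes the garbled separator is an en dash (written here as `"\u2013"`), and the sample strings are copied from the raw output. Unlike the code above, it also strips the space left behind after the parenthesised date.

```r
# Sample values modelled on the raw scrape; "\u2013" is an en dash (assumption).
term_raw <- "1989\u20131993"
dob_raw  <- "(1924-06-12) June 12, 1924 (age 91)"

# Keep the two four-digit years and put a plain hyphen between them.
term_clean <- sub("(\\d{4})\\D+(\\d{4})", "\\1-\\2", term_raw)

# Drop the leading "(YYYY-MM-DD)" stamp and any whitespace that follows it.
dob_clean <- sub("^\\(\\d{4}-\\d{2}-\\d{2}\\)\\s*", "", dob_raw)

print(term_clean)  # "1989-1993"
print(dob_clean)   # "June 12, 1924 (age 91)"
```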
- I saved the living former presidents table as a .CSV file in my local GitHub repository.
write.csv(LF_President, file = "C:/Users/Nabila/Documents/GitHub/Class-IS607/Project 2/Living Former Presidents/Living_Former_Presidents.csv")
Part 2:
- Using the curl library, I read the living former presidents table from my online GitHub repository.
Living_President <- read.csv(file="https://raw.githubusercontent.com/nabilahossain/Class-IS607/master/Project%202/Living%20Former%20Presidents/Living_Former_Presidents.csv", header=TRUE, sep=",")
Living_President
## X President Term.of.office Date.of.birth
## 1 1 George H. W. Bush 1989-1993 June 12, 1924 (age 91)
## 2 2 Jimmy Carter 1977-1981 October 1, 1924 (age 91)
## 3 3 George W. Bush 2001-2009 July 6, 1946 (age 69)
## 4 4 Bill Clinton 1993-2001 August 19, 1946 (age 69)
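The extra `X` column in the output above comes from the row names that `write.csv` stores by default; `read.csv` then reads them back as an unnamed first column and calls it `X`. A minimal sketch of the effect, using a small made-up data frame and a temporary file (both are illustrations, not the project's data or paths):

```r
# Tiny stand-in data frame (illustrative only).
df <- data.frame(President = c("Jimmy Carter", "Bill Clinton"),
                 Term = c("1977-1981", "1993-2001"),
                 stringsAsFactors = FALSE)
path <- tempfile(fileext = ".csv")

# Default behaviour: row names are written, so an "X" column appears on read.
write.csv(df, path)
with_rn <- read.csv(path)

# With row.names = FALSE the file holds only the real columns.
write.csv(df, path, row.names = FALSE)
without_rn <- read.csv(path)

print(names(with_rn))     # "X" "President" "Term"
print(names(without_rn))  # "President" "Term"
```

Writing the file with `row.names = FALSE` would make the later `select(-X)` step unnecessary.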
- Using the dplyr and tidyr libraries, I transformed and tidied the data. Since the original table had a lot of data mashed together, I created a new table “Living Former President 1” (LFP1) by transforming the original table. I separated the 3rd and 4th columns (term of office and date of birth) into six columns: term start year, term end year, birth month, birth day, birth year and age. I originally wanted to split each president’s name into first, middle and last name. However, since there are two presidents named “George Bush”, I thought it would be confusing to study the data with only a first or last name.
LFP1 <- Living_President %>% separate(Term.of.office, c("Term_Start_Year", "Term_End_Year"), sep = "-") %>% separate(Date.of.birth, c("Y", "Birth_Month", "Birth_Day", "Birth_Year", "Z", "Age"), extra = "drop") %>% select(-X, -Y, -Z)
LFP1
## President Term_Start_Year Term_End_Year Birth_Month Birth_Day
## 1 George H. W. Bush 1989 1993 June 12
## 2 Jimmy Carter 1977 1981 October 1
## 3 George W. Bush 2001 2009 July 6
## 4 Bill Clinton 1993 2001 August 19
## Birth_Year Age
## 1 1924 91
## 2 1924 91
## 3 1946 69
## 4 1946 69
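The throwaway columns `Y` and `Z` exist because `separate()` splits, by default, on every run of non-alphanumeric characters (`sep = "[^[:alnum:]]+"`). The base R sketch below applies the same pattern to one value to show which pieces come out. It assumes the value starts with a space, which is what the empty `Y` piece suggests; it is an illustration, not the tidyr implementation.

```r
# One date-of-birth value, assumed to carry a leading space.
dob <- " June 12, 1924 (age 91)"

# Split on runs of characters that are neither letters nor digits.
pieces <- strsplit(dob, "[^[:alnum:]]+")[[1]]

# Label the pieces with the column names used in the separate() call.
names(pieces) <- c("Y", "Birth_Month", "Birth_Day", "Birth_Year", "Z", "Age")
print(pieces)
# Y is "" (text before the leading space) and Z is the word "age".
```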
- I further transformed and tidied the data. I created a table “Living_Former_President,” which has each president’s birth month as a number, how many years they served, how many four-year terms that corresponds to, and their age at the start and end of their time in office.
LFP1$Term_End_Year <- as.numeric(LFP1$Term_End_Year)
LFP1$Term_Start_Year <- as.numeric(LFP1$Term_Start_Year)
LFP1$Birth_Year <- as.numeric(LFP1$Birth_Year)
LFP1$Birth_Day <- as.numeric(LFP1$Birth_Day)
LFP1$Birth_Day <- sprintf("%02d", LFP1$Birth_Day)
LFP1$Age <- as.numeric(LFP1$Age)
Living_Former_President <- LFP1 %>% mutate(Years_In_Term = Term_End_Year - Term_Start_Year) %>% mutate(Age_Start_Term = Term_Start_Year - Birth_Year) %>% mutate(Age_End_Term = Term_End_Year - Birth_Year) %>% mutate(Term_Served = Years_In_Term / 4 )
Living_Former_President$Birth_Month <- Living_Former_President$Birth_Month %>% as.character() %>% str_replace(pattern = "June\\s{1,}", replacement = "6") %>% str_replace(pattern = "October", replacement = "10") %>% str_replace(pattern = "July\\s{1,}", replacement = "7") %>% str_replace(pattern = "August\\s", replacement = "8") %>% as.numeric()
Living_Former_President
## President Term_Start_Year Term_End_Year Birth_Month Birth_Day
## 1 George H. W. Bush 1989 1993 6 12
## 2 Jimmy Carter 1977 1981 10 01
## 3 George W. Bush 2001 2009 7 06
## 4 Bill Clinton 1993 2001 8 19
## Birth_Year Age Years_In_Term Age_Start_Term Age_End_Term Term_Served
## 1 1924 91 4 65 69 1
## 2 1924 91 4 53 57 1
## 3 1946 69 8 55 63 2
## 4 1946 69 8 47 55 2
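The month names above are converted one at a time with a separate replacement per month. A shorter alternative, sketched here under the assumption that the names are plain English month names with at most stray surrounding whitespace, is to look each one up in R's built-in `month.name` vector:

```r
# Month names as they appear in the table (one with stray whitespace added).
birth_month <- c("June ", "October", "July", "August")

# trimws() removes surrounding whitespace; match() returns each name's
# position in month.name, which is the month number.
month_number <- match(trimws(birth_month), month.name)

print(month_number)  # 6 10 7 8
```

This works for any of the twelve months without adding further replacement rules.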
Part 3:
- The table below shows the original data that was presented on the website.
kable(select(Living_President, -X), caption = "Table 1: Original table living former president (online).", align = "c")
Table 1: Original table living former president (online).
| President | Term.of.office | Date.of.birth |
|:---:|:---:|:---:|
| George H. W. Bush | 1989-1993 | June 12, 1924 (age 91) |
| Jimmy Carter | 1977-1981 | October 1, 1924 (age 91) |
| George W. Bush | 2001-2009 | July 6, 1946 (age 69) |
| Bill Clinton | 1993-2001 | August 19, 1946 (age 69) |
- The table below lists the presidents by birth date, from oldest to youngest (dates are in MM-DD-YYYY format). George H. W. Bush is older than Jimmy Carter, even though both were born in 1924, and both served one term. Similarly, George W. Bush is older than Bill Clinton (both were born in 1946), and both served two terms.
Living_Former_President$Birth_Month <- sprintf("%02d", Living_Former_President$Birth_Month)
LFP2 <- Living_Former_President %>% select(President, Birth_Month, Birth_Day, Birth_Year, Term_Start_Year, Term_End_Year, Term_Served) %>% arrange( Birth_Year, Birth_Month) %>% unite("Date_of_Birth", Birth_Month, Birth_Day, Birth_Year, sep = "-")
kable(LFP2, caption = "Table 2: Living Former President, Oldest to Youngest, by date.", align = "c")
Table 2: Living Former President, Oldest to Youngest, by date.
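| President | Date_of_Birth | Term_Start_Year | Term_End_Year | Term_Served |
|:---:|:---:|:---:|:---:|:---:|
| George H. W. Bush | 06-12-1924 | 1989 | 1993 | 1 |
| Jimmy Carter | 10-01-1924 | 1977 | 1981 | 1 |
| George W. Bush | 07-06-1946 | 2001 | 2009 | 2 |
| Bill Clinton | 08-19-1946 | 1993 | 2001 | 2 |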
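The ordering in Table 2 comes from sorting on birth year and then birth month, which is enough for these four presidents. A more general approach, sketched below in base R, is to parse the MM-DD-YYYY strings into real `Date` values and sort on those, so that the day is taken into account as well. The rows are deliberately shuffled and the helper column `dob` is introduced only for this illustration.

```r
# Same four presidents, in shuffled order (illustrative input).
presidents <- data.frame(
  President = c("Bill Clinton", "Jimmy Carter",
                "George W. Bush", "George H. W. Bush"),
  Date_of_Birth = c("08-19-1946", "10-01-1924", "07-06-1946", "06-12-1924"),
  stringsAsFactors = FALSE
)

# Parse the MM-DD-YYYY strings into Date objects.
presidents$dob <- as.Date(presidents$Date_of_Birth, format = "%m-%d-%Y")

# Earliest birth date first, i.e. oldest to youngest.
sorted <- presidents[order(presidents$dob), ]
print(sorted$President)
# "George H. W. Bush" "Jimmy Carter" "George W. Bush" "Bill Clinton"
```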
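- The table below summarizes the living former presidents by age. The ages at the start and end of each term are computed as the term year minus the birth year, so they are approximate. By this measure, the oldest of the four to take office is George H. W. Bush at about 65, while Bill Clinton is the youngest, taking office at about 47. Since all four were born after the January inauguration date, their exact ages on taking office were one year lower.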
LFP3 <- Living_Former_President %>% select(President, Age, Age_Start_Term, Age_End_Term, Years_In_Term)
kable(LFP3, caption = "Table 3: Summary of living former president, by age.", align = "c")
Table 3: Summary of living former president, by age.
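| President | Age | Age_Start_Term | Age_End_Term | Years_In_Term |
|:---:|:---:|:---:|:---:|:---:|
| George H. W. Bush | 91 | 65 | 69 | 4 |
| Jimmy Carter | 91 | 53 | 57 | 4 |
| George W. Bush | 69 | 55 | 63 | 8 |
| Bill Clinton | 69 | 47 | 55 | 8 |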
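The start and end ages in Table 3 are plain year differences. As a rough check, the sketch below computes each president's exact age on the day they took office. It assumes an inauguration date of January 20 of the start year; that date and the helper `age_on()` are introduced here for illustration and are not part of the analysis above.

```r
# Birth dates from the table; inauguration assumed to be January 20.
birth <- as.Date(c("1924-06-12", "1924-10-01", "1946-07-06", "1946-08-19"))
start <- as.Date(paste0(c(1989, 1977, 2001, 1993), "-01-20"))

# Hypothetical helper: completed years between a birth date and another date.
age_on <- function(birth, on) {
  b <- as.POSIXlt(birth)
  o <- as.POSIXlt(on)
  # TRUE when the birthday has not yet occurred in the year of `on`.
  before_birthday <- (o$mon < b$mon) | (o$mon == b$mon & o$mday < b$mday)
  (o$year - b$year) - as.integer(before_birthday)
}

exact_age <- age_on(birth, start)
print(exact_age)  # 64 52 54 46
```

Each value is one year lower than the corresponding `Age_Start_Term`, because all four birthdays fall later in the year than January 20.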
- The table below shows the averages across the four presidents, in years. The average age at the start of a term is 55, the average age at the end is 61, and the average current age is 80. On average, each president held office for 6 years.
LFP4 <- Living_Former_President %>% summarise(Age_Started_Term=mean(Age_Start_Term), Age_Ended_Term=mean(Age_End_Term), Age=mean(Age), Years_In_Term=mean(Years_In_Term)) %>% data.frame() %>% gather("Average", "Years", 1:4)
LFP4
## Average Years
## 1 Age_Started_Term 55
## 2 Age_Ended_Term 61
## 3 Age 80
## 4 Years_In_Term 6
kable(LFP4, caption = "Table 4: The four living former presidents' summary (average).", align = "c")
Table 4: The four living former presidents’ summary (average).
| Average | Years |
|:---:|:---:|
| Age_Started_Term | 55 |
| Age_Ended_Term | 61 |
| Age | 80 |
| Years_In_Term | 6 |
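The averages in Table 4 can be reproduced by hand from the values in Table 3. The sketch below is a base R equivalent of the `summarise()` and `gather()` steps, with the four columns typed in directly; it is a check on the arithmetic, not the code that produced the table.

```r
# Values copied from Table 3 (Bush Sr., Carter, Bush Jr., Clinton).
ages <- data.frame(
  Age_Started_Term = c(65, 53, 55, 47),
  Age_Ended_Term   = c(69, 57, 63, 55),
  Age              = c(91, 91, 69, 69),
  Years_In_Term    = c(4, 4, 8, 8)
)

# Column means, e.g. (65 + 53 + 55 + 47) / 4 = 55.
avg <- colMeans(ages)

# Reshape to the long two-column layout used in Table 4.
summary_long <- data.frame(Average = names(avg), Years = unname(avg))
print(summary_long)
```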