library(readr)
library(tidyr)
library(tidyverse)
library(dplyr)
library(lubridate)
library(rvest)
library(stringr)
library(forecast)
library(MVN)
library(car)
library(outliers)
library(infotheo)
Three datasets that depicted various details about US Presidents were tidied, merged into a single dataset and preprocessed ready for analysis.
Using tidy data principles, an untidy dataset was manipulated to ensure each column contained a single variable, each row contained a single observation and each cell a single value. This involved string manipulation, imputation, grouping and counting of data.
Merging used a common key between all three datasets and ensured that all observations were retained.
Preprocessing included applying data type conversions for ease of handling; manipulating data to form new variables that may be useful for the user; scanning the data for obvious inconsistencies; testing data for outliers and handling them appropriately; and transforming certain variables into a normal distribution for ease of testing.
The resultant dataset is complete, tidy and consistent, and should enable users and analysts to glean valuable information about the history of US Presidents.
Pres1 is a presidential timeline [1], imported using the read_csv() function. Variables include:
Index (President number): Numeric
Name (President’s first and surname): Character
Birth (Date of birth): Character
Death (Date of death, if applicable): Character
TermBegin (Date of commencement of Presidency): Character
TermEnd (Date of conclusion of Presidency): Character.
Pres1 <- read_csv("President_timeline.csv")
head(Pres1)
Pres2 details more information about the Presidents of the United States, scraped from a table on Britannica [2] using code from GitHub [3].
Variables include:
no (President number): Character
president (First and surname): Character
birthplace (State of birth, abbreviated): Character
political party (Party of affiliation): Character
term (Year range of presidency): Character.
webpage2 <- read_html("https://www.britannica.com/topic/Presidents-of-the-United-States-1846696")
tbls_ls2 <- webpage2 %>% html_nodes("table") %>% .[1] %>% html_table(fill = TRUE)
Pres2 <- tbls_ls2[[1]]
head(Pres2)
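The scraping pattern can be tried offline on an inline HTML string. The sketch below is illustrative only: the two-row table is made-up markup rather than the Britannica page, and `toy_html` and `toy_tbl` are hypothetical names. In rvest 1.0.0 and later, html_elements() is the preferred spelling of html_nodes() and the fill argument is deprecated, so fill is omitted here.

```r
library(rvest)

# Hypothetical stand-in for a downloaded page: one table with a header row.
toy_html <- "<html><body><table>
  <tr><th>no</th><th>president</th></tr>
  <tr><td>1</td><td>George Washington</td></tr>
  <tr><td>2</td><td>John Adams</td></tr>
</table></body></html>"

page <- read_html(toy_html)

# Select every <table> node, keep the first, and parse it into a data frame.
tables <- html_nodes(page, "table")
toy_tbl <- html_table(tables[[1]])

toy_tbl
```

Indexing the node set with `[[1]]` plays the same role as `.[1]` followed by `[[1]]` in the pipeline above.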
Pets is a dataset scraped from a table detailing the pets of the past Presidents of the United States [4] using the same code [3].
Variables include:
Rank (Highest number of Presidents to own that pet type): Character
# (President number): Character
President (Type of pet, and Presidents who own them): Character
# of Presidents (Number of Presidents to own that pet type): Character.
webpage <- read_html("https://www.potus.com/presidential-facts/types-of-pets/")
tbls_ls <- webpage %>% html_nodes("table") %>% .[1] %>% html_table(fill = TRUE)
Pets <- tbls_ls[[1]]
head(Pets)
In order to merge these three datasets, they need to all be in a form where the key is the President number. The datasets can’t be merged based on Name, as one President served twice - Grover Cleveland was the 22nd and 24th President of the United States - and the value is not unique.
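Whether a column can serve as a join key can be checked directly. The following is a minimal sketch on toy rows, not the real Pres1 data; `is_unique_key()` is a hypothetical helper written for this illustration.

```r
library(dplyr)

# Toy rows only; not the real Pres1 data.
toy <- tibble(
  PresNo    = c(21, 22, 23, 24),
  President = c("Chester Arthur", "Grover Cleveland",
                "Benjamin Harrison", "Grover Cleveland")
)

# A usable join key must identify each row exactly once.
is_unique_key <- function(df, col) {
  !anyDuplicated(df[[col]])
}

is_unique_key(toy, "PresNo")     # TRUE
is_unique_key(toy, "President")  # FALSE

# Show which names repeat.
toy %>% count(President) %>% filter(n > 1)
```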
Before merging the data, the column names were changed in Pres1 and Pres2 for clarification of variables. Pres2 was subsetted to remove the first two rows - which were footnote references - and to remove the first and last column. The first column was empty and the last column listed term dates, which are already detailed in Pres1.
colnames(Pres1) <- c("PresNo", "President", "Birth", "Death", "TermBegin", "TermEnd")
Pres2 <- Pres2[(2:5)]
Pres2 <- Pres2[-(1:2),]
colnames(Pres2) <- c("PresNo", "President", "Birthplace", "PoliticalParty")
Pets was in an untidy format, and had to be tidied before merging. The way the data was displayed meant that more than one variable was in a column (Pet types and President names were both in the “President” column), and the “# of Presidents” column had only a count next to each pet type and the rest of the values were NA.
head(Pets, 5)
tail(Pets, 5)
To tidy the data so that each column has its own variable, each line its own observation and each cell its own value, the following steps were taken:
The “# of Presidents” column was renamed “Pets”;
The pet values were imputed from the “President” column into the “Pets” column, next to each President whose name originally appeared under that pet.
A string manipulation was performed on the pet types to make them singular, which included use of the str_sub() function for the pet names that were plural with an “s”, and a manual imputation of those with irregular plurals (e.g. from “mice” to “mouse”).
Pet types that were only owned by one president were imputed with a value called “Other”.
The redundant pet types were then removed from the President column using the filter() function and a subset.
colnames(Pets) <- c("Rank", "PresNo", "President", "Pet")
Pets$Pet[2:33] <- Pets$President[1] %>% str_sub(start = 1, end = -2)
Pets$Pet[35:50] <- Pets$President[34] %>% str_sub(start = 1, end = -2)
Pets$Pet[52:65] <- Pets$President[51]%>% str_sub(start = 1, end = -2)
Pets$Pet[67:78] <- Pets$President[66]%>% str_sub(start = 1, end = -2)
Pets$Pet[80:83] <- Pets$President[79]%>% str_sub(start = 1, end = -2)
Pets$Pet[85:87] <- Pets$President[84]%>% str_sub(start = 1, end = -2)
Pets$Pet[89:91] <- Pets$President[88]%>% str_sub(start = 1, end = -2)
Pets$Pet[93:95] <- Pets$President[92]%>% str_sub(start = 1, end = -2)
Pets$Pet[97:98] <- Pets$President[96]%>% str_sub(start = 1, end = -2)
Pets$Pet[100:101] <- Pets$President[99]%>% str_sub(start = 1, end = -2)
Pets$Pet[103:104] <- Pets$President[102]%>% str_sub(start = 1, end = -2)
Pets$President[seq(105, 143, 2)] <- str_sub(Pets$President[seq(105, 143, 2)], start = 1, end = -2)
Pets$President[115] <- "Hippopotamus"
Pets$President[123] <- "Mouse"
Pets$President[135] <- "Sheep"
Pets$President[143] <- "Wallaby"
Pets$Pet[seq(106, 144, 2)] <- "Other"
Pets <- Pets %>% filter(Rank <1)
Pets <- Pets[-114,]
head(Pets)
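The index-based assignments above depend on the exact row positions of the table at the time it was scraped. A position-independent alternative, not the method used in this report, is to flag the pet-type rows and carry their value down with tidyr::fill(). The sketch assumes that pet-type rows can be recognised by a missing PresNo; the toy data and the names `toy_pets` and `tidy_pets` are made up for illustration, and the real table may need a different rule.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Made-up miniature of the scraped layout: pet-type rows have no PresNo.
toy_pets <- tibble(
  PresNo    = c(NA, "1", "3", NA, "16"),
  President = c("Dogs", "George Washington", "Thomas Jefferson",
                "Cats", "Abraham Lincoln")
)

tidy_pets <- toy_pets %>%
  # Copy the pet type from header rows only, leaving NA elsewhere.
  mutate(Pet = if_else(is.na(PresNo), President, NA_character_)) %>%
  # Carry each pet type down to the rows beneath it.
  fill(Pet, .direction = "down") %>%
  # Drop the header rows, which are not observations.
  filter(!is.na(PresNo)) %>%
  # Strip a trailing "s" to make regular plurals singular.
  mutate(Pet = str_remove(Pet, "s$"))

tidy_pets
```

Irregular plurals would still need manual recoding, as in the steps above.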
A new column, “n” - number of pets - was created after combining the group_by() and count() functions. The data was then arranged in ascending order of “n” to check the values of the two Presidents who had not owned any pets.
Pets <- Pets %>% select(PresNo, President, Pet) %>% group_by(PresNo, President) %>% count()
Pets <- Pets %>% arrange(n)
head(Pets)
After checking the head of the data, there were no “0” values for n as the NA values had been grouped and were counted as “1”. The two Presidents who did not own any pets had “0” values imputed. The columns were renamed into the same format as the other two data sets - PresNo, President and PetCount.
Pets$n[c(1,11)] <- "0"
colnames(Pets) <- c("PresNo", "President", "PetCount")
head(Pets)
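The reason no zeros appear is that count() tallies rows rather than non-missing values. A minimal sketch on made-up data shows the difference, along with one way to obtain zeros directly; it is an alternative to the manual imputation above, not the method used here.

```r
library(dplyr)

# Toy data: president 2 has a single row with no pet recorded.
toy <- tibble(
  PresNo = c(1, 1, 2),
  Pet    = c("Dog", "Horse", NA)
)

# count() tallies rows, so the NA row still contributes 1.
by_rows <- toy %>% count(PresNo)

# Counting only non-missing pets gives 0 where no pet is recorded.
by_pets <- toy %>%
  group_by(PresNo) %>%
  summarise(PetCount = sum(!is.na(Pet)))

by_rows$n         # 2 1
by_pets$PetCount  # 2 0
```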
Grover Cleveland’s PresNo was changed from “22 & 24” to “22” so that each cell has its own value and is considered tidy and ready to merge.
Pets$PresNo[21] <- "22"
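Recoding “22 & 24” as “22” leaves the 24th presidency without a matching row in Pets. An alternative, sketched below on made-up values and not the approach taken in this report, is to split the combined key into one row per presidency with tidyr::separate_rows().

```r
library(dplyr)
library(tidyr)

# Toy data; the PetCount values are made up for illustration.
toy_pets <- tibble(
  PresNo   = c("21", "22 & 24", "23"),
  PetCount = c(1, 2, 3)
)

# One row per president number; convert = TRUE turns the key into integers.
split_pets <- toy_pets %>%
  separate_rows(PresNo, sep = " & ", convert = TRUE)

split_pets
```

Both resulting rows carry the same PetCount, so a later join on PresNo would find a match for each term.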
While most data type conversions will be completed after merging the datasets, the PresNo variable in Pets and Pres2 had to be converted to the same data type in order for it to be used as the key to the join. They were both converted to numeric using the as.numeric() function.
Pets$PresNo <- as.numeric(Pets$PresNo)
Pres2$PresNo <- as.numeric(Pres2$PresNo)
The Pres1 and Pres2 datasets were merged to create Pres12, using the left_join() function with “PresNo” as the key. A subset of Pres2 was used, removing the President column as it already exists in Pres1.
Pres12 <- left_join(Pres1, Pres2[-2], by = "PresNo")
Pets was then merged with Pres12 using the left_join() function, and again using “PresNo” as the key and a subset of Pets to remove the President column.
Pres3 <- left_join(Pres12, Pets[-2], by = "PresNo")
head(Pres3)
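The behaviour of left_join() that these merges rely on can be seen in a small sketch. The data below is a toy example (the PetCount values are made up): every row of the left-hand table is kept, and keys with no match on the right are filled with NA.

```r
library(dplyr)

# Toy data: the right-hand table has no row for president 24.
left  <- tibble(PresNo = c(22, 23, 24),
                President = c("Grover Cleveland", "Benjamin Harrison",
                              "Grover Cleveland"))
right <- tibble(PresNo = c(22, 23), PetCount = c(2, 3))

joined <- left_join(left, right, by = "PresNo")

nrow(joined) == nrow(left)   # TRUE: every left-hand row is retained
joined$PetCount              # 2 3 NA: the unmatched key is filled with NA

# anti_join() lists the left-hand rows that found no match.
unmatched <- anti_join(left, right, by = "PresNo")
unmatched
```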
After merging all three datasets, the structure of Pres3 was checked using the str() function:
str(Pres3)
## tibble [45 × 9] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ PresNo : num [1:45] 1 2 3 4 5 6 7 8 9 10 ...
## $ President : chr [1:45] "George Washington" "John Adams" "Thomas Jefferson" "James Madison" ...
## $ Birth : chr [1:45] "22 February 1732" "30 October 1735" "13 April 1743" "16 March 1751" ...
## $ Death : chr [1:45] "14 December 1799" "4 July 1826" "4 July 1826" "28 June 1836" ...
## $ TermBegin : chr [1:45] "30 April 1789" "4 March 1797" "4 March 1801" "4 March 1809" ...
## $ TermEnd : chr [1:45] "4 March 1997" "4 March 1801" "4 March 1809" "4 March 1817" ...
## $ Birthplace : chr [1:45] "Va." "Mass." "Va." "Va." ...
## $ PoliticalParty: chr [1:45] "Federalist" "Federalist" "Democratic-Republican" "Democratic-Republican" ...
## $ PetCount : chr [1:45] "4" "2" "4" "1" ...
## - attr(*, "spec")=
## .. cols(
## .. Index = col_double(),
## .. Name = col_character(),
## .. Birth = col_character(),
## .. Death = col_character(),
## .. TermBegin = col_character(),
## .. TermEnd = col_character()
## .. )
The following conversions were made:
Birth, Death, TermBegin and TermEnd were converted to date using the as.Date() function with the format %d %b %Y;
PoliticalParty was converted to a factor and checked using the unique(), factor() and which(is.na()) functions;
PetCount was converted to numeric using the as.numeric() function.
Pres3$Birth <- as.Date(Pres3$Birth, format = "%d %b %Y")
Pres3$Death <- as.Date(Pres3$Death, format = "%d %b %Y")
Pres3$TermBegin <- as.Date(Pres3$TermBegin, format = "%d %b %Y")
Pres3$TermEnd <- as.Date(Pres3$TermEnd, format = "%d %b %Y")
unique(Pres2$PoliticalParty)
## [1] "Federalist" "Democratic-Republican" "National Republican"
## [4] "Democratic" "Whig" "Republican"
## [7] "Democratic (Union)"
Pres3$PoliticalParty <- factor(Pres3$PoliticalParty, labels = c("Federalist", "Democratic-Republican",
"National Republican", "Democratic", "Whig", "Republican", "Democratic (Union)"), levels = c("Federalist",
"Democratic-Republican", "National Republican", "Democratic", "Whig", "Republican", "Democratic (Union)"))
which(is.na(Pres3$PoliticalParty))
## integer(0)
Pres3$PetCount <- as.numeric(Pres3$PetCount)
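The date conversion can be illustrated on two strings taken from the str() output above. This is a minimal sketch that assumes an English locale for month names; on input, R's %b also accepts full month names, which is why the format works on values such as “22 February 1732”. lubridate's dmy() is shown as an equivalent.

```r
library(lubridate)

raw <- c("22 February 1732", "4 July 1826")

# Base R: day, month name, four-digit year.
d_base <- as.Date(raw, format = "%d %b %Y")

# lubridate: dmy() infers the same day-month-year layout.
d_lub <- dmy(raw)

all(d_base == d_lub)

# A string that does not match the format becomes NA rather than an error,
# so counting NAs after conversion is a cheap check.
bad <- as.Date("1732-02-22", format = "%d %b %Y")
is.na(bad)
```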
The variables of Pres3 are now:
PresNo (Number of President): Numeric
President (Name and surname): Character
Birth (Date of birth): Date
Death (Date of death, if applicable): Date
TermBegin (Date of commencement of Presidency): Date
TermEnd (Date of conclusion of Presidency): Date
Birthplace (State of birth, abbreviated): Character
PoliticalParty (Party of affiliation): Factor
PetCount (Number of pets owned): Numeric.
Three new variables were created using the mutate() function. DiedAged became the difference between Birth and Death dates, with NA values for Presidents who are still alive. AgeInaugurated became the difference between the Birth and TermBegin dates. These variables referring to age were converted to years using the time_length() function and were rounded to full numbers.
TermLength_yrs became the difference between the TermBegin and TermEnd dates. It was also converted to years using time_length(), but was rounded to two decimal places so that any anomalies aside from the typical 4 and 8 year terms served could be noted.
Pres3 <- Pres3 %>% mutate(DiedAged = Death-Birth)
Pres3$DiedAged <- Pres3$DiedAged %>% time_length(unit = "year") %>% round(0)
Pres3 <- Pres3 %>% mutate(TermLength_yrs = TermEnd-TermBegin)
Pres3$TermLength_yrs <- Pres3$TermLength_yrs %>% time_length(unit = "year") %>% round(2)
Pres3 <- Pres3 %>% mutate(AgeInaugurated = TermBegin-Birth)
Pres3$AgeInaugurated <- Pres3$AgeInaugurated %>% time_length(unit = "year") %>% round(0)
head(Pres3[8:12])
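Rounding to the nearest whole number can report an age one year higher than the number of completed years. The sketch below uses George Washington's dates from the data above to compare round() with floor(); it uses a lubridate interval, which measures the span in calendar years. Whether the difference matters depends on how the age variables will be used.

```r
library(lubridate)

birth <- as.Date("1732-02-22")
death <- as.Date("1799-12-14")

# Calendar-aware length of the span, in years.
age <- time_length(interval(birth, death), unit = "year")

round(age)  # 68: nearest whole year
floor(age)  # 67: completed years
```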
The new variables of Pres3 are:
AgeInaugurated (Age at the start of Presidential term): Numeric
TermLength_yrs (Length of Presidential term to 2dp): Numeric
DiedAged (Age at death, if applicable): Numeric.
The variables were then reordered in a logical order for the viewer.
Pres3 <- Pres3 %>% select(PresNo, President, PoliticalParty, Birth, Death, TermBegin, TermEnd,
AgeInaugurated, TermLength_yrs, DiedAged, PetCount)
Upon an initial scan, an obvious error in the data became apparent after creating TermLength_yrs. The first value, George Washington’s term, is listed as 207.84 years. This was rectified by manually changing the end date of his term from 1997-03-04 to 1797-03-04. The correction was double-checked by confirming that it matches the date on which the second President’s term begins.
Pres3$TermEnd[1] <- as.Date("1797-03-04")
Pres3$TermEnd[1] == Pres3$TermBegin[2]
## [1] TRUE
The value in TermLength_yrs was then recomputed from the difference between the corrected dates, followed by the time_length() and round() functions. The head of the data was then checked to confirm the change.
Pres3$TermLength_yrs[1] <- (Pres3$TermEnd[1]-Pres3$TermBegin[1]) %>% time_length(unit = "year") %>% round(2)
While this error was picked up during a visual scan, it has highlighted the importance of scanning each column for obvious errors and inconsistencies.
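The same comparison can be applied to every row at once. The sketch below is a hedged illustration on toy data reproducing the first three terms, including the 1997 typo; it assumes that each term ends on the day the next one begins, which may not hold for every dataset. The names `toy` and `flagged` are hypothetical.

```r
library(dplyr)

# Toy data reproducing the first three terms, including the 1997 typo.
toy <- tibble(
  PresNo    = 1:3,
  TermBegin = as.Date(c("1789-04-30", "1797-03-04", "1801-03-04")),
  TermEnd   = as.Date(c("1997-03-04", "1801-03-04", "1809-03-04"))
)

flagged <- toy %>%
  mutate(
    # Start date of the following term (NA for the last row).
    NextBegin = lead(TermBegin),
    # TRUE where a term does not end on the day the next one begins.
    Gap = !is.na(NextBegin) & TermEnd != NextBegin
  ) %>%
  filter(Gap)

flagged$PresNo  # 1
```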
First, the PresNo column was scanned for missing values using a combination of the which() and is.na() functions. The variable was then checked using the filter(), count() and dim() functions to ensure that there were no numbers less than 1 or greater than 45, and that no two Presidents had the same number.
which(is.na(Pres3$PresNo))
## integer(0)
Pres3 %>% filter(PresNo <1) %>% dim()
## [1] 0 11
Pres3 %>% filter(PresNo > 45) %>% dim()
## [1] 0 11
Pres3 %>% count(PresNo) %>% filter(n>1) %>% dim()
## [1] 0 2
Scan for NA values in the President variable:
which(is.na(Pres3$President))
## integer(0)
Scan for NA values in the PoliticalParty variable:
which(is.na(Pres3$PoliticalParty))
## integer(0)
Scan for NA values in the Birth variable:
which(is.na(Pres3$Birth))
## integer(0)
Manually scan the Birth column for inconsistencies:
Pres3$Birth
## [1] "1732-02-22" "1735-10-30" "1743-04-13" "1751-03-16" "1758-04-28"
## [6] "1767-07-11" "1767-03-15" "1782-12-05" "1773-02-09" "1790-03-29"
## [11] "1795-11-02" "1784-11-24" "1800-01-07" "1804-11-23" "1791-04-23"
## [16] "1809-02-12" "1808-12-29" "1822-04-27" "1822-10-04" "1831-11-19"
## [21] "1829-10-05" "1837-03-18" "1833-08-20" "1837-03-18" "1843-01-29"
## [26] "1858-10-27" "1857-09-15" "1858-12-28" "1865-11-02" "1872-07-04"
## [31] "1874-08-10" "1882-01-30" "1884-05-08" "1890-10-14" "1917-05-29"
## [36] "1908-08-27" "1913-01-09" "1913-07-14" "1924-10-01" "1911-02-06"
## [41] "1924-06-12" "1946-08-19" "1946-07-06" "1961-08-04" "1946-06-14"
Scan for NA values in the Death variable:
na_death <- which(is.na(Pres3$Death))
na_death
## [1] 39 41 42 43 44 45
The six NA values for date of death were checked manually:
Pres3$President[na_death]
## [1] "James Earl Carter" "George Herbert Walker Bush"
## [3] "William Jefferson Clinton" "George Walker Bush"
## [5] "Barack Obama" "Donald Trump"
George H W Bush died on 30 November 2018, so this value was imputed into the Death variable [5]. The age at death was updated using the difference between the two dates and the time_length() function.
Pres3$Death[41] <- as.Date("2018-11-30")
Pres3$DiedAged[41] <- (Pres3$Death[41]-Pres3$Birth[41]) %>% time_length(unit = "year") %>% round(0)
Pres3[41,c(2,5,10)]
Manually scan the Death column for inconsistencies:
Pres3$Death
## [1] "1799-12-14" "1826-07-04" "1826-07-04" "1836-06-28" "1831-07-04"
## [6] "1848-02-23" "1845-06-08" "1862-07-24" "1841-04-04" "1862-01-18"
## [11] "1849-06-15" "1850-07-09" "1874-05-08" "1869-10-08" "1868-06-01"
## [16] "1865-04-15" "1875-07-31" "1885-07-23" "1893-01-17" "1881-09-19"
## [21] "1886-11-18" "1908-06-24" "1901-03-13" "1908-06-24" "1901-09-14"
## [26] "1919-01-06" "1930-03-08" "1924-02-03" "1923-08-02" "1933-01-05"
## [31] "1964-10-20" "1945-04-12" "1972-12-26" "1969-03-28" "1963-11-22"
## [36] "1973-01-22" "1994-04-22" "2006-12-26" NA "2004-06-05"
## [41] "2018-11-30" NA NA NA NA
Scan for NA values in the TermBegin variable:
which(is.na(Pres3$TermBegin))
## integer(0)
Scan for NA values in the TermEnd variable:
which(is.na(Pres3$TermEnd))
## integer(0)
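The per-column NA scans can also be summarised in a single call. This is a minimal base R sketch on a made-up data frame, shown as a complement to the which(is.na()) checks rather than a replacement for them.

```r
# Toy data frame with a few missing values.
toy <- data.frame(
  PresNo = 1:4,
  Death  = as.Date(c("1799-12-14", NA, "1826-07-04", NA)),
  Pets   = c(4, 2, NA, 1)
)

# Number of missing values in every column, in a single call.
na_counts <- colSums(is.na(toy))
na_counts

# Row positions of the missing values in one column, as which(is.na()) gives.
which(is.na(toy$Death))  # 2 4
```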
Expecting an NA value for the 45th President, Donald Trump, this observation was checked manually:
Pres3[45,1:7]
As the term of US Presidents is set to the day, this date will be correct once Trump has fulfilled his term. However, as it is a future date and it is possible for a President to not finish their time in office, the decision was made to replace both the TermEnd date and TermLength_yrs with an NA value.
Pres3$TermEnd[45] <- NA
Pres3$TermLength_yrs[45] <- NA
Scan for missing or special values in the AgeInaugurated variable:
which(is.na(Pres3$AgeInaugurated))
## integer(0)
which(is.nan(Pres3$AgeInaugurated))
## integer(0)
Check for any unusual values in the AgeInaugurated variable:
Pres3 %>% filter(AgeInaugurated >75)
Pres3 %>% filter(AgeInaugurated <40)
Scan for missing or special values in the TermLength_yrs variable:
which(is.na(Pres3$TermLength_yrs))
## [1] 45
which(is.nan(Pres3$TermLength_yrs))
## integer(0)
Check for unusual values in the TermLength_yrs variable:
Pres3 %>% filter(TermLength_yrs <4)
Pres3 %>% filter(TermLength_yrs >8)
All unusual values were verified using public sources [6].
Scan for missing or special values in the DiedAged variable:
pres_alive <- which(is.na(Pres3$DiedAged))
pres_alive
## [1] 39 42 43 44 45
which(is.nan(Pres3$DiedAged))
## integer(0)
Double check these values:
Pres3[pres_alive,]
These values are consistent with Presidents that are still alive.
Check for unusual values in the DiedAged variable:
Pres3 %>% filter(DiedAged <45)
Pres3 %>% filter(DiedAged >95)
Scan for missing or special values in the PetCount variable:
which(is.na(Pres3$PetCount))
## [1] 24
which(is.nan(Pres3$PetCount))
## integer(0)
Check this observation for inconsistencies:
Pres3[24,]
The NA value in PetCount is for Grover Cleveland, who was the president who served two non-consecutive terms and whose PresNo was imputed earlier as “22”. As the observation detailing his second term is a separate observation, the decision was made to manually impute his PetCount from his first term.
Pres3$PetCount[24] <- Pres3$PetCount[22]
Check that PetCount remains numeric after imputation using the is.numeric() function and update using as.numeric().
is.numeric(Pres3$PetCount)
## [1] TRUE
Pres3$PetCount <- as.numeric(Pres3$PetCount)
Check for unusual values in the PetCount variable:
Pres3 %>% filter(PetCount >8)
Theodore Roosevelt and Calvin Coolidge both had PetCount values >8, but this may not have been unusual in the early 1900s and is unlikely to be an error.
Determine which variables are numeric in order to scan for outliers [7].
nums <- unlist(lapply(Pres3, is.numeric))
colnames(Pres3[ , nums])
## [1] "PresNo" "AgeInaugurated" "TermLength_yrs" "DiedAged"
## [5] "PetCount"
PresNo is a list from 1 to 45 and is not a measurement. It does not need to be scanned.
Scan for outliers in the AgeInaugurated variable using the hist(), boxplot() and qqPlot() functions.
hist(Pres3$AgeInaugurated)
boxplot(Pres3$AgeInaugurated)
qqPlot(Pres3$AgeInaugurated)
## [1] 45 40
Given that the histogram is close to normal, the boxplot shows no outliers and the points flagged by qqPlot() still fall inside the confidence envelope, the decision has been made not to remove them.
Scan for outliers in the TermLength_yrs variable using the hist(), boxplot() and qqPlot() functions.
hist(Pres3$TermLength_yrs)
boxplot(Pres3$TermLength_yrs)
qqPlot(Pres3$TermLength_yrs)
## [1] 32 9
Given the standardisation of term length in the US Constitution, it is not surprising that most of the values are 4 or 8 years. The 9th and 32nd values in the column TermLength_yrs are outliers according to the Q-Q plot, but not according to the boxplot. The decision has been made to leave these outliers, as they are of historical significance and are therefore valuable to anyone using this dataset.
Scan for outliers in the DiedAged variable using the hist(), boxplot() and qqPlot() functions.
hist(Pres3$DiedAged)
boxplot(Pres3$DiedAged)
qqPlot(Pres3$DiedAged)
## [1] 35 41
Given that the boxplot shows no outliers and the points flagged by qqPlot() still fall inside the confidence envelope, the decision has been made not to remove the points flagged in the Q-Q plot.
Scan for outliers in the PetCount variable using the hist(), boxplot() and qqPlot() functions.
hist(Pres3$PetCount)
boxplot(Pres3$PetCount)
qqPlot(Pres3$PetCount)
## [1] 26 30
Looking at the histogram, boxplot and QQ-Plot, there are two outliers that are, visually, well outside of the other data.
Test the variable PetCount to see if these cases are “extreme”, that is, more than 3 times the interquartile range above Q3 or below Q1.
Q1 <- quantile(Pres3$PetCount,probs = .25,na.rm = TRUE)
Q1
## 25%
## 1
Q3 <- quantile(Pres3$PetCount,probs = .75,na.rm = TRUE)
Q3
## 75%
## 3
iqr <- IQR(Pres3$PetCount, na.rm = TRUE)
Upper <- unname(Q3 + 3 * iqr)
Lower <- unname(Q1 - 3 * iqr)
Upper
## [1] 9
Lower
## [1] -5
which(Pres3$PetCount > Upper)
## [1] 26 30
which(Pres3$PetCount < Lower)
## integer(0)
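With Q1 = 1 and Q3 = 3, the interquartile range is 2, so the fences are 3 + 3 × 2 = 9 and 1 − 3 × 2 = −5. The same calculation can be wrapped in a small function; `iqr_fences()` is a hypothetical helper and the vector `x` is made up so that it has the same quartiles.

```r
# Fences k * IQR beyond the quartiles; k = 3 marks "extreme" values and
# k = 1.5 is the usual boxplot whisker rule.
iqr_fences <- function(x, k = 3) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  spread <- q[2] - q[1]
  c(lower = q[1] - k * spread, upper = q[2] + k * spread)
}

# Made-up counts with one very large value.
x <- c(0, 1, 1, 2, 2, 3, 3, 20)

f <- iqr_fences(x)
f  # lower -5, upper 9

# Positions of values outside the fences.
extreme <- which(x > f[["upper"]] | x < f[["lower"]])
extreme  # 8
```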
The decision was made to cap extreme outliers [8].
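cap <- function(x){
  # 5th and 95th percentiles are the replacement values; Q1 and Q3 set the fences
  quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ) )
  # Values more than 3 * IQR below Q1 are raised to the 5th percentile
  x[ x < quantiles[2] - 3*IQR(x) ] <- quantiles[1]
  # Values more than 3 * IQR above Q3 are lowered to the 95th percentile
  x[ x > quantiles[3] + 3*IQR(x) ] <- quantiles[4]
  x
}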
Pres3$PetCount <- Pres3$PetCount %>% cap()
Pres3$PetCount
## [1] 4 2 4 1 1 2 2 1 2 3 0 1 1 2 2 6 1 2 2 2 2 2 3 2 2 6 2 4 2 6 2 1 1 2 6 3 1 2
## [39] 2 3 1 2 3 1 0
The variable that has shown the least normal data and the most extreme outliers is PetCount. Analyse the capped PetCount variable for normality and outliers using the hist(), boxplot() and qqPlot() functions.
hist(Pres3$PetCount)
boxplot(Pres3$PetCount)
qqPlot(Pres3$PetCount)
## [1] 16 26
While there are still Q-Q Plot outliers present after the capping, the outliers are not as extreme. Rather than imputing or capping another set of outliers, the decision has been made to transform this data using numerical methods.
As the data is skewed to the right, use the sqrt() function to transform the variable and save as PetCountNormal.
Pres3$PetCountNormal <- sqrt(Pres3$PetCount)
Check the output of the transformation using hist() and qqPlot() functions.
hist(Pres3$PetCountNormal, breaks = 6)
qqPlot(Pres3$PetCountNormal)
## [1] 11 45
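A formal test can complement the visual check. The sketch below applies the Shapiro-Wilk test from base R to the capped PetCount values printed earlier, before and after the square-root transformation. It is offered as an optional extra rather than part of the original analysis, and no result is asserted here: with heavily tied count data the test may still reject normality after transformation.

```r
# Capped PetCount values as printed earlier in this report.
pet_count <- c(4, 2, 4, 1, 1, 2, 2, 1, 2, 3, 0, 1, 1, 2, 2, 6, 1, 2, 2,
               2, 2, 2, 3, 2, 2, 6, 2, 4, 2, 6, 2, 1, 1, 2, 6, 3, 1, 2,
               2, 3, 1, 2, 3, 1, 0)

# Shapiro-Wilk test: a small p-value is evidence against normality.
raw_test  <- shapiro.test(pet_count)
sqrt_test <- shapiro.test(sqrt(pet_count))

# Compare the two p-values side by side.
c(raw = raw_test$p.value, sqrt = sqrt_test$p.value)
```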
Saving PetCountNormal as a new variable allows the analyst to choose to either quote true historical data using PetCount, or to utilise PetCountNormal, which is closer to a normal distribution, for testing.
The Pres1, Pres2 and Pets datasets were able to be tidied, merged and cleaned into a single dataset named Pres3, holding a number of details about US Presidents.
The approach resulted in finding some inconsistencies and errors in the dataset, and these were able to be rectified to make the data clean and therefore more useful to the analyst.
Handling outliers appropriately and making data transformations as required meant the data is an accurate reflection of the historical events that it depicts, while many variables are still able to be tested as normal distributions.
Ultimately, a thorough approach has resulted in a complete, tidy and consistent dataset that is ready for perusal or analysis.
1. University of South Carolina 2018, Presidents: president_timeline.csv, University of South Carolina, accessed 1 October 2020, https://people.math.sc.edu/Burkardt/datasets/presidents/president_timelines.csv.
2. Amy Tikkanen 2019, Presidents of the United States, Encyclopaedia Britannica, accessed 4 October 2020, https://www.britannica.com/topic/Presidents-of-the-United-States-1846696.
3. Bradley Boehmke 2015, Scraping HTML Tables, Bradley Boehmke, accessed 2 October 2020, http://bradleyboehmke.github.io/2015/12/scraping-html-tables.html.
4. Robert S Summers 2020, Presidents of the United States Types of Pets, Robert S Summers, accessed 6 October 2020, https://www.potus.com/presidential-facts/types-of-pets/.
5. Wikipedia 2020, Death and state funeral of George H. W. Bush, Wikimedia Foundation Inc, accessed 6 October 2020, https://en.wikipedia.org/wiki/Death_and_state_funeral_of_George_H._W._Bush.
6. Wikipedia 2020, List of presidents of the United States by time in office, Wikimedia Foundation Inc, accessed 6 October 2020, https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States_by_time_in_office.
7. Mdsumner 2018, Selecting only numeric columns from a data frame, Stack Overflow, accessed 7 October 2020, https://stackoverflow.com/questions/5863097/selecting-only-numeric-columns-from-a-data-frame.
8. Dr. Anil Dolgun 2020, Transform: Data Transformation, Standardisation, and Reduction, RMIT, accessed 7 October 2020, http://rare-phoenix-161610.appspot.com/secured/Module_07.html.