library(readr)
library(foreign)
library(gdata)
## gdata: read.xls support for 'XLS' (Excel 97-2004) files ENABLED.
##
## gdata: read.xls support for 'XLSX' (Excel 2007+) files ENABLED.
##
## Attaching package: 'gdata'
## The following object is masked from 'package:stats':
##
## nobs
## The following object is masked from 'package:utils':
##
## object.size
## The following object is masked from 'package:base':
##
## startsWith
library(rvest)
##
## Attaching package: 'rvest'
## The following object is masked from 'package:readr':
##
## guess_encoding
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:gdata':
##
## combine, first, last
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
library(outliers)
library(MVN)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
## sROC 0.1-2 loaded
library(ggplot2)
library(knitr)
library(deductive)
library(validate)
##
## Attaching package: 'validate'
## The following object is masked from 'package:ggplot2':
##
## expr
## The following object is masked from 'package:dplyr':
##
## expr
library(forecast)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
The two data sets in used are US President Physical data, and US President data. The US President Physical data includes data on each US President that has been in office. The data contains information about their physical attributes like weight and height, as well as important dates. The US President data includes more information about their career and the important dates in their presidency.
In the below steps two data frames, US President Pysical Data and US President data were first merged into one data frame, presidentsInfo. presidentsInfo was then inspected to ensure each attribute was the correct type. Type conversions were preformed on attributed that needed to be converted. Once attributes were converted to the correct type, the data frame was tidied by combining attibutes to ensure tidy data principals. A new data frame was created, called usPresidents, so remove redundant attributes from the data frame. Two new variables were added to the data frame using the manipulate function. usPresidents was scanned for NA values. After identifying and inspecting NA values they were replaced according to best practice. All numeric attributes of the data frame was scanned for outliers, and were dealt with using the appropriate techniques. Finally data transformations were applied to attributes to decrease skewness and convert the distribution into a normal distribution.
US President Physical Data (physical) Source: https://www.kaggle.com/atmcfarland/historical-us-president-physical-data-more Variables: order - The order in which the presidents were in office name -The first and last name of each president height_cm - President’s height in centimeters height_in - President’s height in inches weight_kg - President’s weightmin kilograms weight_lb - President’s weight in pounds body_mass_index - President’s body mass index (kg/m^2) body_mass_index_range - States if the president’s BMI range as Underweight, Normal, Overweight, or Obese birth_day - Day of the month each President was born birth_month - Month the president was born birth_year - Year the president was born birth_date - Birthday of the president (dd-mm-yyyy) birthplace - City president was born birth_state - State president was born death_day - Day of the month the president died death_month - Month the president died death_year - Year the president died death_date - Date the president died (dd-mm-yyyy) death_age - Age the president was when they died astrological_sign -President;s astorlogical sign term_begin_day - Day of the month the president’s term started term_begin_month - Month president’s term started term_begin_year - Year president’s term started term_begin_date - Date the president’s term started (dd-mm-yyyy) term_end_day - Day of the month the president’s term ended term_end_month - Month president’s term ended term_end_year - Year president’s term ended term_end_date - Date the president’s term ended (dd-mm-yyyy) presidency_begin_age - Age of the president when they began their term presidency_end_age - Age of the president when they ended ended their term political_party - The political party the president belongs to corrected_iq - Corrected IQ of the president (taken from a study in 2006)
*Presidents who served more than one term, term start date and age both reference their first presidency and the term end date and age reference their second presidency (therefore their first term end date/age and second term start date/age are not present in the dataset)
US Presidents Data (presidents) Source: https://www.kaggle.com/harshitagpt/us-presidents start - Date when the president’s term started end - Date when the president’s term started president - Name of the president prior - Name of the president before becoming president party - The political party the president belongs to vice - The Vice President during the term
The two above data sets were opened using read.csv. The presidents data set’s second variable was rename to “order” as it is the same data as the physical data’s attribute “order” and this was the attribute in which the data sets were to be joined on. Both order attributes were converted to numerics to ensure they were of the same type and could be used as the joining key. An inner join was used on the two data sets to only observations in both sets were desired. The joined data set was renamed as presidentInfo.
setwd('/Users/gracywhelihan/Desktop/RMIT/DataWrangling/Assessments')
physical <- read.csv("PresidentPhysicalData.csv")
head(physical)
presidents <- read.csv("us_presidents.csv")
head(presidents)
#Renamed the
names(presidents)[2] <- "order"
physical$order <- as.numeric(physical$order)
presidents$order <- as.numeric(presidents$order)
presidentInfo <- inner_join(physical, presidents, key="order")
## Joining, by = "order"
The presidentInfo data set was inspected using str(). After inspection of each attribute, it was observed that there were attributes of type numeric, character, and integer. There were multiple attributes that needed to be converted. The body_mass_index_range attribute was converted from character to factor with 4 levels. The date attributes, birth_date,term_begin_date, term_end_date, and death_date were all converted from type character to date in the format DD-MM-YYYY. The political_party, astrological_sign, and birth_state attributes were also converted from character to factors, with 7, 12, 50 levels, respectively.
str(presidentInfo)
## 'data.frame': 45 obs. of 39 variables:
## $ order : num 1 2 3 4 5 6 7 8 9 10 ...
## $ name : chr "George Washington" "John Adams" "Thomas Jefferson" "James Madison" ...
## $ height_cm : int 188 170 189 163 183 171 185 168 173 183 ...
## $ height_in : num 74 67 74.5 64 72 67.5 73 66 68 72 ...
## $ weight_kg : num 79.4 83.9 82.1 55.3 85.7 92.1 63.5 74.4 73.5 72.6 ...
## $ weight_lb : int 175 185 181 122 189 203 140 164 162 160 ...
## $ body_mass_index : num 22.5 29 23 20.8 25.6 31.5 18.6 26.4 24.6 21.7 ...
## $ body_mass_index_range: chr "Normal" "Overweight" "Normal" "Normal" ...
## $ birth_day : int 22 30 13 16 28 11 15 5 9 29 ...
## $ birth_month : int 2 10 4 3 4 7 3 12 2 3 ...
## $ birth_year : int 1732 1735 1743 1751 1758 1767 1767 1782 1773 1790 ...
## $ birth_date : chr "22-02-1732" "30-10-1735" "13-04-1743" "16-03-1751" ...
## $ birthplace : chr "Westmoreland County" "Braintree" "Shadwell" "Port Conway" ...
## $ birth_state : chr "Virginia" "Massachusetts" "Virginia" "Virginia" ...
## $ death_day : int 14 4 4 28 4 23 8 24 4 18 ...
## $ death_month : int 12 7 7 6 7 2 6 7 4 1 ...
## $ death_year : int 1799 1826 1826 1836 1831 1848 1845 1862 1841 1862 ...
## $ death_date : chr "14-12-1799" "04-07-1826" "04-07-1826" "28-06-1836" ...
## $ death_age : int 67 90 83 85 73 80 78 79 68 71 ...
## $ astrological_sign : chr "Pisces" "Scorpio" "Aries" "Pisces" ...
## $ term_begin_day : int 30 4 4 4 4 4 4 4 4 4 ...
## $ term_begin_month : int 4 3 3 3 3 3 3 3 3 4 ...
## $ term_begin_year : int 1789 1797 1801 1809 1817 1825 1829 1837 1841 1841 ...
## $ term_begin_date : chr "30-04-1789" "04-03-1797" "04-03-1801" "04-03-1809" ...
## $ term_end_day : int 4 4 4 4 4 4 4 4 4 4 ...
## $ term_end_month : int 3 3 3 3 3 3 3 3 4 3 ...
## $ term_end_year : int 1797 1801 1809 1817 1825 1829 1837 1841 1841 1845 ...
## $ term_end_date : chr "04-03-1797" "04-03-1801" "04-03-1809" "04-03-1817" ...
## $ presidency_begin_age : int 57 61 57 57 58 57 61 54 68 51 ...
## $ presidency_end_age : int 65 65 65 65 66 61 69 58 68 54 ...
## $ political_party : chr "Unaffiliated" "Federalist" "Democratic-Republican" "Democratic-Republican" ...
## $ corrected_iq : int 140 155 160 160 139 175 145 146 146 148 ...
## $ X : int 0 1 2 3 4 5 6 7 8 9 ...
## $ start : chr "April 30, 1789" "March 4, 1797" "March 4, 1801" "March 4, 1809" ...
## $ end : chr "March 4, 1797" "March 4, 1801" "March 4, 1809" "March 4, 1817" ...
## $ president : chr "George Washington" "John Adams" "Thomas Jefferson" "James Madison" ...
## $ prior : chr "Commander-in-Chief of the Continental Army ( 1775–1783 )" "1st Vice President of the United States" "2nd Vice President of the United States" "5th United States Secretary of State (1801–1809)" ...
## $ party : chr "Nonpartisan [13]" "Federalist" "Democratic- Republican" "Democratic- Republican" ...
## $ vice : chr "John Adams" "Thomas Jefferson" "Aaron Burr" "George Clinton" ...
#convert body_mass_index_range into a factor
presidentInfo$body_mass_index_range <- presidentInfo$body_mass_index_range %>%
factor(levels=c("Normal", "Overweight", "Obese", "Morbidly Obese"), labels = c("Normal", "Overweight", "Obese", "Morbidly Obese"))
#convert all of the date variable from character to date
presidentInfo$birth_date <- as.Date(presidentInfo$birth_date, "%d-%m-%Y")
presidentInfo$term_begin_date <- as.Date(presidentInfo$term_begin_date, "%d-%m-%Y")
presidentInfo$term_end_date <- as.Date(presidentInfo$term_end_date, "%d-%m-%Y")
presidentInfo$death_date <- as.Date(presidentInfo$death_date, "%d-%m-%Y")
#convert political_party from character to factor
presidentInfo$political_party <- presidentInfo$political_party %>%
factor(levels=c( "Democrat", "Republican", "Unaffiliated", "Federalist", "Democratic-Republican", "Whig", "National Union"), labels = c( "Democrat", "Republican", "Unaffiliated", "Federalist", "Democratic-Republican", "Whig", "National Union"))
#convert astrological_sign from a character to a factor
presidentInfo$astrological_sign <- presidentInfo$astrological_sign %>%
factor(levels=c("Aquarius", "Pisces", "Aries", "Taurus", "Gemini", "Cancer", "Leo", "Virgo", "Libran", "Scorpio", "Sagittarius", "Capricorn"), labels = c( "Aquarius", "Pisces", "Aries", "Taurus", "Gemini", "Cancer", "Leo", "Virgo", "Libra", "Scorpio", "Sagittarius", "Capricorn"))
#convert birth_state from character to factor
presidentInfo$birth_state <- presidentInfo$birth_state %>%
factor(levels=c("Alabama","Alaska","Arizona","Arkansas","California","Colorado","Connecticut","Delaware","Florida","Georgia","Hawaii","Idaho","Illinois","Indiana","Iowa","Kansas","Kentucky","Louisiana","Maine","Maryland","Massachusetts","Michigan","Minnesota","Mississippi","Missouri","Montana","Nebraska","Nevada","New Hampshire","New Jersey","New Mexico","New York","North Carolina","North Dakota","Ohio","Oklahoma","Oregon","Pennsylvania","Rhode Island","South Carolina","South Dakota","Tennessee","Texas","Utah","Vermont","Virginia","Washington","West Virginia","Wisconsin","Wyoming"), labels = c("Alabama","Alaska","Arizona","Arkansas","California","Colorado","Connecticut","Delaware","Florida","Georgia","Hawaii","Idaho","Illinois","Indiana","Iowa","Kansas","Kentucky","Louisiana","Maine","Maryland","Massachusetts","Michigan","Minnesota","Mississippi","Missouri","Montana","Nebraska","Nevada","New Hampshire","New Jersey","New Mexico","New York","North Carolina","North Dakota","Ohio","Oklahoma","Oregon","Pennsylvania","Rhode Island","South Carolina","South Dakota","Tennessee","Texas","Utah","Vermont","Virginia","Washington","West Virginia","Wisconsin","Wyoming"))
In this merged data set, presidentInfo, the birth_day, birth_month, birth_year, death_day, death_month, death_year, term_begin_day, term_begin_month, term_begin_year, term_end_day, term_end_month, term_end_year attributes contain the dates of each presidents birth, death, begin date of presidency, and end date of presidency. This does not conform to the tidy data principal that states, each attribute must have it’s own column, because all of these dates, day, month, and year should be reflected in one column. Therefore, the birth, death, term_begin, and term_end dates need to be combined into one attribute to represent each date, birth_date_combined, death_date_combined, term_begin_date_combined, and term_end_date_combined. This will be executed by using the unite() function from tidyr.
There are also redundant attributes that will be filtered out so the data sett only includes each attribute once. This data set will be called usPresidents, and will use the select() function from the dplyr package to include only one of each variable. The birth_date_combined, death_date_combined, term_begin_date_combined, and term_end_date_combined, start, end, name, and party variables were not selected for the usPresidents data frame because they contained information that is the same, but maybe in a different format, as order, president, birth_date, death_date, term_begin_date, term_end_date, and political party.
presidentInfo <- presidentInfo %>% unite(birth_date_combined, birth_day, birth_month, birth_year)
presidentInfo <- presidentInfo %>% unite(death_date_combined, death_day, death_month, death_year)
presidentInfo<- presidentInfo %>% unite(term_begin_date_combined, term_begin_day, term_begin_month, term_begin_year)
presidentInfo <- presidentInfo %>% unite(term_end_date_combined, term_end_day, term_end_month, term_end_year)
usPresidents <- presidentInfo %>% select(order, president, height_cm, height_in, weight_kg, weight_lb, body_mass_index, body_mass_index_range, birth_date, birthplace, birth_state, death_date, death_age, astrological_sign, term_begin_date, term_end_date, presidency_begin_age, presidency_end_age, political_party, corrected_iq, prior, vice)
An attribute days_in_office is created using the mutate() function from the dplyr package. This attribute is created by subtracting the term_begin_date from the term_end_date to get the number of days each president sent in office. Another attribute called years_in_office is created to represent the number of years each president spent in office by dividing the time_in_office variable by 365. The variable time_in_office must be separated into two variables, days_in_office which will contain the number of days, and the unit, which is days. The unit does not need to be included in the data set because the attribute name days_in_office reflects that the unit is in terms of days. The attribute days_in_office is of type character and it must be an integer, so it is converted to an interger type. Then it can be divided by 365 to create the variable years_in_office.
usPresidents <- mutate(usPresidents, time_in_office = term_end_date - term_begin_date)
usPresidents <- usPresidents %>% separate(time_in_office, into = c("days_in_office"), sep = " ")
usPresidents$days_in_office <- as.numeric(usPresidents$days_in_office)
usPresidents <- mutate(usPresidents, years_in_office = days_in_office/365)
head(usPresidents)
Scan the data for missing values, special values and obvious errors (i.e. inconsistencies). In this step, you should fulfil the minimum requirement #7. In addition to the R codes and outputs, explain your methodology (i.e. explain why you have chosen that methodology and the actions that you have taken to handle these values) and communicate your results clearly.
The is.na() function was used along with looking at the data set, and showed there are 9 NA values. The NA values were found in the body_mass_index_range, death_age, and corrected_iq.
For the death_age the NA values will be replaced with the average 71.025. Although these presidents with the NA values in death_age are still alive and their actual death age will be older, as is the case for all except Barack Obama as he is the only president younger than 71.025 at this time. Using the mean will make clear that these presidents are still alive, and the death_age data is not effected by them at this time.
For corrected_iq the NA values will also be replaced with the mean value for corrected_iq. The corrected_iq variable was calculated before Trump and Biden were in office, therefore using the mean value will let us more accurately reflect the corrected_iq attributed for the presidents who have recorded IQs. The mean here is rounded because IQs are reflected as whole numbers.
For body_mass_index_range the two NA values are from the same president, who served two non consecutive terms. Also, the body_mass_index_range is based off of the body_mass_index, if the bmi is between 30 and 40 that means the bmi range is “Obese”. For these NA values the body_mass_index is 36.4 therefore the NA values should be replaced with the level labeled “Obese”. If there were more NA values like this it would be best to make a rule and impute it using the deductive and validate packages, but because there are only to NA values and they both need to be replaced with “Obese” they were replaced using is.na.
For the death_date variable there are blank values, as these presidents have not died yet. These will be left blank as blank cells are different than NA, and this indicates these presidents are still alive.
is.na(usPresidents)
## order president height_cm height_in weight_kg weight_lb body_mass_index
## [1,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [2,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [3,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [4,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [5,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [6,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [7,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [8,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [9,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [10,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [11,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [14,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [15,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [16,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [17,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [18,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [19,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [20,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [21,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [22,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [23,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [24,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [26,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [27,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [28,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [29,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [30,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [31,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [32,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [33,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [34,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [35,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [36,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [37,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [38,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [39,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [40,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [41,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [42,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [43,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [44,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [45,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## body_mass_index_range birth_date birthplace birth_state death_date
## [1,] FALSE FALSE FALSE FALSE FALSE
## [2,] FALSE FALSE FALSE FALSE FALSE
## [3,] FALSE FALSE FALSE FALSE FALSE
## [4,] FALSE FALSE FALSE FALSE FALSE
## [5,] FALSE FALSE FALSE FALSE FALSE
## [6,] FALSE FALSE FALSE FALSE FALSE
## [7,] FALSE FALSE FALSE FALSE FALSE
## [8,] FALSE FALSE FALSE FALSE FALSE
## [9,] FALSE FALSE FALSE FALSE FALSE
## [10,] FALSE FALSE FALSE FALSE FALSE
## [11,] FALSE FALSE FALSE FALSE FALSE
## [12,] FALSE FALSE FALSE FALSE FALSE
## [13,] FALSE FALSE FALSE FALSE FALSE
## [14,] FALSE FALSE FALSE FALSE FALSE
## [15,] FALSE FALSE FALSE FALSE FALSE
## [16,] FALSE FALSE FALSE FALSE FALSE
## [17,] FALSE FALSE FALSE FALSE FALSE
## [18,] FALSE FALSE FALSE FALSE FALSE
## [19,] FALSE FALSE FALSE FALSE FALSE
## [20,] FALSE FALSE FALSE FALSE FALSE
## [21,] FALSE FALSE FALSE FALSE FALSE
## [22,] TRUE FALSE FALSE FALSE FALSE
## [23,] FALSE FALSE FALSE FALSE FALSE
## [24,] TRUE FALSE FALSE FALSE FALSE
## [25,] FALSE FALSE FALSE FALSE FALSE
## [26,] FALSE FALSE FALSE FALSE FALSE
## [27,] FALSE FALSE FALSE FALSE FALSE
## [28,] FALSE FALSE FALSE FALSE FALSE
## [29,] FALSE FALSE FALSE FALSE FALSE
## [30,] FALSE FALSE FALSE FALSE FALSE
## [31,] FALSE FALSE FALSE FALSE FALSE
## [32,] FALSE FALSE FALSE FALSE FALSE
## [33,] FALSE FALSE FALSE FALSE FALSE
## [34,] FALSE FALSE FALSE FALSE FALSE
## [35,] FALSE FALSE FALSE FALSE FALSE
## [36,] FALSE FALSE FALSE FALSE FALSE
## [37,] FALSE FALSE FALSE FALSE FALSE
## [38,] FALSE FALSE FALSE FALSE FALSE
## [39,] FALSE FALSE FALSE FALSE TRUE
## [40,] FALSE FALSE FALSE FALSE FALSE
## [41,] FALSE FALSE FALSE FALSE FALSE
## [42,] FALSE FALSE FALSE FALSE TRUE
## [43,] FALSE FALSE FALSE FALSE TRUE
## [44,] FALSE FALSE FALSE FALSE TRUE
## [45,] FALSE FALSE FALSE FALSE TRUE
## death_age astrological_sign term_begin_date term_end_date
## [1,] FALSE FALSE FALSE FALSE
## [2,] FALSE FALSE FALSE FALSE
## [3,] FALSE FALSE FALSE FALSE
## [4,] FALSE FALSE FALSE FALSE
## [5,] FALSE FALSE FALSE FALSE
## [6,] FALSE FALSE FALSE FALSE
## [7,] FALSE FALSE FALSE FALSE
## [8,] FALSE FALSE FALSE FALSE
## [9,] FALSE FALSE FALSE FALSE
## [10,] FALSE FALSE FALSE FALSE
## [11,] FALSE FALSE FALSE FALSE
## [12,] FALSE FALSE FALSE FALSE
## [13,] FALSE FALSE FALSE FALSE
## [14,] FALSE FALSE FALSE FALSE
## [15,] FALSE FALSE FALSE FALSE
## [16,] FALSE FALSE FALSE FALSE
## [17,] FALSE FALSE FALSE FALSE
## [18,] FALSE FALSE FALSE FALSE
## [19,] FALSE FALSE FALSE FALSE
## [20,] FALSE FALSE FALSE FALSE
## [21,] FALSE FALSE FALSE FALSE
## [22,] FALSE FALSE FALSE FALSE
## [23,] FALSE FALSE FALSE FALSE
## [24,] FALSE FALSE FALSE FALSE
## [25,] FALSE FALSE FALSE FALSE
## [26,] FALSE FALSE FALSE FALSE
## [27,] FALSE FALSE FALSE FALSE
## [28,] FALSE FALSE FALSE FALSE
## [29,] FALSE FALSE FALSE FALSE
## [30,] FALSE FALSE FALSE FALSE
## [31,] FALSE FALSE FALSE FALSE
## [32,] FALSE FALSE FALSE FALSE
## [33,] FALSE FALSE FALSE FALSE
## [34,] FALSE FALSE FALSE FALSE
## [35,] FALSE FALSE FALSE FALSE
## [36,] FALSE FALSE FALSE FALSE
## [37,] FALSE FALSE FALSE FALSE
## [38,] FALSE FALSE FALSE FALSE
## [39,] TRUE FALSE FALSE FALSE
## [40,] FALSE FALSE FALSE FALSE
## [41,] FALSE FALSE FALSE FALSE
## [42,] TRUE FALSE FALSE FALSE
## [43,] TRUE FALSE FALSE FALSE
## [44,] TRUE FALSE FALSE FALSE
## [45,] TRUE FALSE FALSE FALSE
## presidency_begin_age presidency_end_age political_party corrected_iq
## [1,] FALSE FALSE FALSE FALSE
## [2,] FALSE FALSE FALSE FALSE
## [3,] FALSE FALSE FALSE FALSE
## [4,] FALSE FALSE FALSE FALSE
## [5,] FALSE FALSE FALSE FALSE
## [6,] FALSE FALSE FALSE FALSE
## [7,] FALSE FALSE FALSE FALSE
## [8,] FALSE FALSE FALSE FALSE
## [9,] FALSE FALSE FALSE FALSE
## [10,] FALSE FALSE FALSE FALSE
## [11,] FALSE FALSE FALSE FALSE
## [12,] FALSE FALSE FALSE FALSE
## [13,] FALSE FALSE FALSE FALSE
## [14,] FALSE FALSE FALSE FALSE
## [15,] FALSE FALSE FALSE FALSE
## [16,] FALSE FALSE FALSE FALSE
## [17,] FALSE FALSE FALSE FALSE
## [18,] FALSE FALSE FALSE FALSE
## [19,] FALSE FALSE FALSE FALSE
## [20,] FALSE FALSE FALSE FALSE
## [21,] FALSE FALSE FALSE FALSE
## [22,] FALSE FALSE FALSE FALSE
## [23,] FALSE FALSE FALSE FALSE
## [24,] FALSE FALSE FALSE FALSE
## [25,] FALSE FALSE FALSE FALSE
## [26,] FALSE FALSE FALSE FALSE
## [27,] FALSE FALSE FALSE FALSE
## [28,] FALSE FALSE FALSE FALSE
## [29,] FALSE FALSE FALSE FALSE
## [30,] FALSE FALSE FALSE FALSE
## [31,] FALSE FALSE FALSE FALSE
## [32,] FALSE FALSE FALSE FALSE
## [33,] FALSE FALSE FALSE FALSE
## [34,] FALSE FALSE FALSE FALSE
## [35,] FALSE FALSE FALSE FALSE
## [36,] FALSE FALSE FALSE FALSE
## [37,] FALSE FALSE FALSE FALSE
## [38,] FALSE FALSE FALSE FALSE
## [39,] FALSE FALSE FALSE FALSE
## [40,] FALSE FALSE FALSE FALSE
## [41,] FALSE FALSE FALSE FALSE
## [42,] FALSE FALSE FALSE FALSE
## [43,] FALSE FALSE FALSE FALSE
## [44,] FALSE FALSE FALSE TRUE
## [45,] FALSE FALSE FALSE TRUE
## prior vice days_in_office years_in_office
## [1,] FALSE FALSE FALSE FALSE
## [2,] FALSE FALSE FALSE FALSE
## [3,] FALSE FALSE FALSE FALSE
## [4,] FALSE FALSE FALSE FALSE
## [5,] FALSE FALSE FALSE FALSE
## [6,] FALSE FALSE FALSE FALSE
## [7,] FALSE FALSE FALSE FALSE
## [8,] FALSE FALSE FALSE FALSE
## [9,] FALSE FALSE FALSE FALSE
## [10,] FALSE FALSE FALSE FALSE
## [11,] FALSE FALSE FALSE FALSE
## [12,] FALSE FALSE FALSE FALSE
## [13,] FALSE FALSE FALSE FALSE
## [14,] FALSE FALSE FALSE FALSE
## [15,] FALSE FALSE FALSE FALSE
## [16,] FALSE FALSE FALSE FALSE
## [17,] FALSE FALSE FALSE FALSE
## [18,] FALSE FALSE FALSE FALSE
## [19,] FALSE FALSE FALSE FALSE
## [20,] FALSE FALSE FALSE FALSE
## [21,] FALSE FALSE FALSE FALSE
## [22,] FALSE FALSE FALSE FALSE
## [23,] FALSE FALSE FALSE FALSE
## [24,] FALSE FALSE FALSE FALSE
## [25,] FALSE FALSE FALSE FALSE
## [26,] FALSE FALSE FALSE FALSE
## [27,] FALSE FALSE FALSE FALSE
## [28,] FALSE FALSE FALSE FALSE
## [29,] FALSE FALSE FALSE FALSE
## [30,] FALSE FALSE FALSE FALSE
## [31,] FALSE FALSE FALSE FALSE
## [32,] FALSE FALSE FALSE FALSE
## [33,] FALSE FALSE FALSE FALSE
## [34,] FALSE FALSE FALSE FALSE
## [35,] FALSE FALSE FALSE FALSE
## [36,] FALSE FALSE FALSE FALSE
## [37,] FALSE FALSE FALSE FALSE
## [38,] FALSE FALSE FALSE FALSE
## [39,] FALSE FALSE FALSE FALSE
## [40,] FALSE FALSE FALSE FALSE
## [41,] FALSE FALSE FALSE FALSE
## [42,] FALSE FALSE FALSE FALSE
## [43,] FALSE FALSE FALSE FALSE
## [44,] FALSE FALSE FALSE FALSE
## [45,] FALSE FALSE FALSE FALSE
colSums(is.na(usPresidents))
## order president height_cm
## 0 0 0
## height_in weight_kg weight_lb
## 0 0 0
## body_mass_index body_mass_index_range birth_date
## 0 2 0
## birthplace birth_state death_date
## 0 0 5
## death_age astrological_sign term_begin_date
## 5 0 0
## term_end_date presidency_begin_age presidency_end_age
## 0 0 0
## political_party corrected_iq prior
## 0 2 0
## vice days_in_office years_in_office
## 0 0 0
which(is.na(usPresidents$body_mass_index_range))
## [1] 22 24
usPresidents$death_age[is.na(usPresidents$death_age)] <- mean(usPresidents$death_age, na.rm = TRUE)
usPresidents$corrected_iq[is.na(usPresidents$corrected_iq)] <- round(mean(usPresidents$corrected_iq, na.rm = TRUE))
usPresidents$body_mass_index_range[is.na(usPresidents$body_mass_index_range)]<-"Obese"
Scan the numeric data for outliers. In this step, you should fulfil the minimum requirement #8. In addition to the R codes and outputs, explain your methodology (i.e. explain why you have chosen that methodology and the actions that you have taken to handle these values) and communicate your results clearly.
Boxplots of each attribute were used to determine outliers for numeric attributes in the usPresidents data set. The numeric attributes with outliers are weight_km, weight_lb, body_mass_index, presidency_begin_age, and corrected_iq. After inspecting each variable it was determined that none of the outliers in the variable were results of data entry or processing. Therefore capping was used to preplace the outliers with the nearest value that is not an outlier. In the case of this data set it may be valuable to keep the outliers for these variables because there are so few and they are accurately entered. Keeping the ouliers could be interesting as they show the anomalies with president’s weight and IQ, but instead capping will be used. A user defined capping function was used, along with a subset of the attributes that have outlier.
#usPresidents$height_cm %>% boxplot(main="Boxplot of Presidents Height (cm)", ylab="Height (cm)", col="pink")
#usPresidents$height_in %>% boxplot(main="Boxplot of Presidents Height (in)", ylab="Height (cm)", col="pink")
usPresidents$weight_kg %>% boxplot(main="Boxplot of Presidents Weight (kg)", ylab="Weight (kg)", col="pink")
usPresidents$weight_lb %>% boxplot(main="Boxplot of Presidents Weight (lb)", ylab="Weight (lb)", col="pink")
#usPresidents$body_mass_index %>% boxplot(main="Boxplot of BMI in Presidents", ylab="BMI", col = "blue")
#usPresidents$death_age %>% boxplot(main="Boxplot of Presidents age at death", ylab="Age at Death", col = "grey")
#usPresidents$presidency_begin_age %>% boxplot(main="Boxplot of Presidency Begin Age", ylab="Presidency Begin Age", col="green")
#usPresidents$presidency_end_age %>% boxplot(main="Boxplot of Presidency End Age", ylab="Presidency End Age", col="red")
#usPresidents$corrected_iq %>% boxplot(main="Boxplot of President Corrected IQ", ylab="Corrected IQ", col="yellow")
usPresidents$days_in_office %>% boxplot(main="Boxplot of Presidents Days In Office", ylab="Days in Office", col="orange")
#usPresidents$years_in_office %>% boxplot(main="Boxplot of Presidents Years In Office", ylab="Years in Office", col="purple")
boxplot(usPresidents$days_in_office ~ usPresidents$political_party, main="Days in Office by Political Party", ylab = "Days in office", xlab = "Political Party")
cap <- function(x){
quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ) )
x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
x}
usPresidents_sub <- usPresidents %>% select(weight_kg, weight_lb, body_mass_index, presidency_begin_age, corrected_iq)
usPresidents_capped <- sapply(usPresidents_sub, cap)
summary(usPresidents_capped)
## weight_kg weight_lb body_mass_index presidency_begin_age
## Min. : 55.30 Min. :122.0 Min. :18.60 Min. :42.00
## 1st Qu.: 75.70 1st Qu.:167.0 1st Qu.:23.40 1st Qu.:51.00
## Median : 81.60 Median :180.0 Median :25.50 Median :54.00
## Mean : 84.75 Mean :186.8 Mean :26.18 Mean :54.73
## 3rd Qu.: 90.70 3rd Qu.:200.0 3rd Qu.:28.60 3rd Qu.:58.00
## Max. :115.82 Max. :255.4 Max. :36.40 Max. :68.00
## corrected_iq
## Min. :130.0
## 1st Qu.:141.0
## Median :145.0
## Mean :146.4
## 3rd Qu.:151.0
## Max. :160.0
The ln, log, sqrt, and reciprocal transformations were performed on the weight_kg data because it is right skewed, and these transformations are used to reduce right skewness. The reciprocal transformation worked the best for decreasing the skewness of the weight_kg attribute.
The Box-Cox transformation was used on the attribute days_in_office because this attribute is not normally distributed. After performing the Box-Cox transformation the days_in_office is more normally distributed, which is important for performing statistical statistical hypothesis tests.
# This is the R chunk for the Transform Section
usPresidents$weight_kg %>% hist(main="Boxplot of Presidents Weight (kg)", ylab="Weight (kg)", col="grey")
ln_weight_kg <- log(usPresidents$weight_kg)
hist(ln_weight_kg)
log_weight_kg <- log10(usPresidents$body_mass_index)
hist(log_weight_kg)
recip_weight_kg <- 1/usPresidents$weight_kg #worked the best
hist(recip_weight_kg)
usPresidents$days_in_office %>% hist(main="Boxplot of Presidents Days In Office", ylab="Days in Office", col="orange")
box_cox_days_in_office <- BoxCox(usPresidents$days_in_office,lambda = "auto")
hist(box_cox_days_in_office)