library(readr)
library(readxl)
library(magrittr)
library(dplyr)
library(stringr)
library(Hmisc)
library(forecast)
My requirement was to create a data set of gun violence and poverty by state for the United States in 2015 that reflects male and female participation in gun violence incidents and the total number of people killed or injured. To do this I merged the two data sets described below using left joins, with the help of an intermediate data set of state names and abbreviations.
The data types were examined and changed where necessary. The date was already in Date format. States were converted from character to unordered factors and doubles to integers, variables were renamed, and the poverty data was grouped and summed by state. Irrelevant information was eliminated. The gun violence data was untidy, so the participant gender field was mutated into separate Male and Female variables, and the killed and injured variables were combined into a single killed/injured variable.
Missing data occurred primarily in the Male and Female variables. The small number of these observations with more than two people killed or injured were deleted; the remainder were imputed with median values. The poverty data set had one missing value, which was deleted. At this point the gun violence data was grouped and summed by state, and the other data sets were merged onto it with left joins.
The final four numeric variables were scanned for outliers. Outliers in the poverty variable accounted for 8% of the observations. For the purpose of the assignment they were deleted; in a real-world situation I would look for alternative ways of dealing with them, but that requires considerable analysis and manipulation beyond the scope of this assignment. Outliers in the remaining variables accounted for a further 4% of the observations and were also removed. Finally, the killed/injured variable was transformed. I found that applying a root of 3.61 (raising to the power 1/3.61) came closest to a normal distribution.
The requirement that at least two data sets be merged is carried out in the Tidy & Manipulate II section. In this instance it is far more efficient to structure each data set so that they can be merged with a join than to merge first with ‘cbind’ and create a data set with mismatched observations.
The Gun Violence Data was obtained from http://www.kaggle.com. The data set was compiled by James Ko. The original data was sourced from: http://www.gunviolencearchive.org/
The dataset contains the following variables:
incident_id: an incident ID
date: date of the incident
state: State
city_or_county: City or County
address: Address
n_killed: Number of People killed
n_injured: Number of People injured
incident_url: A URL for the incident
source_url: A URL for the source of the information
incident_url_fields_missing: Boolean for missing incident URL fields
congressional_district: Congressional District
gun_stolen: Gun Stolen
gun_type: Type of Gun
incident_characteristics: Characteristics of the incident
latitude: Latitude of the incident
location_description: Description of the location
longitude: Longitude of the incident
n_guns_involved: Number of guns involved
notes: Notes
participant_age: Age of incident participant
participant_age_group: Age group of the incident participant
participant_gender: Gender of the participant
participant_name: Name of the participant
participant_relationship: Relationship of the participant
participant_status: Status of the participant
participant_type: Type of participant
sources: Sources
state_house_district: State House District
state_senate_district: State Senate District
The Poverty Estimates Data Set was obtained from http://www.kaggle.com.
The dataset contains the following variables:
FIPStxt: State-County FIPS Code
State: State Abbreviation
Area_name: Area name
POVALL_2015: Estimate of people of all ages in poverty 2015
CI90LBAll_2015: 90% confidence interval lower bound of estimate of people of all ages in poverty 2015
CI90UBALL_2015: 90% confidence interval upper bound of estimate of people of all ages in poverty 2015
PCTPOVALL_2015: Estimated percent of people of all ages in poverty 2015
CI90LBALLP_2015: 90% confidence interval lower bound of estimate of percent of people of all ages in poverty 2015
CI90UBALLP_2015: 90% confidence interval upper bound of estimate of percent of people of all ages in poverty 2015
POV017_2015: Estimate of people age 0-17 in poverty 2015
CI90LB017_2015: 90% confidence interval lower bound of estimate of people age 0-17 in poverty 2015
CI90UB017_2015: 90% confidence interval upper bound of estimate of people age 0-17 in poverty 2015
PCTPOV017_2015: Estimated percent of people age 0-17 in poverty 2015
CI90LB017P_2015: 90% confidence interval lower bound of estimate of percent of people age 0-17 in poverty 2015
CI90UB017P_2015: 90% confidence interval upper bound of estimate of percent of people age 0-17 in poverty 2015
POV517_2015: Estimate of related children age 5-17 in families in poverty 2015
CI90LB517_2015: 90% confidence interval lower bound of estimate of related children age 5-17 in families in poverty 2015
CI90UB517_2015: 90% confidence interval upper bound of estimate of related children age 5-17 in families in poverty 2015
PCTPOV517_2015: Estimated percent of related children age 5-17 in families in poverty 2015
CI90LB517P_2015: 90% confidence interval lower bound of estimate of percent of related children age 5-17 in families in poverty 2015
CI90UB517P_2015: 90% confidence interval upper bound of estimate of percent of related children age 5-17 in families in poverty 2015
MEDHHINC_2015: Estimate of median household income 2015
CI90LBINC_2015: 90% confidence interval lower bound of estimate of median household income 2015
CI90UBINC_2015: 90% confidence interval upper bound of estimate of median household income 2015
POV05_2015: Estimate of people under age 5 in poverty 2015 (available for the U.S. and State total only)
CI90LB05_2015: 90% confidence interval lower bound of estimate of people under age 5 in poverty 2015
CI90UB05_2015: 90% confidence interval upper bound of estimate of people under age 5 in poverty 2015
PCTPOV05_2015: Estimated percent of people under age 5 in poverty 2015
CI90LB05P_2015: 90% confidence interval lower bound of estimate of percent of people under age 5 in poverty 2015
CI90UB05P_2015: 90% confidence interval upper bound of estimate of percent of people under age 5 in poverty 2015
Sources: Census Bureau Population Estimates: http://www.census.gov/did/www/saipe/
USDA, Economic Research Service, Rural Classifications: http://www.ers.usda.gov/topics/rural-economy-population/rural-classifications.aspx
Contact: Tim Parker tparker@ers.usda.gov
I compiled the US States dataset myself from information sourced from: https://abbreviations.yourdictionary.com/articles/state-abbrev.html
# Import the gun violence data, check dimensions and view the initial 6 rows
gun_violence_data <- read_csv("gun-violence-data.csv")
## Parsed with column specification:
## cols(
## .default = col_character(),
## incident_id = col_double(),
## date = col_date(format = ""),
## n_killed = col_double(),
## n_injured = col_double(),
## incident_url_fields_missing = col_logical(),
## congressional_district = col_double(),
## latitude = col_double(),
## longitude = col_double(),
## n_guns_involved = col_double(),
## state_house_district = col_double(),
## state_senate_district = col_double()
## )
## See spec(...) for full column specifications.
dim(gun_violence_data)
## [1] 239677 29
head(gun_violence_data)
# Import the poverty estimates data, check dimensions and view the initial 6 rows
PovertyEstimates <- read_excel("PovertyEstimates.xls",
sheet = "Poverty Data 2015")
dim(PovertyEstimates)
## [1] 3194 24
head(PovertyEstimates)
# Import the us states data, check dimensions and view the initial 6 rows
us_states <- read_csv("us_states.csv")
## Parsed with column specification:
## cols(
## State = col_character(),
## Abbr = col_character()
## )
dim(us_states)
## [1] 59 2
head(us_states)
For each data set I will eliminate some variables and observations, as they are not required for my final data set. It is preferable to eliminate this data first to avoid conflicts when restructuring the required data. The final data set will compare gun violence incidents and gender participation in 2015 with poverty estimates for each state in the United States in that same year.
With respect to the Gun Violence Data set, I will eliminate all variables other than ‘date’, ‘state’, ‘n_killed’, ‘n_injured’, ‘participant_gender’. I will eliminate all observations other than those that occurred in 2015. Date will also be removed as it will no longer be required, but not until the data has been tidied.
‘date’ is already in Date format, so no need for any data type conversions. ‘state’ is a character type. It will remain this way until the tidy process. ‘n_killed’ and ‘n_injured’ will be converted to integers. ‘participant_gender’ is of character type. It is untidy and will be dealt with later.
With respect to the Poverty Estimates data set, I will eliminate all variables other than ‘State’ and ‘POVALL_2015’. I will remove the first observation as it is a total of all observations. The observations will be grouped by state and the poverty estimates summed for each state. ‘State’ has the data type character. I will change this to an unordered factor as it is a categorical variable. ‘POVALL_2015’ is a double. As it counts people, which are whole numbers, I will change the data type from double to integer. Finally, I will rename the variable ‘POVALL_2015’ to ‘Poverty’ for convenience and readability.
# Subset for variables "date", "state", "n_killed", "n_injured", "participant_gender".
gv <- gun_violence_data[, c("date", "state", "n_killed", "n_injured", "participant_gender")]
# Check the data types
str(gv)
## Classes 'tbl_df', 'tbl' and 'data.frame': 239677 obs. of 5 variables:
## $ date : Date, format: "2013-01-01" "2013-01-01" ...
## $ state : chr "Pennsylvania" "California" "Ohio" "Colorado" ...
## $ n_killed : num 0 1 1 4 2 4 5 0 0 1 ...
## $ n_injured : num 4 3 3 0 2 0 0 5 4 6 ...
## $ participant_gender: chr "0::Male||1::Male||3::Male||4::Female" "0::Male" "0::Male||1::Male||2::Male||3::Male||4::Male" "0::Female||1::Male||2::Male||3::Male" ...
# The date data type is Date, so no need to make any data type conversions for date. However, there are observations other than for the year 2015, so I will subset for these, leaving only observations for 2015.
gv <- gv[gv$date >= '2015-01-01' & gv$date <= '2015-12-31', ]
# Change the data type of "n_killed" to an integer.
gv$n_killed <- as.integer(gv$n_killed)
# Change the data type of "n_injured" to an integer.
gv$n_injured <- as.integer(gv$n_injured)
# Change the variable name of "state" to "State"
colnames(gv)[colnames(gv)=="state"] <- "State"
# Check the data types
str(gv)
## Classes 'tbl_df', 'tbl' and 'data.frame': 53579 obs. of 5 variables:
## $ date : Date, format: "2015-01-01" "2015-01-01" ...
## $ State : chr "Oklahoma" "Louisiana" "Mississippi" "Alabama" ...
## $ n_killed : int 0 1 0 0 0 0 0 1 0 0 ...
## $ n_injured : int 2 0 0 2 1 1 2 0 1 0 ...
## $ participant_gender: chr "0::Female||1::Male" "0::Male" "0::Male" "0::Male||1::Male" ...
# Subset for variables "State" and "POVALL_2015", taking all observations other than the first one as it is a total for the entire United States.
pe <- PovertyEstimates[2:3194 , c("State", "POVALL_2015")]
# Group the data by state taking the sum of the number of people in poverty for each state in 2015
pe <- pe %>%
group_by(State) %>%
summarise(POVALL_2015 = sum(POVALL_2015))
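One thing worth checking before relying on these sums: the sheet has 3194 rows, and if it holds state-level total rows alongside the county rows (an assumption about the file layout, not something shown in the output above), a plain sum by State would count each person twice. The following is a minimal sketch on made-up numbers, assuming such total rows can be recognised by a FIPStxt code ending in 000; the toy values and the object names are hypothetical.

```r
library(dplyr)

# Toy stand-in for the poverty sheet (hypothetical values, not the real data).
toy <- data.frame(
  FIPStxt     = c("00000", "01000", "01001", "01003", "02000", "02013"),
  State       = c("US", "AL", "AL", "AL", "AK", "AK"),
  POVALL_2015 = c(30, 20, 12, 8, 10, 10),
  stringsAsFactors = FALSE
)

# Assumption: national and state total rows carry a FIPS code ending in "000".
is_total <- grepl("000$", toy$FIPStxt)

# Sum over county rows only.
county_sums <- toy[!is_total, ] %>%
  group_by(State) %>%
  summarise(Poverty = sum(POVALL_2015))

# Sum over everything except the national row, as in the step above.
naive_sums <- toy[toy$State != "US", ] %>%
  group_by(State) %>%
  summarise(Poverty = sum(POVALL_2015))
```

In this toy example naive_sums is exactly twice county_sums, so comparing the two on the real sheet would show whether the state figures need to be taken from county rows only.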
# Check the data types
str(pe)
## Classes 'tbl_df', 'tbl' and 'data.frame': 51 obs. of 2 variables:
## $ State : chr "AK" "AL" "AR" "AZ" ...
## $ POVALL_2015: num 149882 1751707 1081460 2318092 11792510 ...
# Change the data type of State to an unordered factor
pe$State <- factor(pe$State, levels = us_states$Abbr)
# Change the data type of "POVALL_2015" to an integer
pe$POVALL_2015 <- as.integer(pe$POVALL_2015)
# Rename the column "POVALL_2015" to "Poverty"
colnames(pe)[colnames(pe)=="POVALL_2015"] <- "Poverty"
# Rename the column "State" to "Abbr"
colnames(pe)[colnames(pe)=="State"] <- "Abbr"
# Confirm the data manipulations
str(pe)
## Classes 'tbl_df', 'tbl' and 'data.frame': 51 obs. of 2 variables:
## $ Abbr : Factor w/ 59 levels "AK","AL","AR",..: 1 2 3 5 6 7 8 9 10 11 ...
## $ Poverty: int 149882 1751707 1081460 2318092 11792510 1228819 735734 226370 231310 6258121 ...
For the data to be tidy, each variable must have its own column, each observation must have its own row, and each value must have its own cell.
The ‘participant_gender’ variable breaks these rules. It contains character data similar to this: 0::Female||1::Female||2::Female||3::Female||4::Male||5::Male
The information captured in the cell is the number of males and females that were participants in the gun violence incident, regardless of whether they were victims or suspects. To tidy this data, I will create two new variables, ‘Male’ and ‘Female’, holding the number of males and females that participated in each incident. I will then remove the variables ‘date’ and ‘participant_gender’ as they are no longer required. At that point the data is tidy.
# This function takes the variable's observation and returns the number of times the specified gender appears in the string. In the event the observation is NA, NA is returned.
gendercount <- function(x, gender) {
y <- ifelse(is.na(x), NA, length(unlist(str_extract_all(x, regex(gender)))))
y
}
# A new variable, 'Male' is created and the number of males participating in the gun violence incident is entered as the observation.
gv <- gv %>% rowwise() %>%
mutate(Male = gendercount(participant_gender, "Male"))
# A new variable, 'Female' is created and the number of females participating in the gun violence incident is entered as the observation.
gv <- gv %>% rowwise() %>%
mutate(Female = gendercount(participant_gender, "Female"))
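The same counts can be obtained without rowwise() or a helper function. This is a minimal sketch using stringr's vectorised str_count(); the example strings imitate the participant_gender layout and the object names pg, male and female are hypothetical.

```r
library(stringr)

# Example strings in the same "index::Gender||index::Gender" layout.
pg <- c("0::Male||1::Male||3::Male||4::Female", "0::Female", NA)

# str_count() works on the whole vector at once and returns NA for NA input.
# Matching is case-sensitive, so "Male" does not match the "male" in "Female".
male   <- str_count(pg, "Male")
female <- str_count(pg, "Female")
```

On the full data this would be a single mutate(Male = str_count(participant_gender, "Male"), Female = str_count(participant_gender, "Female")), which also avoids carrying the rowwise grouping into later steps.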
# Remove 'date' and 'participant_gender'
gv <- gv[ , c("State", "n_killed", "n_injured", "Male", "Female")]
# Check the dataframe.
dim(gv)
## [1] 53579 5
head(gv)
While already having fulfilled this requirement above, I will create a new variable ‘Killed_Injured’ which will be the total of ‘n_killed’ and ‘n_injured’ and then remove ‘n_killed’ and ‘n_injured’ as they will be no longer required.
# Create the variable 'Killed_Injured' from the sum of 'n_killed' and 'n_injured'.
gv <- gv %>% rowwise() %>%
mutate(Killed_Injured = n_killed + n_injured)
# Subset the data set to remove the redundant variables 'n_killed' and 'n_injured'
gv <- gv[ , c("State", "Killed_Injured", "Male", "Female")]
# Check the dataframe.
dim(gv)
## [1] 53579 4
head(gv)
First I will scan each variable of the Gun Violence dataset and then each variable of the Poverty dataset.
The Gun Violence data set had 8479 missing values for ‘Male’ and 8479 missing values for ‘Female’. A check of incomplete cases returned 8479. So wherever there was a missing value for ‘Male’, the same observation also contains a missing value for ‘Female’. The missing values represented 15.8% of the data, so deleting the data is not an option as it is too great a percentage of the total data.
111 of the observations with missing values had more than 2 people killed or injured. This represented 0.2% of the data. This number of entries was a small enough percentage to delete. I deleted those entries.
A much more detailed analysis would be required to replace the missing values accurately. For the purpose of the assignment I will demonstrate the use of the impute() function in the Hmisc package to deal with the remaining missing values. I used the function to replace the missing values with the median, as an integer value was required.
The Poverty data set has one missing value. This value is at index 12. I will remove this observation as there are too many factors involved in creating the observation to rely on some means of extrapolation such as taking the mean or median.
Now that there are no more missing values to deal with, this is an appropriate time to merge the data sets. First I will group and sum all observations for each state in the Gun Violence data set, and then I will merge all data sets. The priority is the gun violence data, so any data not matching the gun violence data can be discarded. To that end I will left join us_states to the gun violence data set on State, then left join the poverty data set to the result on Abbr.
This leaves me with the Gun Violence Poverty Data Set for 2015 by State and meets my initial requirements.
# Scan the Gun Violence data set.
# Scan 'State'
sum(is.na(gv$State))
## [1] 0
# Scan 'Killed_Injured'
sum(is.na(gv$Killed_Injured))
## [1] 0
# Scan 'Male'
sum(is.na(gv$Male))
## [1] 8479
# Scan 'Female'
sum(is.na(gv$Female))
## [1] 8479
# Check to see how many observations had missing values.
sum(!complete.cases(gv))
## [1] 8479
# Check to see how many observations had missing values with more than 2 people killed or injured.
sum(!complete.cases(gv) & gv$Killed_Injured > 2)
## [1] 111
# Delete observations that contained missing values and had more than 2 people killed or injured.
gv <- gv[!(gv$Killed_Injured > 2 & is.na(gv$Male)),]
# Check to see how many observations had missing values with more than 2 people killed or injured.
sum(!complete.cases(gv) & gv$Killed_Injured > 2)
## [1] 0
# Replace the missing values of the male variable with the median value as it needs to be an integer.
gv$Male <- impute(gv$Male, fun = median)
# Replace the missing values of the female variable with the median value as it needs to be an integer.
gv$Female <- impute(gv$Female, fun = median)
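For reference, median imputation can be written in base R as below. This is a minimal sketch on a made-up vector, not the Hmisc implementation. Note that the median of an even number of integers can end in .5, so the sketch rounds it explicitly before filling; the names male, med and filled are hypothetical.

```r
# Made-up participant counts with two missing values.
male <- c(1L, 2L, NA, 1L, 3L, NA, 1L)

# Median of the observed values, rounded so the result stays a whole number.
med <- as.integer(round(median(male, na.rm = TRUE)))

# Fill only the missing positions and leave observed values untouched.
filled <- ifelse(is.na(male), med, male)
```

Every missing entry receives the same value, which is why this approach flattens the distribution when, as here, a sizeable share of the observations is missing.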
# Scan 'Male'
sum(is.na(gv$Male))
## [1] 0
# Scan 'Female'
sum(is.na(gv$Female))
## [1] 0
# Scan the Poverty data set.
# Scan 'Abbr'
sum(is.na(pe$Abbr))
## [1] 0
# Scan 'Poverty'
sum(is.na(pe$Poverty))
## [1] 1
# There is one missing value in 'Poverty'. Identify the value.
which(is.na(pe$Poverty))
## [1] 12
# Remove the observation with the missing value
pe <- pe[-c(which(is.na(pe$Poverty))), ]
# Re-scan for missing values
sum(is.na(pe$Poverty))
## [1] 0
# Group the data by state taking the sum of the remaining variables
gv <- gv %>%
group_by(State) %>%
summarise(Killed_Injured = sum(Killed_Injured), Male = sum(Male), Female = sum(Female))
## Warning: Grouping rowwise data frame strips rowwise nature
# Change the data type of State to an unordered factor
gv$State <- factor(gv$State, levels = us_states$State)
# Check the data types
str(gv)
## Classes 'tbl_df', 'tbl' and 'data.frame': 51 obs. of 4 variables:
## $ State : Factor w/ 59 levels "Alaska","Alabama",..: 2 1 5 3 6 7 8 10 9 11 ...
## $ Killed_Injured: int 944 151 405 397 2752 452 421 258 460 2317 ...
## $ Male : num 1446 507 625 779 4320 ...
## $ Female : num 239 89 75 137 477 131 110 41 35 497 ...
# Merge the gun violence data set with the us states data set and retain the gun violence data set.
us_states$State <- factor(us_states$State, levels = us_states$State)
gv <- left_join(gv, us_states, by = c('State'))
# Merge the gun violence data set with the poverty data set.
gvp <- left_join(gv, pe, by = c('Abbr'), copy = TRUE)
## Warning: Column `Abbr` joining character vector and factor, coercing into
## character vector
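The behaviour of the two left joins can be illustrated on a small example. This is a minimal sketch with made-up values, where the poverty table deliberately lacks one state; the *_toy names are hypothetical. Keeping the join keys as character in every table avoids the coercion warning shown above.

```r
library(dplyr)

# Made-up tables mirroring the shapes of gv, us_states and pe.
gv_toy <- data.frame(State = c("Alabama", "Ohio", "Texas"),
                     Killed_Injured = c(9L, 1L, 7L),
                     stringsAsFactors = FALSE)
states_toy <- data.frame(State = c("Alabama", "Ohio", "Texas"),
                         Abbr = c("AL", "OH", "TX"),
                         stringsAsFactors = FALSE)
pe_toy <- data.frame(Abbr = c("AL", "TX"),
                     Poverty = c(100L, 300L),
                     stringsAsFactors = FALSE)

# Left joins keep every row of the left table; unmatched keys get NA.
joined <- gv_toy %>%
  left_join(states_toy, by = "State") %>%
  left_join(pe_toy, by = "Abbr")

# anti_join() lists the keys that have no partner in the right table.
unmatched <- anti_join(states_toy, pe_toy, by = "Abbr")
```

Running anti_join() before the merge shows in advance which states will end up with a missing Poverty value.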
# Tidy up the data types
# Change the data type of "Male" to an integer.
gvp$Male <- as.integer(gvp$Male)
# Change the data type of "Female" to an integer.
gvp$Female <- as.integer(gvp$Female)
# Change "Abbr" to a factor
gvp$Abbr <- factor(gvp$Abbr, levels = us_states$Abbr)
# Re-arrange the column order
gvp <- gvp[, c(1, 5, 2, 3, 4, 6)]
# Check the gun violence poverty data set.
dim(gvp)
## [1] 51 6
head(gvp)
Prior to checking for outliers I will do one final scan for missing values to ensure there are none. As it turns out, there is one missing value in Poverty. It belongs to the state whose poverty observation was deleted earlier: with no matching row in the poverty data, the left join fills Poverty with NA. As this represents about 2% of the total data (1 of 51 observations), it is small enough to delete. In addition, it is unlikely that a truly representative replacement figure could be generated.
I now perform the Shapiro-Wilk test on each of the numeric variables to test for normality. The p-value in each instance was far smaller than 0.05. Had the variables been normally distributed, I would have used the z-score method. As they are not, I will use Tukey’s method of outlier detection.
I created a boxplot of the ‘Poverty’ variable and detected four outliers. I removed those outliers.
I created a boxplot of the ‘Female’ variable and detected one outlier. I removed that outlier.
I created a boxplot of the ‘Male’ variable and detected one outlier. I removed that outlier.
I created a boxplot of the ‘Killed_Injured’ variable. No outliers were detected.
While the values removed as outliers are most likely correct, they are extreme cases and need to be analysed individually rather than in the context of the other states. Leaving them in would have increased error variance, reduced the power of statistical tests and biased the estimates of any model parameters the data may be used for.
Outliers are usually due to data entry errors, measurement errors, experimental errors, intentional errors, data processing errors or sampling errors. Removing outliers is only reasonable when they make up a small percentage of the overall data. In this instance a total of 12% of the data was removed across all variables, which is too high. However, 8% of that came from the poverty figures. The poverty figures cannot reasonably be extrapolated for various reasons, the main one being the difference in population sizes of the states. So it is safer to remove the observations and conduct an analysis in the context of the remaining states.
# Final check for missing values. (Quality Control Check)
# Final check for outliers. (Quality Control Check)
# Scan 'State'
sum(is.na(gvp$State))
## [1] 0
# Scan 'Abbr'
sum(is.na(gvp$Abbr))
## [1] 0
# Scan 'Killed_Injured'
sum(is.na(gvp$Killed_Injured))
## [1] 0
# Scan 'Male'
sum(is.na(gvp$Male))
## [1] 0
# Scan 'Female'
sum(is.na(gvp$Female))
## [1] 0
# Scan 'Poverty'
sum(is.na(gvp$Poverty))
## [1] 1
# Remove the missing value that slipped through
gvp <- gvp[-c(which(is.na(gvp$Poverty))), ]
# Re-check that the missing value in Poverty has been removed
sum(is.na(gvp$Poverty))
## [1] 0
# Test for normality with the Shapiro-Wilk test
shapiro.test((gvp$Killed_Injured))
##
## Shapiro-Wilk normality test
##
## data: (gvp$Killed_Injured)
## W = 0.84382, p-value = 1.057e-05
shapiro.test((gvp$Male))
##
## Shapiro-Wilk normality test
##
## data: (gvp$Male)
## W = 0.85298, p-value = 1.865e-05
shapiro.test((gvp$Female))
##
## Shapiro-Wilk normality test
##
## data: (gvp$Female)
## W = 0.85916, p-value = 2.765e-05
shapiro.test((gvp$Poverty))
##
## Shapiro-Wilk normality test
##
## data: (gvp$Poverty)
## W = 0.68787, p-value = 5.046e-09
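The four tests above can also be run in one pass over all numeric columns. This is a minimal sketch on simulated data (the toy data frame and its column names are hypothetical), collecting just the p-values.

```r
set.seed(42)

# Simulated stand-in for the merged data: two numeric columns and one label column.
toy <- data.frame(a = rexp(40), b = rnorm(40),
                  label = letters[rep(1:4, 10)],
                  stringsAsFactors = FALSE)

# Keep the numeric columns and apply shapiro.test() to each of them.
num_cols <- toy[sapply(toy, is.numeric)]
p_values <- sapply(num_cols, function(v) shapiro.test(v)$p.value)
```

The result is a named numeric vector, one p-value per variable, which is easier to compare against the 0.05 threshold than four separate printouts.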
# The p-value is far lower than 0.05 for each of the variables, so I reject the hypothesis that the samples come from a normal distribution. Instead of using z-scores to test for outliers I will use Tukey's method, namely detecting any values that lie beyond the outlier fences in a boxplot. The lower fence is Q1 - (1.5 * IQR) and the upper fence is Q3 + (1.5 * IQR), where Q1 and Q3 are the first and third quartiles respectively and IQR is the interquartile range.
par(mfrow=c(1,4))
# Check for outliers in Poverty
boxplot(gvp$Poverty, main="Boxplot of Poverty")
# First I calculate the third quartile
q3 <- unname(quantile(gvp$Poverty, 0.75, na.rm = TRUE)[1])
# Then I calculate the IQR
iqr <- IQR(gvp$Poverty, na.rm = TRUE)
# Now I calculate the upper outlier fence
uof <- q3 + (1.5 * iqr)
# Now I delete any rows with values in the Poverty variable that are greater than the upper outlier fence
for(i in 1:length(gvp$Poverty)){
if(!is.null(gvp$Poverty[i])){
if(!is.na(gvp$Poverty[i])){
if(gvp$Poverty[i] > uof){
print(gvp$Poverty[i])
gvp <- gvp[-i, ]
}
}
}
}
## [1] 11792510
## [1] 6258121
## [1] 5971206
## [1] 8511383
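A caution about the loop pattern used here: removing a row inside a for loop shifts the later rows up by one, so the row immediately after a removed one is never tested. Whether that affected the results in this report cannot be told from the output. The following is a minimal sketch of a vectorised alternative on made-up values, where two outliers sit in adjacent rows; the function name tukey_upper_fence and the toy data are hypothetical.

```r
# Upper Tukey fence: Q3 + 1.5 * IQR, ignoring missing values.
tukey_upper_fence <- function(x) {
  q3 <- unname(quantile(x, 0.75, na.rm = TRUE))
  q3 + 1.5 * IQR(x, na.rm = TRUE)
}

# Made-up data with two adjacent extreme values (rows 5 and 6).
toy <- data.frame(id = 1:8, value = c(10, 12, 11, 13, 95, 99, 12, 11))

fence <- tukey_upper_fence(toy$value)

# Build the keep/drop decision for every row first, then subset once.
keep <- is.na(toy$value) | toy$value <= fence
kept <- toy[keep, ]
```

Because every row is compared with the fence before anything is removed, both extreme rows are dropped, whereas a delete-while-iterating loop would skip the second one.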
# Check for outliers in the 'Female' variable
boxplot(gvp$Female, main="Boxplot of Female")
# First I calculate the third quartile
q3 <- unname(quantile(gvp$Female, 0.75, na.rm = TRUE)[1])
# Then I calculate the IQR
iqr <- IQR(gvp$Female, na.rm = TRUE)
# Now I calculate the upper outlier fence
uof <- q3 + (1.5 * iqr)
# Now I delete any rows with values in the Female variable that are greater than the upper outlier fence
for(i in 1:length(gvp$Female)){
if(!is.null(gvp$Female[i])){
if(!is.na(gvp$Female[i])){
if(gvp$Female[i] > uof){
print(gvp$Female[i])
gvp <- gvp[-i, ]
}
}
}
}
## [1] 712
# Check for outliers in the 'Male' variable
boxplot(gvp$Male, main="Boxplot of Male")
# First I calculate the third quartile
q3 <- unname(quantile(gvp$Male, 0.75, na.rm = TRUE)[1])
# Then I calculate the IQR
iqr <- IQR(gvp$Male, na.rm = TRUE)
# Now I calculate the upper outlier fence
uof <- q3 + (1.5 * iqr)
# Now I delete any rows with values in the Male variable that are greater than the upper outlier fence
for(i in 1:length(gvp$Male)){
if(!is.null(gvp$Male[i])){
if(!is.na(gvp$Male[i])){
if(gvp$Male[i] > uof){
print(gvp$Male[i])
gvp <- gvp[-i, ]
}
}
}
}
## [1] 4919
# Check for outliers in the 'Killed_Injured' variable
boxplot(gvp$Killed_Injured, main="Boxplot of Killed or Injured")
The transformation will be applied to the ‘Killed_Injured’ variable. The purpose of this transformation is to decrease the skewness and bring the distribution closer to a normal distribution. The first plot displays the distribution of the values in a histogram. It is right-skewed, with most values at the lower end of the scale. I tried applying numerous mathematical operations to the data to find a distribution approaching normal. The best result was with a cube root. By trialling further values I settled on a root of 3.61, as shown in the second histogram below. The boxplot of the same data shows that it is still slightly skewed to the right. I tried various mathematical operations on the already transformed data, but only to the detriment of the distribution.
par(mfrow=c(1,3))
hist(gvp$Killed_Injured, main = "Histogram of Killed/Injured", xlab = "No. Killed/Injured")
kiNormal <- gvp$Killed_Injured^(1/3.61)
hist(kiNormal, main = "Transform of Killed/Injured", xlab = "Root(3.61) Killed/Injured")
boxplot(kiNormal, main = "Boxplot of Killed/Injured")
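The trial-and-error search for a root can be made systematic. The sketch below is one possible approach, not the procedure used in this report: it scans a grid of candidate roots and keeps the one whose transformed values give the highest Shapiro-Wilk W statistic. The data are simulated, and the names x, roots, w_stat and best_root are hypothetical.

```r
set.seed(1)

# Simulated right-skewed positive counts standing in for Killed_Injured.
x <- round(rlnorm(45, meanlog = 6, sdlog = 0.8)) + 1

# Candidate roots from 1 (no transformation) to 6 in steps of 0.01.
roots <- seq(1, 6, by = 0.01)

# Shapiro-Wilk W for each transformed version; W closer to 1 is closer to normal.
w_stat <- sapply(roots, function(r) unname(shapiro.test(x^(1 / r))$statistic))

# Root that maximises W on this sample.
best_root <- roots[which.max(w_stat)]
```

Choosing the root by maximising W on the same sample is a descriptive aid only; with around 45 observations, nearby roots will typically give almost identical W values.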