The purpose of this project is to gain valuable hand on experience tidying different data sets for the purposes of further downstream data analysis. The project will involve tidying 3 csv files and then analyzing said files.
Data Set 1
The first table comes from the United States Census Bureau. It contains mortality rates for different age groups for different countries that are grouped by sex. This table is in a wide format with column titles splitting the data into age categories and column subtitles splitting each factor further into genders.
#Load the data#MR = Mortality RateMR <-read.csv("https://raw.githubusercontent.com/gc521/DATA-607-Data-Acquisition-Mangement/Project-2/export.csv", skip =1)install.packages("dplyr")
We start the tidying process by dropping a few variables
MR$Row <-NULL#Drop FIPS and GENC columns, as they are provide redundant and unwanted dataMR$FIPS <-NULLMR$GENC <-NULL#Specify which values belong to infants, child under 5, Life Expectancy at Birth (LEAB)MR <- MR %>%rename_("Year"="Year", "Country"="Country.Area.Name", "Both.Sexes.InfantMR"="Both.Sexes", 'Male.InfantMR'='Male', 'Female.InfantMR'='Female', 'Both.Sexes.ChildMR'='Both.Sexes.1', 'Male.ChildMR'='Male.1', 'Female.ChildMR'='Female.1', 'Both.Sexes.Under5'='Both.Sexes.2', 'Male.Under5'='Male.2', 'Female.Under5'='Female.2', 'Both.Sexes.LEAB'='Both.Sexes.3', 'Male.LEAB'='Male.3', 'Female.LEAB'='Female.3')melt_data <-melt(MR, id =c("Country","Year"), variable_name =c("Gender"))
With the data now melted, we can begin basic exploratory data analysis. We can use the dplyr for analysis
We start by averaging the different mortality rates, regardless of year/country
The added column allows to get a sense of which countries had the highest/lowest mortality rates for each gender group.
Another analysis we can do is to rank the countries based on mortality rates of the different age groups
#First covert country to a factor from a character\MR$Country <-as.factor(MR$Country)df <- MR %>%group_by(Country) %>%summarize(mean_size =mean(value, na.rm =TRUE)) %>%arrange(mean_size, .by_group =FALSE)
These query allows us to get a sense of the countries with the best / worst mortality metrics
Data Set 2
The next data set comes from the CDC and it documents earning for health care professions.
#HD == Healthcare dataHD <-read.csv("https://raw.githubusercontent.com/gc521/DATA-607-Data-Acquisition-Mangement/Project-2/hcempl.csv", skip =4)#Some of these rows are incomplete/contain missing values and should be droppedHD <- HD[-1,]HD <- HD[-29,]HD <- HD[-30,]HD <- HD[HD$Occupation.title !="Clinical laboratory technologists and technicians", ]HD <- HD[HD$Occupation.title !="Medical dosimetrists, medical records specialists, and health technologists and technicians, all other" , ]HD <- HD[HD$Occupation.title !="Magnetic resonance imaging technologists\\3" , ]HD <- HD[HD$Occupation.title !="Orderlies\\5", ]HD <- HD[HD$Occupation.title !="Nurse anesthetists\\4", ]HD <- HD[HD$Occupation.title !="Nurse midwives\\4", ]HD <- HD[HD$Occupation.title !="Nurse practitioners\\4", ]HD <- HD |>rename_('Empoyment.Num.2000'='X2000', 'Empoyment.Num.2005'='X2005', 'Empoyment.Num.2009'='X2009', 'Empoyment.Num.2010'='X2010', 'Empoyment.Num.2015'='X2015', 'Empoyment.Num.2016'='X2016', 'Empoyment.Num.2020'='X2020', 'Hourly.Wage.2000'='X2000.1', 'Hourly.Wage.2005'='X2005.1', 'Hourly.Wage.2009'='X2009.1', 'Hourly.Wage.2010'='X2010.1','Hourly.Wage.2015'='X2015.1', 'Hourly.Wage.2016'='X2016.1', 'Hourly.Wage.2020'='X2020.1')
Warning: `rename_()` was deprecated in dplyr 0.7.0.
ℹ Please use `rename()` instead.
HD <- HD |>rename_('EmpoymentNum.2000'='Empoyment.Num.2000', 'EmpoymentNum.2005'='Empoyment.Num.2005', 'EmpoymentNum.2009'='Empoyment.Num.2009', 'EmpoymentNum.2010'='Empoyment.Num.2010', 'EmpoymentNum.2015'='Empoyment.Num.2015', 'EmpoymentNum.2016'='Empoyment.Num.2016', 'EmpoymentNum.2020'='Empoyment.Num.2020', 'HourlyWage.2000'='Hourly.Wage.2000', 'HourlyWage.2005'='Hourly.Wage.2005', 'HourlyWage.2009'='Hourly.Wage.2009', 'HourlyWage.2010'='Hourly.Wage.2010', 'HourlyWage.2015'='Hourly.Wage.2015', 'HourlyWage.2016'='Hourly.Wage.2016', 'HourlyWage.2020'='Hourly.Wage.2020')
Warning: `rename_()` was deprecated in dplyr 0.7.0.
ℹ Please use `rename()` instead.