To choose three data sets that are untidy, read it as a CSV file, tidy and transform that data.
Loaded needed packages
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readr)
##Data set 1
In the first data set, the data is about the alcohol consumption in 2010 for different countries. The alcohols are seperated into beers, spirit, and wine. To make the data set tidy I will create a column for the types of alcohols servings and another column with the amount of servings to replace ‘beer_servings’, ‘spirit_servings’, ‘wine_servings’.
Data: https://github.com/fivethirtyeight/data/tree/master/alcohol-consumption
## country beer_servings spirit_servings wine_servings
## 1 Afghanistan 0 0 0
## 2 Albania 89 132 54
## 3 Algeria 25 0 14
## 4 Andorra 245 138 312
## 5 Angola 217 57 45
## 6 Antigua & Barbuda 102 128 45
## total_litres_of_pure_alcohol
## 1 0.0
## 2 4.9
## 3 0.7
## 4 12.4
## 5 5.9
## 6 4.9
Adding primary keys with the country names.
DF1 <- tibble::rowid_to_column(DF1, "ID")
head(DF1)
## ID country beer_servings spirit_servings wine_servings
## 1 1 Afghanistan 0 0 0
## 2 2 Albania 89 132 54
## 3 3 Algeria 25 0 14
## 4 4 Andorra 245 138 312
## 5 5 Angola 217 57 45
## 6 6 Antigua & Barbuda 102 128 45
## total_litres_of_pure_alcohol
## 1 0.0
## 2 4.9
## 3 0.7
## 4 12.4
## 5 5.9
## 6 4.9
Tidying: Made the date set into a data frame and used the pivot longer function to make the data longer and replace the columns ‘beer_servings’, ‘spirit_servings’, ‘wine_servings’, and ‘total_litres_of_pure_alcohol’ with ‘alcohol_type_of_serving’ and ‘serving_amount’.
## ID country alcohol_type_of_serving serving_amount
## 1 1 Afghanistan beer_servings 0
## 2 1 Afghanistan spirit_servings 0
## 3 1 Afghanistan wine_servings 0
## 4 1 Afghanistan total_litres_of_pure_alcohol 0
## 5 2 Albania beer_servings 89
## 6 2 Albania spirit_servings 132
#Make data frame as CSV file and test
write_csv(DF1,file="Alcohol_within_countries.csv")
Test1<-read_csv("Alcohol_within_countries.csv", show_col_types = FALSE)
head(Test1)
## # A tibble: 6 × 4
## ID country alcohol_type_of_serving serving_amount
## <dbl> <chr> <chr> <dbl>
## 1 1 Afghanistan beer_servings 0
## 2 1 Afghanistan spirit_servings 0
## 3 1 Afghanistan wine_servings 0
## 4 1 Afghanistan total_litres_of_pure_alcohol 0
## 5 2 Albania beer_servings 89
## 6 2 Albania spirit_servings 132
Now that the data is tidy and transformed, I will perform my analysis to find the country that has the highest alcohol servings consumed. I transformed the data by sorting the serving amounts by descending order.
DF1_max<-DF1[order(DF1$serving_amount, decreasing= TRUE),]
head(DF1_max)
## ID country alcohol_type_of_serving serving_amount
## 274 69 Grenada spirit_servings 438
## 469 118 Namibia beer_servings 376
## 62 16 Belarus spirit_servings 373
## 247 62 France wine_servings 370
## 181 46 Czech Republic beer_servings 361
## 249 63 Gabon beer_servings 347
DF1_max<-filter(DF1_max,serving_amount>=347)#tranformed the data using filter to only display data for serving amount that are greater than or equal to 347.
ggplot(DF1_max, aes(x=country, y=serving_amount, fill=alcohol_type_of_serving)) +
geom_col(position="dodge")
In conclusion I was able to tidy and transform the data on alcohol consumption from different countries and I was able to get that Grenada has the highest alcohol serving consumption, and it’s mainly in teh type of spirit.
##Data set 2
I will work with the data set is on the death causes in New York since 2017. The data set has two columns for different death rates which can be combined to be a death rate type and their values as rates.
Data from:https://data.cityofnewyork.us/Health/New-York-City-Leading-Causes-of-Death/jb7j-dtam/data_preview
## Year Leading.Cause
## 1 2011 Nephritis, Nephrotic Syndrome and Nephrisis (N00-N07, N17-N19, N25-N27)
## 2 2009 Human Immunodeficiency Virus Disease (HIV: B20-B24)
## 3 2009 Chronic Lower Respiratory Diseases (J40-J47)
## 4 2008 Diseases of Heart (I00-I09, I11, I13, I20-I51)
## 5 2009 Alzheimer's Disease (G30)
## 6 2008 Accidents Except Drug Posioning (V01-X39, X43, X45-X59, Y85-Y86)
## Sex Race.Ethnicity Deaths Death.Rate Age.Adjusted.Death.Rate
## 1 F Black Non-Hispanic 83 7.9 6.9
## 2 F Hispanic 96 8 8.1
## 3 F Hispanic 155 12.9 16
## 4 F Hispanic 1445 122.3 160.7
## 5 F Asian and Pacific Islander 14 2.5 3.6
## 6 F Asian and Pacific Islander 36 6.8 8.5
Rename:Transform data by rename columns, order to use them in R. By replaceing th ‘.’ with underscores.
DF2<-DF2%>%rename("Leading_cause"=Leading.Cause, "Death_rate"=Death.Rate,"Age_adjusted_death_rate"=Age.Adjusted.Death.Rate, "Race_Ethnicity"=Race.Ethnicity)
head(DF2)
## Year Leading_cause
## 1 2011 Nephritis, Nephrotic Syndrome and Nephrisis (N00-N07, N17-N19, N25-N27)
## 2 2009 Human Immunodeficiency Virus Disease (HIV: B20-B24)
## 3 2009 Chronic Lower Respiratory Diseases (J40-J47)
## 4 2008 Diseases of Heart (I00-I09, I11, I13, I20-I51)
## 5 2009 Alzheimer's Disease (G30)
## 6 2008 Accidents Except Drug Posioning (V01-X39, X43, X45-X59, Y85-Y86)
## Sex Race_Ethnicity Deaths Death_rate Age_adjusted_death_rate
## 1 F Black Non-Hispanic 83 7.9 6.9
## 2 F Hispanic 96 8 8.1
## 3 F Hispanic 155 12.9 16
## 4 F Hispanic 1445 122.3 160.7
## 5 F Asian and Pacific Islander 14 2.5 3.6
## 6 F Asian and Pacific Islander 36 6.8 8.5
Reorganize that columns to have the leading cause first for the primary keys.
DF2<-DF2[,c(2,1,3,4,5,6,7)]
head(DF2)
## Leading_cause Year
## 1 Nephritis, Nephrotic Syndrome and Nephrisis (N00-N07, N17-N19, N25-N27) 2011
## 2 Human Immunodeficiency Virus Disease (HIV: B20-B24) 2009
## 3 Chronic Lower Respiratory Diseases (J40-J47) 2009
## 4 Diseases of Heart (I00-I09, I11, I13, I20-I51) 2008
## 5 Alzheimer's Disease (G30) 2009
## 6 Accidents Except Drug Posioning (V01-X39, X43, X45-X59, Y85-Y86) 2008
## Sex Race_Ethnicity Deaths Death_rate Age_adjusted_death_rate
## 1 F Black Non-Hispanic 83 7.9 6.9
## 2 F Hispanic 96 8 8.1
## 3 F Hispanic 155 12.9 16
## 4 F Hispanic 1445 122.3 160.7
## 5 F Asian and Pacific Islander 14 2.5 3.6
## 6 F Asian and Pacific Islander 36 6.8 8.5
Create primary keys
DF2 <- tibble::rowid_to_column(DF2, "ID")
head(DF2)
## ID Leading_cause
## 1 1 Nephritis, Nephrotic Syndrome and Nephrisis (N00-N07, N17-N19, N25-N27)
## 2 2 Human Immunodeficiency Virus Disease (HIV: B20-B24)
## 3 3 Chronic Lower Respiratory Diseases (J40-J47)
## 4 4 Diseases of Heart (I00-I09, I11, I13, I20-I51)
## 5 5 Alzheimer's Disease (G30)
## 6 6 Accidents Except Drug Posioning (V01-X39, X43, X45-X59, Y85-Y86)
## Year Sex Race_Ethnicity Deaths Death_rate Age_adjusted_death_rate
## 1 2011 F Black Non-Hispanic 83 7.9 6.9
## 2 2009 F Hispanic 96 8 8.1
## 3 2009 F Hispanic 155 12.9 16
## 4 2008 F Hispanic 1445 122.3 160.7
## 5 2009 F Asian and Pacific Islander 14 2.5 3.6
## 6 2008 F Asian and Pacific Islander 36 6.8 8.5
DF2[DF2 == '.'] <- NA#Will turn all '.' in NA
Tidy:
DF2<-data.frame(DF2%>%
pivot_longer(c(Death_rate, Age_adjusted_death_rate), names_to = 'Death_rate_type', values_to = 'rate',values_drop_na = TRUE))# values_drop_na turns the missing values implicit when TRUE.
head(DF2)
## ID Leading_cause
## 1 1 Nephritis, Nephrotic Syndrome and Nephrisis (N00-N07, N17-N19, N25-N27)
## 2 1 Nephritis, Nephrotic Syndrome and Nephrisis (N00-N07, N17-N19, N25-N27)
## 3 2 Human Immunodeficiency Virus Disease (HIV: B20-B24)
## 4 2 Human Immunodeficiency Virus Disease (HIV: B20-B24)
## 5 3 Chronic Lower Respiratory Diseases (J40-J47)
## 6 3 Chronic Lower Respiratory Diseases (J40-J47)
## Year Sex Race_Ethnicity Deaths Death_rate_type rate
## 1 2011 F Black Non-Hispanic 83 Death_rate 7.9
## 2 2011 F Black Non-Hispanic 83 Age_adjusted_death_rate 6.9
## 3 2009 F Hispanic 96 Death_rate 8
## 4 2009 F Hispanic 96 Age_adjusted_death_rate 8.1
## 5 2009 F Hispanic 155 Death_rate 12.9
## 6 2009 F Hispanic 155 Age_adjusted_death_rate 16
Making rates and deaths numeric in order to us in further calculations.
DF2$rate<-as.numeric(DF2$rate)
DF2$Deaths<-as.numeric(DF2$Deaths)
Visualization of the death rate distribution, which seems skewed to the right.
ggplot(DF2, aes(rate)) +
geom_histogram(width = 30)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 686 rows containing non-finite outside the scale range
## (`stat_bin()`).
Turned the data frame into a CSV file.
write_csv(DF2,file="Death_in_NY.csv")
Death_NY <- read_csv("Death_in_NY.csv", show_col_types = FALSE)
head(Death_NY)
## # A tibble: 6 × 8
## ID Leading_cause Year Sex Race_Ethnicity Deaths Death_rate_type rate
## <dbl> <chr> <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Nephritis, Neph… 2011 F Black Non-His… 83 Death_rate 7.9
## 2 1 Nephritis, Neph… 2011 F Black Non-His… 83 Age_adjusted_d… 6.9
## 3 2 Human Immunodef… 2009 F Hispanic 96 Death_rate 8
## 4 2 Human Immunodef… 2009 F Hispanic 96 Age_adjusted_d… 8.1
## 5 3 Chronic Lower R… 2009 F Hispanic 155 Death_rate 12.9
## 6 3 Chronic Lower R… 2009 F Hispanic 155 Age_adjusted_d… 16
Now that we have the data tidy, I will analyze how the mean of the death rate to see what is the average of the death rates in NY. I will use the summarise function to transform the data.
DF2_Mean<-DF2%>%
summarise(Mean=mean(rate, na.rm = TRUE))#na.rm set at true to remove missing values to not including into the calculations
DF2_Mean
## Mean
## 1 54.93273
DF2[8,]#selecting row for heart disease in hispanic females
## ID Leading_cause Year Sex Race_Ethnicity
## 8 4 Diseases of Heart (I00-I09, I11, I13, I20-I51) 2008 F Hispanic
## Deaths Death_rate_type rate
## 8 1445 Age_adjusted_death_rate 160.7
In conclusion I was able to determine the average of death rate in NY, which is 54.93273. Comparing the rate of age adjusted death rat for heart disease in femal hispanic in 2018, the death rate is well above the average death rate.
##Data set 3
Is a data set on airline and their crashes, which was used for an article to discuss which airlines people should avoid based on their crashes. For this data set there were many columns for diferent types of crashes to tidy up the data a little I combined the crash types into one column and had their values in a seperate column as the crash counts.
Data: https://github.com/fivethirtyeight/data/tree/master/airline-safety
## airline avail_seat_km_per_week incidents_85_99
## 1 Aer Lingus 320906734 2
## 2 Aeroflot* 1197672318 76
## 3 Aerolineas Argentinas 385803648 6
## 4 Aeromexico* 596871813 3
## 5 Air Canada 1865253802 2
## 6 Air France 3004002661 14
## fatal_accidents_85_99 fatalities_85_99 incidents_00_14 fatal_accidents_00_14
## 1 0 0 0 0
## 2 14 128 6 1
## 3 0 0 1 0
## 4 1 64 5 0
## 5 0 0 2 0
## 6 4 79 6 2
## fatalities_00_14
## 1 0
## 2 88
## 3 0
## 4 0
## 5 0
## 6 337
Adding primary keys with the country names.
DF3 <- tibble::rowid_to_column(DF3, "ID")
head(DF3)
## ID airline avail_seat_km_per_week incidents_85_99
## 1 1 Aer Lingus 320906734 2
## 2 2 Aeroflot* 1197672318 76
## 3 3 Aerolineas Argentinas 385803648 6
## 4 4 Aeromexico* 596871813 3
## 5 5 Air Canada 1865253802 2
## 6 6 Air France 3004002661 14
## fatal_accidents_85_99 fatalities_85_99 incidents_00_14 fatal_accidents_00_14
## 1 0 0 0 0
## 2 14 128 6 1
## 3 0 0 1 0
## 4 1 64 5 0
## 5 0 0 2 0
## 6 4 79 6 2
## fatalities_00_14
## 1 0
## 2 88
## 3 0
## 4 0
## 5 0
## 6 337
Tidying: Made the date set into a data frame and used the pivot longer function to make the data longer and replace the columns ‘incidents_85_99’, ‘fatal_accidents_85_99’, ‘fatalities_85_99’, ‘incidents_00_1’, ‘fatal_accidents_00_14’, and ‘fatalities_00_14’ with ‘crash_type_with_yr_frames’ and the values to ‘crash_count’.
## ID airline avail_seat_km_per_week crash_type_with_yr_frames crash_count
## 1 1 Aer Lingus 320906734 incidents_85_99 2
## 2 1 Aer Lingus 320906734 fatal_accidents_85_99 0
## 3 1 Aer Lingus 320906734 fatalities_85_99 0
## 4 1 Aer Lingus 320906734 incidents_00_14 0
## 5 1 Aer Lingus 320906734 fatal_accidents_00_14 0
## 6 1 Aer Lingus 320906734 fatalities_00_14 0
Turned the data frame into a CSV file.
write_csv(DF3,file="airline_crashes.csv")
airline_crashes <- read_csv("airline_crashes.csv", show_col_types = FALSE)
head(airline_crashes)
## # A tibble: 6 × 5
## ID airline avail_seat_km_per_week crash_type_with_yr_frames crash_count
## <dbl> <chr> <dbl> <chr> <dbl>
## 1 1 Aer Lingus 320906734 incidents_85_99 2
## 2 1 Aer Lingus 320906734 fatal_accidents_85_99 0
## 3 1 Aer Lingus 320906734 fatalities_85_99 0
## 4 1 Aer Lingus 320906734 incidents_00_14 0
## 5 1 Aer Lingus 320906734 fatal_accidents_00_14 0
## 6 1 Aer Lingus 320906734 fatalities_00_14 0
Analysis: Lets see how which ariline had the highest crash count, in which we should avoid.
DF3_max<-DF3[order(DF3$crash_count, decreasing= TRUE),]
head(DF3_max)
## ID airline avail_seat_km_per_week crash_type_with_yr_frames
## 210 35 Malaysia Airlines 1039171244 fatalities_00_14
## 99 17 China Airlines 813216487 fatalities_85_99
## 171 29 Japan Airlines 1574217531 fatalities_85_99
## 189 32 Korean Air 1734522605 fatalities_85_99
## 72 12 American* 5228357340 fatalities_00_14
## 117 20 Delta / Northwest* 6525658894 fatalities_85_99
## crash_count
## 210 537
## 99 535
## 171 520
## 189 425
## 72 416
## 117 407
Malaysia airline seems to be the airline to avoid as it has the highest count of crashs with a count of 537 crashes. Well lets not forget on March 8,2014 flight 370 from Malaysia airline had disappeared and was never found.