Project 2

To choose three data sets that are untidy, read it as a CSV file, tidy and transform that data.

Loaded needed packages

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readr)

##Data set 1

In the first data set, the data is about the alcohol consumption in 2010 for different countries. The alcohols are seperated into beers, spirit, and wine. To make the data set tidy I will create a column for the types of alcohols servings and another column with the amount of servings to replace ‘beer_servings’, ‘spirit_servings’, ‘wine_servings’.

Article: https://fivethirtyeight.com/features/dear-mona-followup-where-do-people-drink-the-most-beer-wine-and-spirits/

Data: https://github.com/fivethirtyeight/data/tree/master/alcohol-consumption

##             country beer_servings spirit_servings wine_servings
## 1       Afghanistan             0               0             0
## 2           Albania            89             132            54
## 3           Algeria            25               0            14
## 4           Andorra           245             138           312
## 5            Angola           217              57            45
## 6 Antigua & Barbuda           102             128            45
##   total_litres_of_pure_alcohol
## 1                          0.0
## 2                          4.9
## 3                          0.7
## 4                         12.4
## 5                          5.9
## 6                          4.9

Adding primary keys with the country names.

DF1 <- tibble::rowid_to_column(DF1, "ID")
head(DF1)
##   ID           country beer_servings spirit_servings wine_servings
## 1  1       Afghanistan             0               0             0
## 2  2           Albania            89             132            54
## 3  3           Algeria            25               0            14
## 4  4           Andorra           245             138           312
## 5  5            Angola           217              57            45
## 6  6 Antigua & Barbuda           102             128            45
##   total_litres_of_pure_alcohol
## 1                          0.0
## 2                          4.9
## 3                          0.7
## 4                         12.4
## 5                          5.9
## 6                          4.9

Tidying: Made the date set into a data frame and used the pivot longer function to make the data longer and replace the columns ‘beer_servings’, ‘spirit_servings’, ‘wine_servings’, and ‘total_litres_of_pure_alcohol’ with ‘alcohol_type_of_serving’ and ‘serving_amount’.

##   ID     country      alcohol_type_of_serving serving_amount
## 1  1 Afghanistan                beer_servings              0
## 2  1 Afghanistan              spirit_servings              0
## 3  1 Afghanistan                wine_servings              0
## 4  1 Afghanistan total_litres_of_pure_alcohol              0
## 5  2     Albania                beer_servings             89
## 6  2     Albania              spirit_servings            132

#Make data frame as CSV file and test

write_csv(DF1,file="Alcohol_within_countries.csv")
Test1<-read_csv("Alcohol_within_countries.csv", show_col_types = FALSE)
head(Test1)
## # A tibble: 6 × 4
##      ID country     alcohol_type_of_serving      serving_amount
##   <dbl> <chr>       <chr>                                 <dbl>
## 1     1 Afghanistan beer_servings                             0
## 2     1 Afghanistan spirit_servings                           0
## 3     1 Afghanistan wine_servings                             0
## 4     1 Afghanistan total_litres_of_pure_alcohol              0
## 5     2 Albania     beer_servings                            89
## 6     2 Albania     spirit_servings                         132

Now that the data is tidy and transformed, I will perform my analysis to find the country that has the highest alcohol servings consumed. I transformed the data by sorting the serving amounts by descending order.

DF1_max<-DF1[order(DF1$serving_amount, decreasing= TRUE),]
head(DF1_max)
##      ID        country alcohol_type_of_serving serving_amount
## 274  69        Grenada         spirit_servings            438
## 469 118        Namibia           beer_servings            376
## 62   16        Belarus         spirit_servings            373
## 247  62         France           wine_servings            370
## 181  46 Czech Republic           beer_servings            361
## 249  63          Gabon           beer_servings            347
DF1_max<-filter(DF1_max,serving_amount>=347)#tranformed the data using filter to only display data for serving amount that are greater than or equal to 347.
ggplot(DF1_max, aes(x=country, y=serving_amount, fill=alcohol_type_of_serving)) +
  geom_col(position="dodge")

In conclusion I was able to tidy and transform the data on alcohol consumption from different countries and I was able to get that Grenada has the highest alcohol serving consumption, and it’s mainly in teh type of spirit.

##Data set 2

I will work with the data set is on the death causes in New York since 2017. The data set has two columns for different death rates which can be combined to be a death rate type and their values as rates.

Data from:https://data.cityofnewyork.us/Health/New-York-City-Leading-Causes-of-Death/jb7j-dtam/data_preview

##   Year                                                           Leading.Cause
## 1 2011 Nephritis, Nephrotic Syndrome and Nephrisis (N00-N07, N17-N19, N25-N27)
## 2 2009                     Human Immunodeficiency Virus Disease (HIV: B20-B24)
## 3 2009                            Chronic Lower Respiratory Diseases (J40-J47)
## 4 2008                          Diseases of Heart (I00-I09, I11, I13, I20-I51)
## 5 2009                                               Alzheimer's Disease (G30)
## 6 2008        Accidents Except Drug Posioning (V01-X39, X43, X45-X59, Y85-Y86)
##   Sex             Race.Ethnicity Deaths Death.Rate Age.Adjusted.Death.Rate
## 1   F         Black Non-Hispanic     83        7.9                     6.9
## 2   F                   Hispanic     96          8                     8.1
## 3   F                   Hispanic    155       12.9                      16
## 4   F                   Hispanic   1445      122.3                   160.7
## 5   F Asian and Pacific Islander     14        2.5                     3.6
## 6   F Asian and Pacific Islander     36        6.8                     8.5

Rename:Transform data by rename columns, order to use them in R. By replaceing th ‘.’ with underscores.

DF2<-DF2%>%rename("Leading_cause"=Leading.Cause, "Death_rate"=Death.Rate,"Age_adjusted_death_rate"=Age.Adjusted.Death.Rate, "Race_Ethnicity"=Race.Ethnicity)

head(DF2)
##   Year                                                           Leading_cause
## 1 2011 Nephritis, Nephrotic Syndrome and Nephrisis (N00-N07, N17-N19, N25-N27)
## 2 2009                     Human Immunodeficiency Virus Disease (HIV: B20-B24)
## 3 2009                            Chronic Lower Respiratory Diseases (J40-J47)
## 4 2008                          Diseases of Heart (I00-I09, I11, I13, I20-I51)
## 5 2009                                               Alzheimer's Disease (G30)
## 6 2008        Accidents Except Drug Posioning (V01-X39, X43, X45-X59, Y85-Y86)
##   Sex             Race_Ethnicity Deaths Death_rate Age_adjusted_death_rate
## 1   F         Black Non-Hispanic     83        7.9                     6.9
## 2   F                   Hispanic     96          8                     8.1
## 3   F                   Hispanic    155       12.9                      16
## 4   F                   Hispanic   1445      122.3                   160.7
## 5   F Asian and Pacific Islander     14        2.5                     3.6
## 6   F Asian and Pacific Islander     36        6.8                     8.5

Reorganize that columns to have the leading cause first for the primary keys.

DF2<-DF2[,c(2,1,3,4,5,6,7)]
head(DF2)
##                                                             Leading_cause Year
## 1 Nephritis, Nephrotic Syndrome and Nephrisis (N00-N07, N17-N19, N25-N27) 2011
## 2                     Human Immunodeficiency Virus Disease (HIV: B20-B24) 2009
## 3                            Chronic Lower Respiratory Diseases (J40-J47) 2009
## 4                          Diseases of Heart (I00-I09, I11, I13, I20-I51) 2008
## 5                                               Alzheimer's Disease (G30) 2009
## 6        Accidents Except Drug Posioning (V01-X39, X43, X45-X59, Y85-Y86) 2008
##   Sex             Race_Ethnicity Deaths Death_rate Age_adjusted_death_rate
## 1   F         Black Non-Hispanic     83        7.9                     6.9
## 2   F                   Hispanic     96          8                     8.1
## 3   F                   Hispanic    155       12.9                      16
## 4   F                   Hispanic   1445      122.3                   160.7
## 5   F Asian and Pacific Islander     14        2.5                     3.6
## 6   F Asian and Pacific Islander     36        6.8                     8.5

Create primary keys

DF2 <- tibble::rowid_to_column(DF2, "ID")
head(DF2)             
##   ID                                                           Leading_cause
## 1  1 Nephritis, Nephrotic Syndrome and Nephrisis (N00-N07, N17-N19, N25-N27)
## 2  2                     Human Immunodeficiency Virus Disease (HIV: B20-B24)
## 3  3                            Chronic Lower Respiratory Diseases (J40-J47)
## 4  4                          Diseases of Heart (I00-I09, I11, I13, I20-I51)
## 5  5                                               Alzheimer's Disease (G30)
## 6  6        Accidents Except Drug Posioning (V01-X39, X43, X45-X59, Y85-Y86)
##   Year Sex             Race_Ethnicity Deaths Death_rate Age_adjusted_death_rate
## 1 2011   F         Black Non-Hispanic     83        7.9                     6.9
## 2 2009   F                   Hispanic     96          8                     8.1
## 3 2009   F                   Hispanic    155       12.9                      16
## 4 2008   F                   Hispanic   1445      122.3                   160.7
## 5 2009   F Asian and Pacific Islander     14        2.5                     3.6
## 6 2008   F Asian and Pacific Islander     36        6.8                     8.5
DF2[DF2 == '.'] <- NA#Will turn all '.' in NA

Tidy:

DF2<-data.frame(DF2%>%
  pivot_longer(c(Death_rate, Age_adjusted_death_rate), names_to = 'Death_rate_type', values_to = 'rate',values_drop_na = TRUE))# values_drop_na turns the missing values implicit when TRUE.
head(DF2)
##   ID                                                           Leading_cause
## 1  1 Nephritis, Nephrotic Syndrome and Nephrisis (N00-N07, N17-N19, N25-N27)
## 2  1 Nephritis, Nephrotic Syndrome and Nephrisis (N00-N07, N17-N19, N25-N27)
## 3  2                     Human Immunodeficiency Virus Disease (HIV: B20-B24)
## 4  2                     Human Immunodeficiency Virus Disease (HIV: B20-B24)
## 5  3                            Chronic Lower Respiratory Diseases (J40-J47)
## 6  3                            Chronic Lower Respiratory Diseases (J40-J47)
##   Year Sex     Race_Ethnicity Deaths         Death_rate_type rate
## 1 2011   F Black Non-Hispanic     83              Death_rate  7.9
## 2 2011   F Black Non-Hispanic     83 Age_adjusted_death_rate  6.9
## 3 2009   F           Hispanic     96              Death_rate    8
## 4 2009   F           Hispanic     96 Age_adjusted_death_rate  8.1
## 5 2009   F           Hispanic    155              Death_rate 12.9
## 6 2009   F           Hispanic    155 Age_adjusted_death_rate   16

Making rates and deaths numeric in order to us in further calculations.

DF2$rate<-as.numeric(DF2$rate)
DF2$Deaths<-as.numeric(DF2$Deaths)

Visualization of the death rate distribution, which seems skewed to the right.

ggplot(DF2, aes(rate)) +
  geom_histogram(width = 30)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 686 rows containing non-finite outside the scale range
## (`stat_bin()`).

Turned the data frame into a CSV file.

write_csv(DF2,file="Death_in_NY.csv")
Death_NY <- read_csv("Death_in_NY.csv", show_col_types = FALSE)
head(Death_NY)
## # A tibble: 6 × 8
##      ID Leading_cause     Year Sex   Race_Ethnicity Deaths Death_rate_type  rate
##   <dbl> <chr>            <dbl> <chr> <chr>           <dbl> <chr>           <dbl>
## 1     1 Nephritis, Neph…  2011 F     Black Non-His…     83 Death_rate        7.9
## 2     1 Nephritis, Neph…  2011 F     Black Non-His…     83 Age_adjusted_d…   6.9
## 3     2 Human Immunodef…  2009 F     Hispanic           96 Death_rate        8  
## 4     2 Human Immunodef…  2009 F     Hispanic           96 Age_adjusted_d…   8.1
## 5     3 Chronic Lower R…  2009 F     Hispanic          155 Death_rate       12.9
## 6     3 Chronic Lower R…  2009 F     Hispanic          155 Age_adjusted_d…  16

Now that we have the data tidy, I will analyze how the mean of the death rate to see what is the average of the death rates in NY. I will use the summarise function to transform the data.

DF2_Mean<-DF2%>%
  summarise(Mean=mean(rate, na.rm = TRUE))#na.rm set at true to remove missing values to not including into the calculations
DF2_Mean
##       Mean
## 1 54.93273
DF2[8,]#selecting row for heart disease in hispanic females
##   ID                                  Leading_cause Year Sex Race_Ethnicity
## 8  4 Diseases of Heart (I00-I09, I11, I13, I20-I51) 2008   F       Hispanic
##   Deaths         Death_rate_type  rate
## 8   1445 Age_adjusted_death_rate 160.7

In conclusion I was able to determine the average of death rate in NY, which is 54.93273. Comparing the rate of age adjusted death rat for heart disease in femal hispanic in 2018, the death rate is well above the average death rate.

##Data set 3

Is a data set on airline and their crashes, which was used for an article to discuss which airlines people should avoid based on their crashes. For this data set there were many columns for diferent types of crashes to tidy up the data a little I combined the crash types into one column and had their values in a seperate column as the crash counts.

Article:https://fivethirtyeight.com/features/should-travelers-avoid-flying-airlines-that-have-had-crashes-in-the-past/

Data: https://github.com/fivethirtyeight/data/tree/master/airline-safety

##                 airline avail_seat_km_per_week incidents_85_99
## 1            Aer Lingus              320906734               2
## 2             Aeroflot*             1197672318              76
## 3 Aerolineas Argentinas              385803648               6
## 4           Aeromexico*              596871813               3
## 5            Air Canada             1865253802               2
## 6            Air France             3004002661              14
##   fatal_accidents_85_99 fatalities_85_99 incidents_00_14 fatal_accidents_00_14
## 1                     0                0               0                     0
## 2                    14              128               6                     1
## 3                     0                0               1                     0
## 4                     1               64               5                     0
## 5                     0                0               2                     0
## 6                     4               79               6                     2
##   fatalities_00_14
## 1                0
## 2               88
## 3                0
## 4                0
## 5                0
## 6              337

Adding primary keys with the country names.

DF3 <- tibble::rowid_to_column(DF3, "ID")
head(DF3)
##   ID               airline avail_seat_km_per_week incidents_85_99
## 1  1            Aer Lingus              320906734               2
## 2  2             Aeroflot*             1197672318              76
## 3  3 Aerolineas Argentinas              385803648               6
## 4  4           Aeromexico*              596871813               3
## 5  5            Air Canada             1865253802               2
## 6  6            Air France             3004002661              14
##   fatal_accidents_85_99 fatalities_85_99 incidents_00_14 fatal_accidents_00_14
## 1                     0                0               0                     0
## 2                    14              128               6                     1
## 3                     0                0               1                     0
## 4                     1               64               5                     0
## 5                     0                0               2                     0
## 6                     4               79               6                     2
##   fatalities_00_14
## 1                0
## 2               88
## 3                0
## 4                0
## 5                0
## 6              337

Tidying: Made the date set into a data frame and used the pivot longer function to make the data longer and replace the columns ‘incidents_85_99’, ‘fatal_accidents_85_99’, ‘fatalities_85_99’, ‘incidents_00_1’, ‘fatal_accidents_00_14’, and ‘fatalities_00_14’ with ‘crash_type_with_yr_frames’ and the values to ‘crash_count’.

##   ID    airline avail_seat_km_per_week crash_type_with_yr_frames crash_count
## 1  1 Aer Lingus              320906734           incidents_85_99           2
## 2  1 Aer Lingus              320906734     fatal_accidents_85_99           0
## 3  1 Aer Lingus              320906734          fatalities_85_99           0
## 4  1 Aer Lingus              320906734           incidents_00_14           0
## 5  1 Aer Lingus              320906734     fatal_accidents_00_14           0
## 6  1 Aer Lingus              320906734          fatalities_00_14           0

Turned the data frame into a CSV file.

write_csv(DF3,file="airline_crashes.csv")
airline_crashes <- read_csv("airline_crashes.csv", show_col_types = FALSE)
head(airline_crashes)
## # A tibble: 6 × 5
##      ID airline    avail_seat_km_per_week crash_type_with_yr_frames crash_count
##   <dbl> <chr>                       <dbl> <chr>                           <dbl>
## 1     1 Aer Lingus              320906734 incidents_85_99                     2
## 2     1 Aer Lingus              320906734 fatal_accidents_85_99               0
## 3     1 Aer Lingus              320906734 fatalities_85_99                    0
## 4     1 Aer Lingus              320906734 incidents_00_14                     0
## 5     1 Aer Lingus              320906734 fatal_accidents_00_14               0
## 6     1 Aer Lingus              320906734 fatalities_00_14                    0

Analysis: Lets see how which ariline had the highest crash count, in which we should avoid.

DF3_max<-DF3[order(DF3$crash_count, decreasing= TRUE),]
head(DF3_max)
##     ID            airline avail_seat_km_per_week crash_type_with_yr_frames
## 210 35  Malaysia Airlines             1039171244          fatalities_00_14
## 99  17     China Airlines              813216487          fatalities_85_99
## 171 29     Japan Airlines             1574217531          fatalities_85_99
## 189 32         Korean Air             1734522605          fatalities_85_99
## 72  12          American*             5228357340          fatalities_00_14
## 117 20 Delta / Northwest*             6525658894          fatalities_85_99
##     crash_count
## 210         537
## 99          535
## 171         520
## 189         425
## 72          416
## 117         407

Malaysia airline seems to be the airline to avoid as it has the highest count of crashs with a count of 537 crashes. Well lets not forget on March 8,2014 flight 370 from Malaysia airline had disappeared and was never found.