The goal of this assignment is to give you practice in preparing different datasets for downstream analysis work.
Your task is to:
Choose any three of the “wide” datasets identified in the Week 6 Discussion items. (You may use your own dataset; please don’t use my Sample Post dataset, since that was used in your Week 6 assignment!) For each of the three chosen datasets: Create a .CSV file (or optionally, a MySQL database!) that includes all of the information included in the dataset. You’re encouraged to use a “wide” structure similar to how the information appears in the discussion item, so that you can practice tidying and transformations as described below.
Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data. [Most of your grade will be based on this step!]
Perform the analysis requested in the discussion item.
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
1.** NY Adult Arrests** - Mike Silva
Questions: How have DWI felonies and misdemeanors changed since 2000, and what have been problematic areas
url <- 'https://raw.githubusercontent.com/hvasquez81/Data607-Project-2/master/Adult_Arrests_by_County___Beginning_1970.csv'
ny_adult_arrests <- read.csv(url(url), stringsAsFactors = F)
head(ny_adult_arrests)
## County Year Total Felony.Total Drug.Felony Violent.Felony DWI.Felony
## 1 Albany 1970 1226 688 97 191 5
## 2 Albany 1971 1833 829 131 231 6
## 3 Albany 1972 3035 1054 211 256 8
## 4 Albany 1973 3573 1134 244 274 28
## 5 Albany 1974 4255 1329 281 308 17
## 6 Albany 1975 4173 1259 209 344 12
## Other.Felony Misdemeanor.Total Drug.Misd DWI.Misd Property.Misd
## 1 395 538 207 48 95
## 2 461 1004 204 111 272
## 3 579 1981 285 297 541
## 4 588 2439 369 497 668
## 5 723 2926 437 619 885
## 6 694 2914 398 463 977
## Other.Misd
## 1 188
## 2 417
## 3 858
## 4 905
## 5 985
## 6 1076
str(ny_adult_arrests)
## 'data.frame': 3055 obs. of 13 variables:
## $ County : chr "Albany" "Albany" "Albany" "Albany" ...
## $ Year : int 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 ...
## $ Total : int 1226 1833 3035 3573 4255 4173 4601 4819 5768 6534 ...
## $ Felony.Total : int 688 829 1054 1134 1329 1259 1435 1342 1485 1660 ...
## $ Drug.Felony : int 97 131 211 244 281 209 201 122 85 144 ...
## $ Violent.Felony : int 191 231 256 274 308 344 434 403 433 513 ...
## $ DWI.Felony : int 5 6 8 28 17 12 26 45 58 65 ...
## $ Other.Felony : int 395 461 579 588 723 694 774 772 909 938 ...
## $ Misdemeanor.Total: int 538 1004 1981 2439 2926 2914 3166 3477 4283 4874 ...
## $ Drug.Misd : int 207 204 285 369 437 398 362 270 157 224 ...
## $ DWI.Misd : int 48 111 297 497 619 463 574 858 1540 1846 ...
## $ Property.Misd : int 95 272 541 668 885 977 1011 1133 1333 1418 ...
## $ Other.Misd : int 188 417 858 905 985 1076 1219 1216 1253 1386 ...
# Looking at years 2000 and over
ny_2000_set <- filter(ny_adult_arrests, Year >= 2000)
# What percentage of misd were DWI
ny_2000_set <- left_join(ny_2000_set, mutate(ny_2000_set, DWI.Misd.Percentage = 100*DWI.Misd/Misdemeanor.Total))
## Joining, by = c("County", "Year", "Total", "Felony.Total", "Drug.Felony", "Violent.Felony", "DWI.Felony", "Other.Felony", "Misdemeanor.Total", "Drug.Misd", "DWI.Misd", "Property.Misd", "Other.Misd")
# What percentage of Felonies were DWI
ny_2000_set <- left_join(ny_2000_set, mutate(ny_2000_set, DWI.Felony.Percentage = 100*DWI.Felony/Felony.Total))
## Joining, by = c("County", "Year", "Total", "Felony.Total", "Drug.Felony", "Violent.Felony", "DWI.Felony", "Other.Felony", "Misdemeanor.Total", "Drug.Misd", "DWI.Misd", "Property.Misd", "Other.Misd", "DWI.Misd.Percentage")
# what percentage of total arrests were DWI related
ny_2000_set <- left_join(ny_2000_set,mutate(ny_2000_set, DWI.Total.Percentage = 100*(DWI.Misd+DWI.Felony)/Total))
## Joining, by = c("County", "Year", "Total", "Felony.Total", "Drug.Felony", "Violent.Felony", "DWI.Felony", "Other.Felony", "Misdemeanor.Total", "Drug.Misd", "DWI.Misd", "Property.Misd", "Other.Misd", "DWI.Misd.Percentage", "DWI.Felony.Percentage")
# Turning data to long vs wide
ny_2000_set <- select(ny_2000_set, County,Year, DWI.Felony.Percentage,DWI.Misd.Percentage,DWI.Total.Percentage)
# Reordering data to show problematic areas
ny_2000_set <- arrange(ny_2000_set, Year, desc(DWI.Total.Percentage))
head(ny_2000_set)
## County Year DWI.Felony.Percentage DWI.Misd.Percentage
## 1 Hamilton 2000 33.333333 44.11765
## 2 Fulton 2000 14.511041 35.81616
## 3 Putnam 2000 4.814815 34.93151
## 4 Schuyler 2000 38.144330 25.97765
## 5 Lewis 2000 11.458333 33.11688
## 6 Genesee 2000 15.555556 30.71230
## DWI.Total.Percentage
## 1 42.50000
## 2 31.53895
## 3 28.63777
## 4 28.57143
## 5 27.97030
## 6 27.29226
From the dataset Hamilton county seems to have a lot a issues with DWI arrests. The county is among the top problematic county with DWI almost every year since 2000
hamilton_data <- filter(ny_2000_set, County == "Hamilton") %>% arrange(DWI.Total.Percentage)
hamilton_data
## County Year DWI.Felony.Percentage DWI.Misd.Percentage
## 1 Hamilton 2006 5.263158 25.67568
## 2 Hamilton 2007 9.523810 30.43478
## 3 Hamilton 2017 25.000000 32.25806
## 4 Hamilton 2005 16.666667 33.84615
## 5 Hamilton 2015 13.333333 36.53846
## 6 Hamilton 2013 4.761905 42.59259
## 7 Hamilton 2003 21.052632 35.29412
## 8 Hamilton 2004 12.000000 40.27778
## 9 Hamilton 2012 28.125000 40.00000
## 10 Hamilton 2016 0.000000 45.28302
## 11 Hamilton 2008 17.647059 44.00000
## 12 Hamilton 2014 18.181818 43.63636
## 13 Hamilton 2000 33.333333 44.11765
## 14 Hamilton 2010 21.052632 48.68421
## 15 Hamilton 2011 33.333333 49.23077
## 16 Hamilton 2002 25.000000 51.51515
## 17 Hamilton 2001 12.500000 60.71429
## 18 Hamilton 2009 62.500000 60.00000
## DWI.Total.Percentage
## 1 21.50538
## 2 25.55556
## 3 29.41176
## 4 31.16883
## 5 31.34328
## 6 32.00000
## 7 32.69231
## 8 32.98969
## 9 36.27451
## 10 36.36364
## 11 37.31343
## 12 39.39394
## 13 42.50000
## 14 43.15789
## 15 46.25000
## 16 47.43590
## 17 50.00000
## 18 60.34483
Since 2010, 5 of the 7 years are in the top 10.
ggplot(hamilton_data, aes(Year,DWI.Total.Percentage)) + geom_point()
From the plot above, we can see that DWI were on a downtrend from 2001 to 2007. We see a spike in 2008-2009 before a new downtrend. It almost appears as if the trend resembles a sin/cos wave.
Question: Compare the top 5 age groups of migrants to the US in 1990 to 2017.
url <- 'https://raw.githubusercontent.com/hvasquez81/Data607-Project-2/master/UN_MigrantStockByAge_2017.csv'
UN_dataset <- read.csv(url(url), stringsAsFactors = F, header = T)
colnames(UN_dataset) <- c('Year','Order','Destination','Notes','Code','Type','0-4','5-9','10-14', '15-19','20-24','25-29','30-34','35-39','40-44','45-49','50-54','55-59','60-64','65-69','70-74','75+')
UN_dataset[,7:22] <- sapply(UN_dataset[,7:22], as.factor)
UN_dataset[,7:22] <- sapply(UN_dataset[,7:22], as.numeric)
## Warning in lapply(X = X, FUN = FUN, ...): NAs introduced by coercion
## Warning in lapply(X = X, FUN = FUN, ...): NAs introduced by coercion
## Warning in lapply(X = X, FUN = FUN, ...): NAs introduced by coercion
## Warning in lapply(X = X, FUN = FUN, ...): NAs introduced by coercion
## Warning in lapply(X = X, FUN = FUN, ...): NAs introduced by coercion
## Warning in lapply(X = X, FUN = FUN, ...): NAs introduced by coercion
## Warning in lapply(X = X, FUN = FUN, ...): NAs introduced by coercion
## Warning in lapply(X = X, FUN = FUN, ...): NAs introduced by coercion
## Warning in lapply(X = X, FUN = FUN, ...): NAs introduced by coercion
## Warning in lapply(X = X, FUN = FUN, ...): NAs introduced by coercion
## Warning in lapply(X = X, FUN = FUN, ...): NAs introduced by coercion
## Warning in lapply(X = X, FUN = FUN, ...): NAs introduced by coercion
## Warning in lapply(X = X, FUN = FUN, ...): NAs introduced by coercion
## Warning in lapply(X = X, FUN = FUN, ...): NAs introduced by coercion
## Warning in lapply(X = X, FUN = FUN, ...): NAs introduced by coercion
## Warning in lapply(X = X, FUN = FUN, ...): NAs introduced by coercion
UN_dataset <- UN_dataset[-c(1:10),]
UN.df <- as.data.frame((UN_dataset))
Pull the year 1990 with destination being the US, we need to select the age groups, year and destination for the dataframe. We then use the gather() function to turn the dataset into a long format. Then rearrange and show the top 5 largest age groups.
UN.df.1990 <- filter(UN.df, Year == '1990', Destination == 'United States of America') %>%
select(Year,Destination, '0-4':'75+') %>%
gather(AgeGroup, Migrants, '0-4':'75+') %>%
arrange(desc(Migrants))
UN.df.1990.top5 <-head(UN.df.1990,5)
UN.df.1990.top5
## Year Destination AgeGroup Migrants
## 1 1990 United States of America 30-34 2707615
## 2 1990 United States of America 25-29 2653542
## 3 1990 United States of America 35-39 2371781
## 4 1990 United States of America 20-24 2131128
## 5 1990 United States of America 40-44 2078364
Repeat previous step but for 2017
UN.df.2017 <- filter(UN.df, Year == '2017', Destination == 'United States of America') %>%
select(Year,Destination, '0-4':'75+') %>%
gather(AgeGroup, Migrants, '0-4':'75+') %>%
arrange(desc(Migrants))
UN.df.2017.top5 <-head(UN.df.2017,5)
UN.df.2017.top5
## Year Destination AgeGroup Migrants
## 1 2017 United States of America 40-44 5271032
## 2 2017 United States of America 35-39 5158987
## 3 2017 United States of America 45-49 5049351
## 4 2017 United States of America 30-34 4733655
## 5 2017 United States of America 50-54 4541663
Combine the 2 datasets
migrant.us.1990_2017 <- rbind(UN.df.1990.top5,UN.df.2017.top5)
migrant.us.1990_2017
## Year Destination AgeGroup Migrants
## 1 1990 United States of America 30-34 2707615
## 2 1990 United States of America 25-29 2653542
## 3 1990 United States of America 35-39 2371781
## 4 1990 United States of America 20-24 2131128
## 5 1990 United States of America 40-44 2078364
## 6 2017 United States of America 40-44 5271032
## 7 2017 United States of America 35-39 5158987
## 8 2017 United States of America 45-49 5049351
## 9 2017 United States of America 30-34 4733655
## 10 2017 United States of America 50-54 4541663
ggplot(data = migrant.us.1990_2017, aes(x = AgeGroup,y = Migrants, fill = Year)) + geom_bar(position = 'stack',stat = 'identity')
Based from the graph above, 1990 seemed to have a bigger bulk of younger migrants compared to 2017. We see that in 2017, the age group of 20-24 and 25-29 is not one of the top 5 groups. In 1990 also, ages 45-54 were not a major group of migrants to the US. The trend presented here is that the older groups of migrants to the US are starting to become more dominant in 2017 as opposed to younger age groups in 1990.
Question: Find out which airlines had the worst record on fatalities and incidents. Note- I read a book called ‘Outliers’ and in one of the chapters talks about the areas in which a pilot are born can determine if they will be a good pilot of not. The assumption is that countries that have very high PDI numbers will generally have captains more suceptible to accidents. I hypothesize that the airlines with most accidents occur in the countries that have high PDIs. The research focused on Korea in the 80s and 90s so I will do so with the data set as well using the top 10 and worst 10 airlines by fatalities.
Here is the website where the PDI numbers are located http://clearlycultural.com/geert-hofstede-cultural-dimensions/power-distance-index/
url <- url <- 'https://raw.githubusercontent.com/fivethirtyeight/data/master/airline-safety/airline-safety.csv'
airline_safety <- as.data.frame(read.csv(url(url), stringsAsFactors = F, header = T))
Lets separate the data by year,
# Most fatalities top 10 and least fatal top 10
airline_safety_85_99 <- airline_safety %>% select(airline,incidents_85_99:fatalities_85_99) %>%
arrange(desc(fatalities_85_99))
airline_safety_85_99 <- rbind(head(airline_safety_85_99,10),tail(airline_safety_85_99,10))
PDI_country <- c(80,54,60,40,77,67,40,80,64,80,68,33,80,40,64,36,31,40,63,35)
airline_safety_85_99 <- cbind(airline_safety_85_99,PDI_country)
airline_safety_85_99
## airline incidents_85_99 fatal_accidents_85_99
## 1 China Airlines 12 6
## 2 Japan Airlines 3 1
## 3 Korean Air 12 5
## 4 Delta / Northwest* 24 12
## 5 Air India* 2 1
## 6 Avianca 5 3
## 7 United / Continental* 19 8
## 8 Saudi Arabian 7 2
## 9 Thai Airways 8 4
## 10 Egyptair 8 3
## 47 Cathay Pacific* 0 0
## 48 Finnair 1 0
## 49 Gulf Air 1 0
## 50 Hawaiian Airlines 0 0
## 51 Kenya Airways 2 0
## 52 Qantas* 1 0
## 53 SAS* 5 0
## 54 Southwest Airlines 1 0
## 55 TAP - Air Portugal 0 0
## 56 Virgin Atlantic 1 0
## fatalities_85_99 PDI_country
## 1 535 80
## 2 520 54
## 3 425 60
## 4 407 40
## 5 329 77
## 6 323 67
## 7 319 40
## 8 313 80
## 9 308 64
## 10 282 80
## 47 0 68
## 48 0 33
## 49 0 80
## 50 0 40
## 51 0 64
## 52 0 36
## 53 0 31
## 54 0 40
## 55 0 63
## 56 0 35
Lets plot fatalities vs PDI
ggplot(airline_safety_85_99, aes(x = PDI_country, y = fatalities_85_99)) + geom_point()
Lets plot indidents vs PDI plots
ggplot(airline_safety_85_99, aes(x = PDI_country, y = incidents_85_99)) + geom_point()
Generally speaking from the 2 graphs above we see that pilots in airlines based from countries with higher PDIs are more susceptible to causing fatalities in their accidents as well as more likley to be involved in an incident between 1985-1999 In the second chart it is more noticeable with the exception of the airlines based in the United States having a larged amount of incidents. There are also a couple of airlines when rather high PDIs that happened to have zero incidents and zero fatalities. I would suggest there is a slight trend based off of this current subset that airlines located in countries with higher PDIs are more susceptible to incidents that those with lower PDIs.