The goal of this assignment is to give you practice in preparing different datasets for downstream analysis work.

Your task is to:

  1. Choose any three of the “wide” datasets identified in the Week 6 Discussion items. (You may use your own dataset; please don’t use my Sample Post dataset, since that was used in your Week 6 assignment!) For each of the three chosen datasets: Create a .CSV file (or optionally, a MySQL database!) that includes all of the information included in the dataset. You’re encouraged to use a “wide” structure similar to how the information appears in the discussion item, so that you can practice tidying and transformations as described below.

  2. Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data. [Most of your grade will be based on this step!] (A generic sketch of this read-and-tidy pattern appears just after the package-loading code below.)

  3. Perform the analysis requested in the discussion item.

library(tidyr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
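
Before turning to the three datasets, here is a minimal sketch of the read-and-tidy pattern from step 2 above. The file name (example_wide.csv) and the column names (id plus one column per measure) are hypothetical and only meant to illustrate the workflow:

# Hypothetical example: read a wide CSV and reshape it to long format
example_wide <- read.csv('example_wide.csv', stringsAsFactors = F)
example_long <- example_wide %>%
  gather(measure, value, -id) %>%   # every non-id column becomes a measure/value pair
  arrange(id, measure)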


1. **NY Adult Arrests** - Mike Silva

Question: How have DWI felonies and misdemeanors changed since 2000, and which areas have been the most problematic?

  1. Load the data set
url <- 'https://raw.githubusercontent.com/hvasquez81/Data607-Project-2/master/Adult_Arrests_by_County___Beginning_1970.csv'
ny_adult_arrests <- read.csv(url(url), stringsAsFactors = F)
head(ny_adult_arrests)
##   County Year Total Felony.Total Drug.Felony Violent.Felony DWI.Felony
## 1 Albany 1970  1226          688          97            191          5
## 2 Albany 1971  1833          829         131            231          6
## 3 Albany 1972  3035         1054         211            256          8
## 4 Albany 1973  3573         1134         244            274         28
## 5 Albany 1974  4255         1329         281            308         17
## 6 Albany 1975  4173         1259         209            344         12
##   Other.Felony Misdemeanor.Total Drug.Misd DWI.Misd Property.Misd
## 1          395               538       207       48            95
## 2          461              1004       204      111           272
## 3          579              1981       285      297           541
## 4          588              2439       369      497           668
## 5          723              2926       437      619           885
## 6          694              2914       398      463           977
##   Other.Misd
## 1        188
## 2        417
## 3        858
## 4        905
## 5        985
## 6       1076
str(ny_adult_arrests)
## 'data.frame':    3055 obs. of  13 variables:
##  $ County           : chr  "Albany" "Albany" "Albany" "Albany" ...
##  $ Year             : int  1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 ...
##  $ Total            : int  1226 1833 3035 3573 4255 4173 4601 4819 5768 6534 ...
##  $ Felony.Total     : int  688 829 1054 1134 1329 1259 1435 1342 1485 1660 ...
##  $ Drug.Felony      : int  97 131 211 244 281 209 201 122 85 144 ...
##  $ Violent.Felony   : int  191 231 256 274 308 344 434 403 433 513 ...
##  $ DWI.Felony       : int  5 6 8 28 17 12 26 45 58 65 ...
##  $ Other.Felony     : int  395 461 579 588 723 694 774 772 909 938 ...
##  $ Misdemeanor.Total: int  538 1004 1981 2439 2926 2914 3166 3477 4283 4874 ...
##  $ Drug.Misd        : int  207 204 285 369 437 398 362 270 157 224 ...
##  $ DWI.Misd         : int  48 111 297 497 619 463 574 858 1540 1846 ...
##  $ Property.Misd    : int  95 272 541 668 885 977 1011 1133 1333 1418 ...
##  $ Other.Misd       : int  188 417 858 905 985 1076 1219 1216 1253 1386 ...
# Looking at years 2000 and over
ny_2000_set <- filter(ny_adult_arrests, Year >= 2000)
# What percentage of misd were DWI
ny_2000_set <- left_join(ny_2000_set, mutate(ny_2000_set, DWI.Misd.Percentage = 100*DWI.Misd/Misdemeanor.Total))
## Joining, by = c("County", "Year", "Total", "Felony.Total", "Drug.Felony", "Violent.Felony", "DWI.Felony", "Other.Felony", "Misdemeanor.Total", "Drug.Misd", "DWI.Misd", "Property.Misd", "Other.Misd")
# What percentage of Felonies were DWI
ny_2000_set <- left_join(ny_2000_set, mutate(ny_2000_set, DWI.Felony.Percentage = 100*DWI.Felony/Felony.Total))
## Joining, by = c("County", "Year", "Total", "Felony.Total", "Drug.Felony", "Violent.Felony", "DWI.Felony", "Other.Felony", "Misdemeanor.Total", "Drug.Misd", "DWI.Misd", "Property.Misd", "Other.Misd", "DWI.Misd.Percentage")
# what percentage of total arrests were DWI related
ny_2000_set <- left_join(ny_2000_set,mutate(ny_2000_set, DWI.Total.Percentage = 100*(DWI.Misd+DWI.Felony)/Total))
## Joining, by = c("County", "Year", "Total", "Felony.Total", "Drug.Felony", "Violent.Felony", "DWI.Felony", "Other.Felony", "Misdemeanor.Total", "Drug.Misd", "DWI.Misd", "Property.Misd", "Other.Misd", "DWI.Misd.Percentage", "DWI.Felony.Percentage")
# Keeping only County, Year and the DWI percentage columns
ny_2000_set <- select(ny_2000_set, County,Year, DWI.Felony.Percentage,DWI.Misd.Percentage,DWI.Total.Percentage)
# Reordering data to show problematic areas
ny_2000_set <- arrange(ny_2000_set, Year, desc(DWI.Total.Percentage))
head(ny_2000_set)
##     County Year DWI.Felony.Percentage DWI.Misd.Percentage
## 1 Hamilton 2000             33.333333            44.11765
## 2   Fulton 2000             14.511041            35.81616
## 3   Putnam 2000              4.814815            34.93151
## 4 Schuyler 2000             38.144330            25.97765
## 5    Lewis 2000             11.458333            33.11688
## 6  Genesee 2000             15.555556            30.71230
##   DWI.Total.Percentage
## 1             42.50000
## 2             31.53895
## 3             28.63777
## 4             28.57143
## 5             27.97030
## 6             27.29226
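
As an aside, each left_join() above joins the data back onto itself by every shared column, so it effectively just adds the new percentage column; the same three columns could be created in a single mutate() call. A sketch of that equivalent pipeline:

# Equivalent to the three left_join()/mutate() steps above, in one pipeline
ny_2000_alt <- ny_adult_arrests %>%
  filter(Year >= 2000) %>%
  mutate(DWI.Misd.Percentage   = 100 * DWI.Misd / Misdemeanor.Total,
         DWI.Felony.Percentage = 100 * DWI.Felony / Felony.Total,
         DWI.Total.Percentage  = 100 * (DWI.Misd + DWI.Felony) / Total) %>%
  select(County, Year, DWI.Felony.Percentage, DWI.Misd.Percentage, DWI.Total.Percentage) %>%
  arrange(Year, desc(DWI.Total.Percentage))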

From the dataset, Hamilton County seems to have a lot of issues with DWI arrests: it has been among the most problematic counties for DWI in almost every year since 2000.
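
One way to check this is to rank the counties within each year and count how often Hamilton lands near the top; a sketch using a top-10 cutoff:

# For each year since 2000, rank counties by DWI.Total.Percentage and
# count the years in which Hamilton falls within the top 10
ny_2000_set %>%
  group_by(Year) %>%
  mutate(dwi_rank = min_rank(desc(DWI.Total.Percentage))) %>%
  ungroup() %>%
  filter(County == "Hamilton", dwi_rank <= 10) %>%
  summarise(years_in_top_10 = n())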

hamilton_data <- filter(ny_2000_set, County == "Hamilton") %>% arrange(DWI.Total.Percentage)
hamilton_data
##      County Year DWI.Felony.Percentage DWI.Misd.Percentage
## 1  Hamilton 2006              5.263158            25.67568
## 2  Hamilton 2007              9.523810            30.43478
## 3  Hamilton 2017             25.000000            32.25806
## 4  Hamilton 2005             16.666667            33.84615
## 5  Hamilton 2015             13.333333            36.53846
## 6  Hamilton 2013              4.761905            42.59259
## 7  Hamilton 2003             21.052632            35.29412
## 8  Hamilton 2004             12.000000            40.27778
## 9  Hamilton 2012             28.125000            40.00000
## 10 Hamilton 2016              0.000000            45.28302
## 11 Hamilton 2008             17.647059            44.00000
## 12 Hamilton 2014             18.181818            43.63636
## 13 Hamilton 2000             33.333333            44.11765
## 14 Hamilton 2010             21.052632            48.68421
## 15 Hamilton 2011             33.333333            49.23077
## 16 Hamilton 2002             25.000000            51.51515
## 17 Hamilton 2001             12.500000            60.71429
## 18 Hamilton 2009             62.500000            60.00000
##    DWI.Total.Percentage
## 1              21.50538
## 2              25.55556
## 3              29.41176
## 4              31.16883
## 5              31.34328
## 6              32.00000
## 7              32.69231
## 8              32.98969
## 9              36.27451
## 10             36.36364
## 11             37.31343
## 12             39.39394
## 13             42.50000
## 14             43.15789
## 15             46.25000
## 16             47.43590
## 17             50.00000
## 18             60.34483

Since 2010, five of the eight years shown (2010 through 2017) rank among Hamilton's ten highest DWI arrest percentages.

ggplot(hamilton_data, aes(Year,DWI.Total.Percentage)) + geom_point()

From the plot above, we can see that DWI arrest percentages were on a downtrend from 2001 to 2007, spiked in 2008-2009, and then started a new downtrend. The pattern almost resembles a sine/cosine wave.
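
A smoother makes that up-and-down pattern easier to judge than the points alone; a sketch using a loess fit (not an actual sine/cosine fit):

# Same scatterplot with a loess trend line overlaid
ggplot(hamilton_data, aes(Year, DWI.Total.Percentage)) + geom_point() + geom_smooth(method = "loess", se = FALSE)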


2. **UN Migrant Stock by Origin and Destination** - Juanelle Marks

Question: Compare the top 5 age groups of migrants to the US in 1990 with those in 2017.

  1. Load the dataset
url <- 'https://raw.githubusercontent.com/hvasquez81/Data607-Project-2/master/UN_MigrantStockByAge_2017.csv'
UN_dataset <- read.csv(url(url), stringsAsFactors = F, header = T)
colnames(UN_dataset) <- c('Year','Order','Destination','Notes','Code','Type','0-4','5-9','10-14',   '15-19','20-24','25-29','30-34','35-39','40-44','45-49','50-54','55-59','60-64','65-69','70-74','75+')

# Convert the 16 age-group columns to numeric; any non-numeric cells are coerced to NA
UN_dataset[,7:22] <- sapply(UN_dataset[,7:22], as.numeric)
## Warning in lapply(X = X, FUN = FUN, ...): NAs introduced by coercion

(The same warning is emitted once for each of the 16 age-group columns.)
# Drop the first 10 rows (not needed for this analysis) and keep a working copy
UN_dataset <- UN_dataset[-c(1:10),]
UN.df <- as.data.frame(UN_dataset)
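
The coercion warnings above flag cells in the age-group columns that are not plain numbers. If the raw UN export also contained thousands separators or ".." missing-value placeholders (an assumption about the file, not verified here), a more explicit cleaning helper could replace the sapply() conversion above; a sketch:

# Hypothetical cleaning helper: strip spaces/commas and treat ".." as NA before converting
clean_count <- function(x) {
  x <- gsub("[ ,]", "", x)   # remove thousands separators, if any
  x[x == ".."] <- NA         # UN-style missing-value placeholder
  as.numeric(x)
}
# usage: UN_dataset[, 7:22] <- lapply(UN_dataset[, 7:22], clean_count)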

To pull the year 1990 with the United States as the destination, we filter the data frame and select the Year, Destination, and age-group columns. We then use the gather() function to reshape the data into long format, and finally arrange it to show the five largest age groups.

UN.df.1990 <- filter(UN.df, Year == '1990', Destination == 'United States of America') %>%
              select(Year,Destination, '0-4':'75+') %>%
              gather(AgeGroup, Migrants, '0-4':'75+') %>%
              arrange(desc(Migrants))

UN.df.1990.top5 <-head(UN.df.1990,5)
UN.df.1990.top5
##   Year              Destination AgeGroup Migrants
## 1 1990 United States of America    30-34  2707615
## 2 1990 United States of America    25-29  2653542
## 3 1990 United States of America    35-39  2371781
## 4 1990 United States of America    20-24  2131128
## 5 1990 United States of America    40-44  2078364

Repeat the previous step, but for 2017.

UN.df.2017 <- filter(UN.df, Year == '2017', Destination == 'United States of America') %>%
              select(Year,Destination, '0-4':'75+') %>%
              gather(AgeGroup, Migrants, '0-4':'75+') %>%
              arrange(desc(Migrants))

UN.df.2017.top5 <-head(UN.df.2017,5)
UN.df.2017.top5
##   Year              Destination AgeGroup Migrants
## 1 2017 United States of America    40-44  5271032
## 2 2017 United States of America    35-39  5158987
## 3 2017 United States of America    45-49  5049351
## 4 2017 United States of America    30-34  4733655
## 5 2017 United States of America    50-54  4541663
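
Since the 1990 and 2017 steps differ only in the year, they could also be wrapped in a small helper function (top5_us_by_age() is a made-up name); a sketch:

# Hypothetical helper: top 5 age groups of migrants to the US for a given year
top5_us_by_age <- function(df, yr) {
  df %>%
    filter(Year == yr, Destination == 'United States of America') %>%
    select(Year, Destination, '0-4':'75+') %>%
    gather(AgeGroup, Migrants, '0-4':'75+') %>%
    arrange(desc(Migrants)) %>%
    head(5)
}
# usage: rbind(top5_us_by_age(UN.df, '1990'), top5_us_by_age(UN.df, '2017'))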

Combine the 2 datasets

migrant.us.1990_2017 <- rbind(UN.df.1990.top5,UN.df.2017.top5)
migrant.us.1990_2017
##    Year              Destination AgeGroup Migrants
## 1  1990 United States of America    30-34  2707615
## 2  1990 United States of America    25-29  2653542
## 3  1990 United States of America    35-39  2371781
## 4  1990 United States of America    20-24  2131128
## 5  1990 United States of America    40-44  2078364
## 6  2017 United States of America    40-44  5271032
## 7  2017 United States of America    35-39  5158987
## 8  2017 United States of America    45-49  5049351
## 9  2017 United States of America    30-34  4733655
## 10 2017 United States of America    50-54  4541663
ggplot(data = migrant.us.1990_2017, aes(x = AgeGroup,y = Migrants, fill = Year)) + geom_bar(position = 'stack',stat = 'identity')

Based on the graph above, 1990 had a larger share of younger migrants than 2017. In 2017 the 20-24 and 25-29 age groups are no longer among the top five, while in 1990 the 45-49 and 50-54 groups were not major contributors. The overall trend is that older age groups have become more dominant among migrants to the US in 2017, compared with the younger age groups that led in 1990.
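
A dodged (side-by-side) layout can make the 1990-vs-2017 comparison a little easier to read than stacking; a sketch:

# One bar per year within each age group, instead of stacking
ggplot(migrant.us.1990_2017, aes(x = AgeGroup, y = Migrants, fill = factor(Year))) + geom_bar(position = 'dodge', stat = 'identity')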


3. **Airline Safety** - Ravi Itwaru

Question: Find out which airlines had the worst record on fatalities and incidents. Note: I read a book called 'Outliers', and one of its chapters discusses how the region in which a pilot was born can help determine whether they will be a good pilot or not. The assumption is that countries with very high PDI (Power Distance Index) numbers will generally have captains who are more susceptible to accidents. I hypothesize that the airlines with the most accidents are based in countries with high PDIs. The research focused on Korea in the 80s and 90s, so I will use that period in this dataset as well, comparing the 10 airlines with the most fatalities and the 10 with the fewest.

Here is the website where the PDI numbers are located: http://clearlycultural.com/geert-hofstede-cultural-dimensions/power-distance-index/

  1. Loading data and tidying
url <- 'https://raw.githubusercontent.com/fivethirtyeight/data/master/airline-safety/airline-safety.csv'
airline_safety <- read.csv(url(url), stringsAsFactors = F, header = T)

Let's look at the 1985-1999 period:

# 10 airlines with the most fatalities in 1985-1999 and the 10 with the fewest
airline_safety_85_99 <- airline_safety %>% select(airline,incidents_85_99:fatalities_85_99) %>%
                        arrange(desc(fatalities_85_99))
airline_safety_85_99 <- rbind(head(airline_safety_85_99,10),tail(airline_safety_85_99,10))
# PDI of each airline's home country (looked up manually from the site above),
# in the same row order as airline_safety_85_99
PDI_country <- c(80,54,60,40,77,67,40,80,64,80,68,33,80,40,64,36,31,40,63,35)
airline_safety_85_99 <- cbind(airline_safety_85_99,PDI_country)

airline_safety_85_99
##                  airline incidents_85_99 fatal_accidents_85_99
## 1         China Airlines              12                     6
## 2         Japan Airlines               3                     1
## 3             Korean Air              12                     5
## 4     Delta / Northwest*              24                    12
## 5             Air India*               2                     1
## 6                Avianca               5                     3
## 7  United / Continental*              19                     8
## 8          Saudi Arabian               7                     2
## 9           Thai Airways               8                     4
## 10              Egyptair               8                     3
## 47       Cathay Pacific*               0                     0
## 48               Finnair               1                     0
## 49              Gulf Air               1                     0
## 50     Hawaiian Airlines               0                     0
## 51         Kenya Airways               2                     0
## 52               Qantas*               1                     0
## 53                  SAS*               5                     0
## 54    Southwest Airlines               1                     0
## 55    TAP - Air Portugal               0                     0
## 56       Virgin Atlantic               1                     0
##    fatalities_85_99 PDI_country
## 1               535          80
## 2               520          54
## 3               425          60
## 4               407          40
## 5               329          77
## 6               323          67
## 7               319          40
## 8               313          80
## 9               308          64
## 10              282          80
## 47                0          68
## 48                0          33
## 49                0          80
## 50                0          40
## 51                0          64
## 52                0          36
## 53                0          31
## 54                0          40
## 55                0          63
## 56                0          35
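
As a design note, instead of cbind() with a positional vector, the PDI values could be attached by joining a small lookup table keyed on airline, which is less fragile if the row order ever changes. A sketch with just the first few airlines (values copied from the table above):

# Hypothetical lookup table: PDI of each airline's home country (first rows only)
pdi_lookup <- data.frame(airline = c('China Airlines', 'Japan Airlines', 'Korean Air'),
                         PDI_country = c(80, 54, 60),
                         stringsAsFactors = F)
# left_join() keeps every airline row and fills in PDI where the lookup has a match
left_join(select(airline_safety_85_99, airline:fatalities_85_99), pdi_lookup, by = 'airline')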

Let's plot fatalities vs. PDI:

ggplot(airline_safety_85_99, aes(x = PDI_country, y = fatalities_85_99)) + geom_point()

Let's plot incidents vs. PDI:

ggplot(airline_safety_85_99, aes(x = PDI_country, y = incidents_85_99)) + geom_point()

Generally speaking, from the two graphs above we see that pilots at airlines based in countries with higher PDIs were more susceptible to accidents involving fatalities, and more likely to be involved in an incident, between 1985 and 1999. This is more noticeable in the second chart, with the exception of the US-based airlines, which had a large number of incidents despite a lower PDI. There are also a couple of airlines with rather high PDIs that happened to have zero incidents and zero fatalities. Based on this subset, I would suggest there is a slight trend: airlines located in countries with higher PDIs are more susceptible to incidents than those with lower PDIs.
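
To put a rough number on that "slight trend", the correlation between PDI and each outcome could be computed on this 20-airline subset; a sketch (no claim here about what the values turn out to be):

# Correlation between country PDI and the two outcomes, on the 1985-1999 subset
cor(airline_safety_85_99$PDI_country, airline_safety_85_99$fatalities_85_99)
cor(airline_safety_85_99$PDI_country, airline_safety_85_99$incidents_85_99)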