Dataset 1

This data set is from Shana’s Airline Safety post.

The data is located at: https://raw.githubusercontent.com/fivethirtyeight/data/master/airline-safety/airline-safety.csv

The analysis asked for is to sort and compare the number of airlines and their fatalities given to see any trends there.

The first step is to import the csv file from the website where the data is located and create a dataset out of it.

df <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/airline-safety/airline-safety.csv", header = TRUE)
head(df)
##                 airline avail_seat_km_per_week incidents_85_99
## 1            Aer Lingus              320906734               2
## 2             Aeroflot*             1197672318              76
## 3 Aerolineas Argentinas              385803648               6
## 4           Aeromexico*              596871813               3
## 5            Air Canada             1865253802               2
## 6            Air France             3004002661              14
##   fatal_accidents_85_99 fatalities_85_99 incidents_00_14 fatal_accidents_00_14
## 1                     0                0               0                     0
## 2                    14              128               6                     1
## 3                     0                0               1                     0
## 4                     1               64               5                     0
## 5                     0                0               2                     0
## 6                     4               79               6                     2
##   fatalities_00_14
## 1                0
## 2               88
## 3                0
## 4                0
## 5                0
## 6              337
df2 <- as_tibble(df)
head(df2)
## # A tibble: 6 x 8
##   airline avail_seat_km_p… incidents_85_99 fatal_accidents… fatalities_85_99
##   <fct>              <dbl>           <int>            <int>            <int>
## 1 Aer Li…        320906734               2                0                0
## 2 Aerofl…       1197672318              76               14              128
## 3 Aeroli…        385803648               6                0                0
## 4 Aerome…        596871813               3                1               64
## 5 Air Ca…       1865253802               2                0                0
## 6 Air Fr…       3004002661              14                4               79
## # … with 3 more variables: incidents_00_14 <int>, fatal_accidents_00_14 <int>,
## #   fatalities_00_14 <int>

Creating new data columns to consolidate the data

I’m creating 3 new columns to aggregate the data. Total Incidents, Total Fatal Accidents and Total Fatalities are the three new columns created with the mutate function.

df3 <- mutate(df2, totalIncidents = (incidents_85_99 + incidents_00_14), totalAccidents = (fatal_accidents_85_99 + fatal_accidents_00_14), totalFatalities = (fatalities_85_99 + fatalities_00_14))
head(df3)
## # A tibble: 6 x 11
##   airline avail_seat_km_p… incidents_85_99 fatal_accidents… fatalities_85_99
##   <fct>              <dbl>           <int>            <int>            <int>
## 1 Aer Li…        320906734               2                0                0
## 2 Aerofl…       1197672318              76               14              128
## 3 Aeroli…        385803648               6                0                0
## 4 Aerome…        596871813               3                1               64
## 5 Air Ca…       1865253802               2                0                0
## 6 Air Fr…       3004002661              14                4               79
## # … with 6 more variables: incidents_00_14 <int>, fatal_accidents_00_14 <int>,
## #   fatalities_00_14 <int>, totalIncidents <int>, totalAccidents <int>,
## #   totalFatalities <int>

Now that we have the new columns of data, we will now transform it from “wide” to something that is easier the use in R.

df4 <- gather(df3, "Situations", "Situation Count", 3:11)
head(df4, 10)
## # A tibble: 10 x 4
##    airline               avail_seat_km_per_week Situations      `Situation Coun…
##    <fct>                                  <dbl> <chr>                      <int>
##  1 Aer Lingus                         320906734 incidents_85_99                2
##  2 Aeroflot*                         1197672318 incidents_85_99               76
##  3 Aerolineas Argentinas              385803648 incidents_85_99                6
##  4 Aeromexico*                        596871813 incidents_85_99                3
##  5 Air Canada                        1865253802 incidents_85_99                2
##  6 Air France                        3004002661 incidents_85_99               14
##  7 Air India*                         869253552 incidents_85_99                2
##  8 Air New Zealand*                   710174817 incidents_85_99                3
##  9 Alaska Airlines*                   965346773 incidents_85_99                5
## 10 Alitalia                           698012498 incidents_85_99                7

Now that we had the data in a “long” format in R, we’ll look to answer the questions. First we will sort it by airline as requested. Then we’ll pull out the data for Total Fatalities.

df4 %>%
  arrange(`airline`)
## # A tibble: 504 x 4
##    airline    avail_seat_km_per_week Situations            `Situation Count`
##    <fct>                       <dbl> <chr>                             <int>
##  1 Aer Lingus              320906734 incidents_85_99                       2
##  2 Aer Lingus              320906734 fatal_accidents_85_99                 0
##  3 Aer Lingus              320906734 fatalities_85_99                      0
##  4 Aer Lingus              320906734 incidents_00_14                       0
##  5 Aer Lingus              320906734 fatal_accidents_00_14                 0
##  6 Aer Lingus              320906734 fatalities_00_14                      0
##  7 Aer Lingus              320906734 totalIncidents                        2
##  8 Aer Lingus              320906734 totalAccidents                        0
##  9 Aer Lingus              320906734 totalFatalities                       0
## 10 Aeroflot*              1197672318 incidents_85_99                      76
## # … with 494 more rows
df4 %>%
  filter(`Situations` == "totalFatalities")
## # A tibble: 56 x 4
##    airline               avail_seat_km_per_week Situations      `Situation Coun…
##    <fct>                                  <dbl> <chr>                      <int>
##  1 Aer Lingus                         320906734 totalFatalities                0
##  2 Aeroflot*                         1197672318 totalFatalities              216
##  3 Aerolineas Argentinas              385803648 totalFatalities                0
##  4 Aeromexico*                        596871813 totalFatalities               64
##  5 Air Canada                        1865253802 totalFatalities                0
##  6 Air France                        3004002661 totalFatalities              416
##  7 Air India*                         869253552 totalFatalities              487
##  8 Air New Zealand*                   710174817 totalFatalities                7
##  9 Alaska Airlines*                   965346773 totalFatalities               88
## 10 Alitalia                           698012498 totalFatalities               50
## # … with 46 more rows

Now that we have the Total Fatalities pulled out, we’ll graph the data to see which airlines have alot of fatalities and which airlines have very few fatalities.

df5 <- 
  select(df4, "airline", "Situation Count") %>%
  rename("Fatalities" = "Situation Count",
         "Airlines" = "airline")

head(df5)
## # A tibble: 6 x 2
##   Airlines              Fatalities
##   <fct>                      <int>
## 1 Aer Lingus                     2
## 2 Aeroflot*                     76
## 3 Aerolineas Argentinas          6
## 4 Aeromexico*                    3
## 5 Air Canada                     2
## 6 Air France                    14
#We have the data but now we'll reorder it so it's hightest fatalities to least fatalities

#df5$Airlines <-
#fct_rev(df5$Airlines)
      

ggplot(data = df5, aes(y = Airlines, x = Fatalities)) +
geom_line() + theme(axis.text = element_text(color = "grey50", size = 4))

The graph shows that China airlines had the most fatalities by far.

Dataset 2

This data set is from Rachel’s “Who eats the food we grow?” post.

The data is located at: https://www.kaggle.com/dorbicycle/world-foodfeed-production https://www.kaggle.com/dorbicycle/world-foodfeed-production/download

But since you have to login to get the data, I have also placed it at: https://github.com/mjgons/DATA607/blob/master/FAO.csv

The question to answer was, “I think the most interesting thing to do would be to figure out if feed has overtaken food in any areas or food items.”

The first step is to import the csv file from the website where the data is located and create a dataset out of it.

data1 <- read.csv("FAO.csv", header = TRUE)
#converting it to a tibble
data2 <- as_tibble(data1)
head(data2)
## # A tibble: 6 x 63
##   Area.Abbreviati… Area.Code Area  Item.Code Item  Element.Code Element Unit 
##   <fct>                <int> <fct>     <int> <fct>        <int> <fct>   <fct>
## 1 AFG                      2 Afgh…      2511 Whea…         5142 Food    1000…
## 2 AFG                      2 Afgh…      2805 Rice…         5142 Food    1000…
## 3 AFG                      2 Afgh…      2513 Barl…         5521 Feed    1000…
## 4 AFG                      2 Afgh…      2513 Barl…         5142 Food    1000…
## 5 AFG                      2 Afgh…      2514 Maiz…         5521 Feed    1000…
## 6 AFG                      2 Afgh…      2514 Maiz…         5142 Food    1000…
## # … with 55 more variables: latitude <dbl>, longitude <dbl>, Y1961 <int>,
## #   Y1962 <int>, Y1963 <int>, Y1964 <int>, Y1965 <int>, Y1966 <int>,
## #   Y1967 <int>, Y1968 <int>, Y1969 <int>, Y1970 <int>, Y1971 <int>,
## #   Y1972 <int>, Y1973 <int>, Y1974 <int>, Y1975 <int>, Y1976 <int>,
## #   Y1977 <int>, Y1978 <int>, Y1979 <int>, Y1980 <int>, Y1981 <int>,
## #   Y1982 <int>, Y1983 <int>, Y1984 <int>, Y1985 <int>, Y1986 <int>,
## #   Y1987 <int>, Y1988 <int>, Y1989 <int>, Y1990 <int>, Y1991 <int>,
## #   Y1992 <int>, Y1993 <int>, Y1994 <int>, Y1995 <int>, Y1996 <int>,
## #   Y1997 <int>, Y1998 <int>, Y1999 <int>, Y2000 <int>, Y2001 <int>,
## #   Y2002 <int>, Y2003 <int>, Y2004 <int>, Y2005 <int>, Y2006 <int>,
## #   Y2007 <int>, Y2008 <int>, Y2009 <int>, Y2010 <int>, Y2011 <int>,
## #   Y2012 <int>, Y2013 <int>

Now that we have the data, we need to convert it from “wide” to “long” to be able to do analysis on it in R. I’m using the “pivot_longer” function this time instead of the “gather” function.

data3 <-
  data2 %>%                    
  pivot_longer(
    starts_with("y"), 
    names_to = "year",  
    values_to = "tonnes1k"             
)
head(data3)
## # A tibble: 6 x 12
##   Area.Abbreviati… Area.Code Area  Item.Code Item  Element.Code Element Unit 
##   <fct>                <int> <fct>     <int> <fct>        <int> <fct>   <fct>
## 1 AFG                      2 Afgh…      2511 Whea…         5142 Food    1000…
## 2 AFG                      2 Afgh…      2511 Whea…         5142 Food    1000…
## 3 AFG                      2 Afgh…      2511 Whea…         5142 Food    1000…
## 4 AFG                      2 Afgh…      2511 Whea…         5142 Food    1000…
## 5 AFG                      2 Afgh…      2511 Whea…         5142 Food    1000…
## 6 AFG                      2 Afgh…      2511 Whea…         5142 Food    1000…
## # … with 4 more variables: latitude <dbl>, longitude <dbl>, year <chr>,
## #   tonnes1k <int>

Now we will separate the year to remove the leading “Y” from the year.

data4 <-
  data3 %>% 
  separate(
    col = year,
    into = c("Y", "year"),
    sep = "Y",
    convert = TRUE
)

head(data4)
## # A tibble: 6 x 13
##   Area.Abbreviati… Area.Code Area  Item.Code Item  Element.Code Element Unit 
##   <fct>                <int> <fct>     <int> <fct>        <int> <fct>   <fct>
## 1 AFG                      2 Afgh…      2511 Whea…         5142 Food    1000…
## 2 AFG                      2 Afgh…      2511 Whea…         5142 Food    1000…
## 3 AFG                      2 Afgh…      2511 Whea…         5142 Food    1000…
## 4 AFG                      2 Afgh…      2511 Whea…         5142 Food    1000…
## 5 AFG                      2 Afgh…      2511 Whea…         5142 Food    1000…
## 6 AFG                      2 Afgh…      2511 Whea…         5142 Food    1000…
## # … with 5 more variables: latitude <dbl>, longitude <dbl>, Y <lgl>,
## #   year <int>, tonnes1k <int>

Let’s now remove columns we don’t need in the dataset

data5 <- data4
data5 <- select(data4, -1:-2)
data5 <- select(data5, -2:-4)           
data5 <- select(data5, -3:-5)
data5 <- select(data5, -3)
head(data5)
## # A tibble: 6 x 4
##   Area        Element  year tonnes1k
##   <fct>       <fct>   <int>    <int>
## 1 Afghanistan Food     1961     1928
## 2 Afghanistan Food     1962     1904
## 3 Afghanistan Food     1963     1666
## 4 Afghanistan Food     1964     1950
## 5 Afghanistan Food     1965     2001
## 6 Afghanistan Food     1966     1808
summary(data5)
##               Area         Element            year         tonnes1k     
##  Spain          :   7950   Feed:209297   Min.   :1961   Min.   :  -246  
##  Italy          :   7844   Food:928984   1st Qu.:1974   1st Qu.:     0  
##  Germany        :   7791                 Median :1987   Median :     3  
##  China, mainland:   7738                 Mean   :1987   Mean   :   371  
##  Greece         :   7579                 3rd Qu.:2000   3rd Qu.:    49  
##  Japan          :   7579                 Max.   :2013   Max.   :489299  
##  (Other)        :1091800                                NA's   :117450

Now, let’s plot Food and Feed and see the overall amount produced by year.

ggplot(data5, aes(x = year, y = tonnes1k)) +
geom_line() 

Now lets look at a specific country

data6 <- data5 
data6 <- filter(data6, Area == "Afghanistan", Element == "Feed")

ggplot(data6, aes(x = year, y = tonnes1k)) +
geom_line() 

data7 <- data5
data7 <- filter(data7, Area == "Afghanistan", Element == "Food")

ggplot(data7, aes(x = year, y = tonnes1k)) +
geom_line() 

We can see for Afghanistan that in no year did Feed overtake Food.

#Dataset 3

This dataset is based on post by Richard on Broadband Services within NYS.

The dataset is located at: https://data.ny.gov/api/views/sjc6-ftj4/rows.csv

The analysis asked to look at the municipalities that have lower rates of broadband and the number of housing units affected.

The first step is to import the csv file from the website where the data is located and create a dataset out of it.

nys1 <- read.csv("https://data.ny.gov/api/views/sjc6-ftj4/rows.csv", header = TRUE)
head(nys1)
##   GNIS.ID Municipality.Name Municipality.Type X2010.Muni.Population
## 1  942159             Adams           Village                  1775
## 2  978655             Adams              Town                  5143
## 3  942177           Addison           Village                  1763
## 4  978656           Addison              Town                  2595
## 5  942198             Afton           Village                   822
## 6  978657             Afton              Town                  2851
##   X2010.Muni.Housing.Units Muni.Area..sq.mi.    County          REDC.Region
## 1                      793               1.5 Jefferson North Country Region
## 2                     2126              42.4 Jefferson North Country Region
## 3                      770               1.9   Steuben Southern Tier Region
## 4                     1159              25.7   Steuben Southern Tier Region
## 5                      430               1.7  Chenango Southern Tier Region
## 6                     1457              46.7  Chenango Southern Tier Region
##   X..Cable.Providers X..Hse.Units.Cable X..Hse.Units.Cable.1
## 1                  1                760                   96
## 2                  1               1800                   85
## 3                  1                740                   96
## 4                  1                950                   82
## 5                  1                410                   95
## 6                  2               1100                   75
##   X..of.DSL.Providers X..Hse.Units.DSL X..Hse.Units.DSL.1 X..Fiber.Providers
## 1                   1              760                 96                  1
## 2                   1             2000                 94                  1
## 3                   1              740                 96                  1
## 4                   1             1000                 86                  1
## 5                   1              410                 95                  0
## 6                   2             1400                 96                  0
##   X..Hse.Units.Fiber X..Hse.Units.Fiber.1 X..Wireline.Providers
## 1                  0                    0                     3
## 2                  0                    0                     3
## 3                  0                    0                     3
## 4                  0                    0                     3
## 5                  0                    0                     2
## 6                  0                    0                     4
##   X..Hse.Units.Wireline X..Hse.Units.Wireline.1 X..Wireless.Providers
## 1                   760                      96                     3
## 2                  2000                      94                     4
## 3                   740                      96                     2
## 4                  1000                      86                     3
## 5                   410                      95                     4
## 6                  1400                      96                     6
##   X..Hse.Units.Wireless X..Hse.Units.Wireless.1 X..Satellite.Providers
## 1                   760                      96                      4
## 2                  2000                      94                      4
## 3                   740                      96                      4
## 4                  1100                      95                      4
## 5                   410                      95                      4
## 6                  1400                      96                      4
nys2 <- as_tibble(nys1)
head(nys2)
## # A tibble: 6 x 24
##   GNIS.ID Municipality.Na… Municipality.Ty… X2010.Muni.Popu… X2010.Muni.Hous…
##     <int> <fct>            <fct>                       <int>            <int>
## 1  942159 Adams            Village                      1775              793
## 2  978655 Adams            Town                         5143             2126
## 3  942177 Addison          Village                      1763              770
## 4  978656 Addison          Town                         2595             1159
## 5  942198 Afton            Village                       822              430
## 6  978657 Afton            Town                         2851             1457
## # … with 19 more variables: Muni.Area..sq.mi. <dbl>, County <fct>,
## #   REDC.Region <fct>, X..Cable.Providers <int>, X..Hse.Units.Cable <int>,
## #   X..Hse.Units.Cable.1 <int>, X..of.DSL.Providers <int>,
## #   X..Hse.Units.DSL <int>, X..Hse.Units.DSL.1 <int>, X..Fiber.Providers <int>,
## #   X..Hse.Units.Fiber <int>, X..Hse.Units.Fiber.1 <int>,
## #   X..Wireline.Providers <int>, X..Hse.Units.Wireline <int>,
## #   X..Hse.Units.Wireline.1 <int>, X..Wireless.Providers <int>,
## #   X..Hse.Units.Wireless <int>, X..Hse.Units.Wireless.1 <int>,
## #   X..Satellite.Providers <int>

Now we’ll remove the columns not needed.

nys3 <- nys2
nys3 <- select(nys3, (-24))
nys3 <- select(nys3, (-23))                              
nys3 <- select(nys3, (-21))  
nys3 <- select(nys3, (-20)) 
nys3 <- select(nys3, (-18)) 
nys3 <- select(nys3, (-17)) 
nys3 <- select(nys3, (-15)) 
nys3 <- select(nys3, (-14)) 
nys3 <- select(nys3, (-12)) 
nys3 <- select(nys3, (-11)) 
nys3 <- select(nys3, (-9)) 
head(nys3)
## # A tibble: 6 x 13
##   GNIS.ID Municipality.Na… Municipality.Ty… X2010.Muni.Popu… X2010.Muni.Hous…
##     <int> <fct>            <fct>                       <int>            <int>
## 1  942159 Adams            Village                      1775              793
## 2  978655 Adams            Town                         5143             2126
## 3  942177 Addison          Village                      1763              770
## 4  978656 Addison          Town                         2595             1159
## 5  942198 Afton            Village                       822              430
## 6  978657 Afton            Town                         2851             1457
## # … with 8 more variables: Muni.Area..sq.mi. <dbl>, County <fct>,
## #   REDC.Region <fct>, X..Hse.Units.Cable <int>, X..Hse.Units.DSL <int>,
## #   X..Hse.Units.Fiber <int>, X..Hse.Units.Wireline <int>,
## #   X..Hse.Units.Wireless <int>

Now we are going to find the maximum number of broadband households in each city/town/village.

nys4 <- nys3
nys4 %>%
  rowwise() %>%
  mutate(max = max(X..Hse.Units.Cable, X..Hse.Units.DSL, X..Hse.Units.Fiber, X..Hse.Units.Wireline, X..Hse.Units.Wireless))
## # A tibble: 1,635 x 14
## # Rowwise: 
##    GNIS.ID Municipality.Na… Municipality.Ty… X2010.Muni.Popu… X2010.Muni.Hous…
##      <int> <fct>            <fct>                       <int>            <int>
##  1  942159 Adams            Village                      1775              793
##  2  978655 Adams            Town                         5143             2126
##  3  942177 Addison          Village                      1763              770
##  4  978656 Addison          Town                         2595             1159
##  5  942198 Afton            Village                       822              430
##  6  978657 Afton            Town                         2851             1457
##  7  979664 Airmont          Village                      8628             2791
##  8  942224 Akron            Village                      2868             1382
##  9  978658 Alabama          Town                         1869              765
## 10  974099 Albany           County                     304204           137739
## # … with 1,625 more rows, and 9 more variables: Muni.Area..sq.mi. <dbl>,
## #   County <fct>, REDC.Region <fct>, X..Hse.Units.Cable <int>,
## #   X..Hse.Units.DSL <int>, X..Hse.Units.Fiber <int>,
## #   X..Hse.Units.Wireline <int>, X..Hse.Units.Wireless <int>, max <int>
head(nys4)
## # A tibble: 6 x 13
##   GNIS.ID Municipality.Na… Municipality.Ty… X2010.Muni.Popu… X2010.Muni.Hous…
##     <int> <fct>            <fct>                       <int>            <int>
## 1  942159 Adams            Village                      1775              793
## 2  978655 Adams            Town                         5143             2126
## 3  942177 Addison          Village                      1763              770
## 4  978656 Addison          Town                         2595             1159
## 5  942198 Afton            Village                       822              430
## 6  978657 Afton            Town                         2851             1457
## # … with 8 more variables: Muni.Area..sq.mi. <dbl>, County <fct>,
## #   REDC.Region <fct>, X..Hse.Units.Cable <int>, X..Hse.Units.DSL <int>,
## #   X..Hse.Units.Fiber <int>, X..Hse.Units.Wireline <int>,
## #   X..Hse.Units.Wireless <int>

Now that we have the maximum value by area we can now calculate the broadband percentage by area.

#nys5 <- nys4 %>%
#   mutate("BBpct" = (`max` / `X2010.Muni.Population`))