This data set is from Shana’s Airline Safety post.
The data is located at: https://raw.githubusercontent.com/fivethirtyeight/data/master/airline-safety/airline-safety.csv
The analysis asked for is to sort and compare the number of airlines and their fatalities given to see any trends there.
The first step is to import the csv file from the website where the data is located and create a dataset out of it.
df <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/airline-safety/airline-safety.csv", header = TRUE)
head(df)
## airline avail_seat_km_per_week incidents_85_99
## 1 Aer Lingus 320906734 2
## 2 Aeroflot* 1197672318 76
## 3 Aerolineas Argentinas 385803648 6
## 4 Aeromexico* 596871813 3
## 5 Air Canada 1865253802 2
## 6 Air France 3004002661 14
## fatal_accidents_85_99 fatalities_85_99 incidents_00_14 fatal_accidents_00_14
## 1 0 0 0 0
## 2 14 128 6 1
## 3 0 0 1 0
## 4 1 64 5 0
## 5 0 0 2 0
## 6 4 79 6 2
## fatalities_00_14
## 1 0
## 2 88
## 3 0
## 4 0
## 5 0
## 6 337
df2 <- as_tibble(df)
head(df2)
## # A tibble: 6 x 8
## airline avail_seat_km_p… incidents_85_99 fatal_accidents… fatalities_85_99
## <fct> <dbl> <int> <int> <int>
## 1 Aer Li… 320906734 2 0 0
## 2 Aerofl… 1197672318 76 14 128
## 3 Aeroli… 385803648 6 0 0
## 4 Aerome… 596871813 3 1 64
## 5 Air Ca… 1865253802 2 0 0
## 6 Air Fr… 3004002661 14 4 79
## # … with 3 more variables: incidents_00_14 <int>, fatal_accidents_00_14 <int>,
## # fatalities_00_14 <int>
I’m creating 3 new columns to aggregate the data. Total Incidents, Total Fatal Accidents and Total Fatalities are the three new columns created with the mutate function.
df3 <- mutate(df2, totalIncidents = (incidents_85_99 + incidents_00_14), totalAccidents = (fatal_accidents_85_99 + fatal_accidents_00_14), totalFatalities = (fatalities_85_99 + fatalities_00_14))
head(df3)
## # A tibble: 6 x 11
## airline avail_seat_km_p… incidents_85_99 fatal_accidents… fatalities_85_99
## <fct> <dbl> <int> <int> <int>
## 1 Aer Li… 320906734 2 0 0
## 2 Aerofl… 1197672318 76 14 128
## 3 Aeroli… 385803648 6 0 0
## 4 Aerome… 596871813 3 1 64
## 5 Air Ca… 1865253802 2 0 0
## 6 Air Fr… 3004002661 14 4 79
## # … with 6 more variables: incidents_00_14 <int>, fatal_accidents_00_14 <int>,
## # fatalities_00_14 <int>, totalIncidents <int>, totalAccidents <int>,
## # totalFatalities <int>
Now that we have the new columns of data, we will now transform it from “wide” to something that is easier the use in R.
df4 <- gather(df3, "Situations", "Situation Count", 3:11)
head(df4, 10)
## # A tibble: 10 x 4
## airline avail_seat_km_per_week Situations `Situation Coun…
## <fct> <dbl> <chr> <int>
## 1 Aer Lingus 320906734 incidents_85_99 2
## 2 Aeroflot* 1197672318 incidents_85_99 76
## 3 Aerolineas Argentinas 385803648 incidents_85_99 6
## 4 Aeromexico* 596871813 incidents_85_99 3
## 5 Air Canada 1865253802 incidents_85_99 2
## 6 Air France 3004002661 incidents_85_99 14
## 7 Air India* 869253552 incidents_85_99 2
## 8 Air New Zealand* 710174817 incidents_85_99 3
## 9 Alaska Airlines* 965346773 incidents_85_99 5
## 10 Alitalia 698012498 incidents_85_99 7
Now that we had the data in a “long” format in R, we’ll look to answer the questions. First we will sort it by airline as requested. Then we’ll pull out the data for Total Fatalities.
df4 %>%
arrange(`airline`)
## # A tibble: 504 x 4
## airline avail_seat_km_per_week Situations `Situation Count`
## <fct> <dbl> <chr> <int>
## 1 Aer Lingus 320906734 incidents_85_99 2
## 2 Aer Lingus 320906734 fatal_accidents_85_99 0
## 3 Aer Lingus 320906734 fatalities_85_99 0
## 4 Aer Lingus 320906734 incidents_00_14 0
## 5 Aer Lingus 320906734 fatal_accidents_00_14 0
## 6 Aer Lingus 320906734 fatalities_00_14 0
## 7 Aer Lingus 320906734 totalIncidents 2
## 8 Aer Lingus 320906734 totalAccidents 0
## 9 Aer Lingus 320906734 totalFatalities 0
## 10 Aeroflot* 1197672318 incidents_85_99 76
## # … with 494 more rows
df4 %>%
filter(`Situations` == "totalFatalities")
## # A tibble: 56 x 4
## airline avail_seat_km_per_week Situations `Situation Coun…
## <fct> <dbl> <chr> <int>
## 1 Aer Lingus 320906734 totalFatalities 0
## 2 Aeroflot* 1197672318 totalFatalities 216
## 3 Aerolineas Argentinas 385803648 totalFatalities 0
## 4 Aeromexico* 596871813 totalFatalities 64
## 5 Air Canada 1865253802 totalFatalities 0
## 6 Air France 3004002661 totalFatalities 416
## 7 Air India* 869253552 totalFatalities 487
## 8 Air New Zealand* 710174817 totalFatalities 7
## 9 Alaska Airlines* 965346773 totalFatalities 88
## 10 Alitalia 698012498 totalFatalities 50
## # … with 46 more rows
Now that we have the Total Fatalities pulled out, we’ll graph the data to see which airlines have alot of fatalities and which airlines have very few fatalities.
df5 <-
select(df4, "airline", "Situation Count") %>%
rename("Fatalities" = "Situation Count",
"Airlines" = "airline")
head(df5)
## # A tibble: 6 x 2
## Airlines Fatalities
## <fct> <int>
## 1 Aer Lingus 2
## 2 Aeroflot* 76
## 3 Aerolineas Argentinas 6
## 4 Aeromexico* 3
## 5 Air Canada 2
## 6 Air France 14
#We have the data but now we'll reorder it so it's hightest fatalities to least fatalities
#df5$Airlines <-
#fct_rev(df5$Airlines)
ggplot(data = df5, aes(y = Airlines, x = Fatalities)) +
geom_line() + theme(axis.text = element_text(color = "grey50", size = 4))
The graph shows that China airlines had the most fatalities by far.
This data set is from Rachel’s “Who eats the food we grow?” post.
The data is located at: https://www.kaggle.com/dorbicycle/world-foodfeed-production https://www.kaggle.com/dorbicycle/world-foodfeed-production/download
But since you have to login to get the data, I have also placed it at: https://github.com/mjgons/DATA607/blob/master/FAO.csv
The question to answer was, “I think the most interesting thing to do would be to figure out if feed has overtaken food in any areas or food items.”
The first step is to import the csv file from the website where the data is located and create a dataset out of it.
data1 <- read.csv("FAO.csv", header = TRUE)
#converting it to a tibble
data2 <- as_tibble(data1)
head(data2)
## # A tibble: 6 x 63
## Area.Abbreviati… Area.Code Area Item.Code Item Element.Code Element Unit
## <fct> <int> <fct> <int> <fct> <int> <fct> <fct>
## 1 AFG 2 Afgh… 2511 Whea… 5142 Food 1000…
## 2 AFG 2 Afgh… 2805 Rice… 5142 Food 1000…
## 3 AFG 2 Afgh… 2513 Barl… 5521 Feed 1000…
## 4 AFG 2 Afgh… 2513 Barl… 5142 Food 1000…
## 5 AFG 2 Afgh… 2514 Maiz… 5521 Feed 1000…
## 6 AFG 2 Afgh… 2514 Maiz… 5142 Food 1000…
## # … with 55 more variables: latitude <dbl>, longitude <dbl>, Y1961 <int>,
## # Y1962 <int>, Y1963 <int>, Y1964 <int>, Y1965 <int>, Y1966 <int>,
## # Y1967 <int>, Y1968 <int>, Y1969 <int>, Y1970 <int>, Y1971 <int>,
## # Y1972 <int>, Y1973 <int>, Y1974 <int>, Y1975 <int>, Y1976 <int>,
## # Y1977 <int>, Y1978 <int>, Y1979 <int>, Y1980 <int>, Y1981 <int>,
## # Y1982 <int>, Y1983 <int>, Y1984 <int>, Y1985 <int>, Y1986 <int>,
## # Y1987 <int>, Y1988 <int>, Y1989 <int>, Y1990 <int>, Y1991 <int>,
## # Y1992 <int>, Y1993 <int>, Y1994 <int>, Y1995 <int>, Y1996 <int>,
## # Y1997 <int>, Y1998 <int>, Y1999 <int>, Y2000 <int>, Y2001 <int>,
## # Y2002 <int>, Y2003 <int>, Y2004 <int>, Y2005 <int>, Y2006 <int>,
## # Y2007 <int>, Y2008 <int>, Y2009 <int>, Y2010 <int>, Y2011 <int>,
## # Y2012 <int>, Y2013 <int>
Now that we have the data, we need to convert it from “wide” to “long” to be able to do analysis on it in R. I’m using the “pivot_longer” function this time instead of the “gather” function.
data3 <-
data2 %>%
pivot_longer(
starts_with("y"),
names_to = "year",
values_to = "tonnes1k"
)
head(data3)
## # A tibble: 6 x 12
## Area.Abbreviati… Area.Code Area Item.Code Item Element.Code Element Unit
## <fct> <int> <fct> <int> <fct> <int> <fct> <fct>
## 1 AFG 2 Afgh… 2511 Whea… 5142 Food 1000…
## 2 AFG 2 Afgh… 2511 Whea… 5142 Food 1000…
## 3 AFG 2 Afgh… 2511 Whea… 5142 Food 1000…
## 4 AFG 2 Afgh… 2511 Whea… 5142 Food 1000…
## 5 AFG 2 Afgh… 2511 Whea… 5142 Food 1000…
## 6 AFG 2 Afgh… 2511 Whea… 5142 Food 1000…
## # … with 4 more variables: latitude <dbl>, longitude <dbl>, year <chr>,
## # tonnes1k <int>
Now we will separate the year to remove the leading “Y” from the year.
data4 <-
data3 %>%
separate(
col = year,
into = c("Y", "year"),
sep = "Y",
convert = TRUE
)
head(data4)
## # A tibble: 6 x 13
## Area.Abbreviati… Area.Code Area Item.Code Item Element.Code Element Unit
## <fct> <int> <fct> <int> <fct> <int> <fct> <fct>
## 1 AFG 2 Afgh… 2511 Whea… 5142 Food 1000…
## 2 AFG 2 Afgh… 2511 Whea… 5142 Food 1000…
## 3 AFG 2 Afgh… 2511 Whea… 5142 Food 1000…
## 4 AFG 2 Afgh… 2511 Whea… 5142 Food 1000…
## 5 AFG 2 Afgh… 2511 Whea… 5142 Food 1000…
## 6 AFG 2 Afgh… 2511 Whea… 5142 Food 1000…
## # … with 5 more variables: latitude <dbl>, longitude <dbl>, Y <lgl>,
## # year <int>, tonnes1k <int>
Let’s now remove columns we don’t need in the dataset
data5 <- data4
data5 <- select(data4, -1:-2)
data5 <- select(data5, -2:-4)
data5 <- select(data5, -3:-5)
data5 <- select(data5, -3)
head(data5)
## # A tibble: 6 x 4
## Area Element year tonnes1k
## <fct> <fct> <int> <int>
## 1 Afghanistan Food 1961 1928
## 2 Afghanistan Food 1962 1904
## 3 Afghanistan Food 1963 1666
## 4 Afghanistan Food 1964 1950
## 5 Afghanistan Food 1965 2001
## 6 Afghanistan Food 1966 1808
summary(data5)
## Area Element year tonnes1k
## Spain : 7950 Feed:209297 Min. :1961 Min. : -246
## Italy : 7844 Food:928984 1st Qu.:1974 1st Qu.: 0
## Germany : 7791 Median :1987 Median : 3
## China, mainland: 7738 Mean :1987 Mean : 371
## Greece : 7579 3rd Qu.:2000 3rd Qu.: 49
## Japan : 7579 Max. :2013 Max. :489299
## (Other) :1091800 NA's :117450
Now, let’s plot Food and Feed and see the overall amount produced by year.
ggplot(data5, aes(x = year, y = tonnes1k)) +
geom_line()
Now lets look at a specific country
data6 <- data5
data6 <- filter(data6, Area == "Afghanistan", Element == "Feed")
ggplot(data6, aes(x = year, y = tonnes1k)) +
geom_line()
data7 <- data5
data7 <- filter(data7, Area == "Afghanistan", Element == "Food")
ggplot(data7, aes(x = year, y = tonnes1k)) +
geom_line()
We can see for Afghanistan that in no year did Feed overtake Food.
#Dataset 3
This dataset is based on post by Richard on Broadband Services within NYS.
The dataset is located at: https://data.ny.gov/api/views/sjc6-ftj4/rows.csv
The analysis asked to look at the municipalities that have lower rates of broadband and the number of housing units affected.
The first step is to import the csv file from the website where the data is located and create a dataset out of it.
nys1 <- read.csv("https://data.ny.gov/api/views/sjc6-ftj4/rows.csv", header = TRUE)
head(nys1)
## GNIS.ID Municipality.Name Municipality.Type X2010.Muni.Population
## 1 942159 Adams Village 1775
## 2 978655 Adams Town 5143
## 3 942177 Addison Village 1763
## 4 978656 Addison Town 2595
## 5 942198 Afton Village 822
## 6 978657 Afton Town 2851
## X2010.Muni.Housing.Units Muni.Area..sq.mi. County REDC.Region
## 1 793 1.5 Jefferson North Country Region
## 2 2126 42.4 Jefferson North Country Region
## 3 770 1.9 Steuben Southern Tier Region
## 4 1159 25.7 Steuben Southern Tier Region
## 5 430 1.7 Chenango Southern Tier Region
## 6 1457 46.7 Chenango Southern Tier Region
## X..Cable.Providers X..Hse.Units.Cable X..Hse.Units.Cable.1
## 1 1 760 96
## 2 1 1800 85
## 3 1 740 96
## 4 1 950 82
## 5 1 410 95
## 6 2 1100 75
## X..of.DSL.Providers X..Hse.Units.DSL X..Hse.Units.DSL.1 X..Fiber.Providers
## 1 1 760 96 1
## 2 1 2000 94 1
## 3 1 740 96 1
## 4 1 1000 86 1
## 5 1 410 95 0
## 6 2 1400 96 0
## X..Hse.Units.Fiber X..Hse.Units.Fiber.1 X..Wireline.Providers
## 1 0 0 3
## 2 0 0 3
## 3 0 0 3
## 4 0 0 3
## 5 0 0 2
## 6 0 0 4
## X..Hse.Units.Wireline X..Hse.Units.Wireline.1 X..Wireless.Providers
## 1 760 96 3
## 2 2000 94 4
## 3 740 96 2
## 4 1000 86 3
## 5 410 95 4
## 6 1400 96 6
## X..Hse.Units.Wireless X..Hse.Units.Wireless.1 X..Satellite.Providers
## 1 760 96 4
## 2 2000 94 4
## 3 740 96 4
## 4 1100 95 4
## 5 410 95 4
## 6 1400 96 4
nys2 <- as_tibble(nys1)
head(nys2)
## # A tibble: 6 x 24
## GNIS.ID Municipality.Na… Municipality.Ty… X2010.Muni.Popu… X2010.Muni.Hous…
## <int> <fct> <fct> <int> <int>
## 1 942159 Adams Village 1775 793
## 2 978655 Adams Town 5143 2126
## 3 942177 Addison Village 1763 770
## 4 978656 Addison Town 2595 1159
## 5 942198 Afton Village 822 430
## 6 978657 Afton Town 2851 1457
## # … with 19 more variables: Muni.Area..sq.mi. <dbl>, County <fct>,
## # REDC.Region <fct>, X..Cable.Providers <int>, X..Hse.Units.Cable <int>,
## # X..Hse.Units.Cable.1 <int>, X..of.DSL.Providers <int>,
## # X..Hse.Units.DSL <int>, X..Hse.Units.DSL.1 <int>, X..Fiber.Providers <int>,
## # X..Hse.Units.Fiber <int>, X..Hse.Units.Fiber.1 <int>,
## # X..Wireline.Providers <int>, X..Hse.Units.Wireline <int>,
## # X..Hse.Units.Wireline.1 <int>, X..Wireless.Providers <int>,
## # X..Hse.Units.Wireless <int>, X..Hse.Units.Wireless.1 <int>,
## # X..Satellite.Providers <int>
Now we’ll remove the columns not needed.
nys3 <- nys2
nys3 <- select(nys3, (-24))
nys3 <- select(nys3, (-23))
nys3 <- select(nys3, (-21))
nys3 <- select(nys3, (-20))
nys3 <- select(nys3, (-18))
nys3 <- select(nys3, (-17))
nys3 <- select(nys3, (-15))
nys3 <- select(nys3, (-14))
nys3 <- select(nys3, (-12))
nys3 <- select(nys3, (-11))
nys3 <- select(nys3, (-9))
head(nys3)
## # A tibble: 6 x 13
## GNIS.ID Municipality.Na… Municipality.Ty… X2010.Muni.Popu… X2010.Muni.Hous…
## <int> <fct> <fct> <int> <int>
## 1 942159 Adams Village 1775 793
## 2 978655 Adams Town 5143 2126
## 3 942177 Addison Village 1763 770
## 4 978656 Addison Town 2595 1159
## 5 942198 Afton Village 822 430
## 6 978657 Afton Town 2851 1457
## # … with 8 more variables: Muni.Area..sq.mi. <dbl>, County <fct>,
## # REDC.Region <fct>, X..Hse.Units.Cable <int>, X..Hse.Units.DSL <int>,
## # X..Hse.Units.Fiber <int>, X..Hse.Units.Wireline <int>,
## # X..Hse.Units.Wireless <int>
Now we are going to find the maximum number of broadband households in each city/town/village.
nys4 <- nys3
nys4 %>%
rowwise() %>%
mutate(max = max(X..Hse.Units.Cable, X..Hse.Units.DSL, X..Hse.Units.Fiber, X..Hse.Units.Wireline, X..Hse.Units.Wireless))
## # A tibble: 1,635 x 14
## # Rowwise:
## GNIS.ID Municipality.Na… Municipality.Ty… X2010.Muni.Popu… X2010.Muni.Hous…
## <int> <fct> <fct> <int> <int>
## 1 942159 Adams Village 1775 793
## 2 978655 Adams Town 5143 2126
## 3 942177 Addison Village 1763 770
## 4 978656 Addison Town 2595 1159
## 5 942198 Afton Village 822 430
## 6 978657 Afton Town 2851 1457
## 7 979664 Airmont Village 8628 2791
## 8 942224 Akron Village 2868 1382
## 9 978658 Alabama Town 1869 765
## 10 974099 Albany County 304204 137739
## # … with 1,625 more rows, and 9 more variables: Muni.Area..sq.mi. <dbl>,
## # County <fct>, REDC.Region <fct>, X..Hse.Units.Cable <int>,
## # X..Hse.Units.DSL <int>, X..Hse.Units.Fiber <int>,
## # X..Hse.Units.Wireline <int>, X..Hse.Units.Wireless <int>, max <int>
head(nys4)
## # A tibble: 6 x 13
## GNIS.ID Municipality.Na… Municipality.Ty… X2010.Muni.Popu… X2010.Muni.Hous…
## <int> <fct> <fct> <int> <int>
## 1 942159 Adams Village 1775 793
## 2 978655 Adams Town 5143 2126
## 3 942177 Addison Village 1763 770
## 4 978656 Addison Town 2595 1159
## 5 942198 Afton Village 822 430
## 6 978657 Afton Town 2851 1457
## # … with 8 more variables: Muni.Area..sq.mi. <dbl>, County <fct>,
## # REDC.Region <fct>, X..Hse.Units.Cable <int>, X..Hse.Units.DSL <int>,
## # X..Hse.Units.Fiber <int>, X..Hse.Units.Wireline <int>,
## # X..Hse.Units.Wireless <int>
Now that we have the maximum value by area we can now calculate the broadband percentage by area.
#nys5 <- nys4 %>%
# mutate("BBpct" = (`max` / `X2010.Muni.Population`))