Shana Green

DATA 607 - Project 2

Due Date: 10/03/2020

DATA 607 – Project 2

The goal of this assignment is to give you practice in preparing different datasets for downstream analysis work.

Your task is to:

  1. Choose any three of the “wide” datasets identified in the Week 6 Discussion items. (You may use your own dataset; please don’t use my Sample Post dataset, since that was used in your Week 6 assignment!) For each of the three chosen datasets:

Create a .CSV file (or optionally, a MySQL database!) that includes all of the information included in the dataset. You’re encouraged to use a “wide” structure similar to how the informatIon appears in the discussion item, so that you can practice tidying and transformations as described below.

Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data. [Most of your grade will be based on this step!]

Perform the analysis requested in the discussion item.

Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions.

  1. Please include in your homework submission, for each of the three chosen datasets:

The URL to the .Rmd file in your GitHub repository, and

The URL for your rpubs.com web page.

library(knitr)
library(stringr)
library(tidyr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)

Dataset 1

For the first dataset, I wanted to use the original file from FiveThirtyEight article Should Travelers Avoid Flying Airplanes That Have Had Crashes in the Past?. I chose this as my discussion post and I wanted to compare the number of incidents, fatal incidents, and fatalities. I wanted to see if there was a correlation between the time frame of airline safety over a span of thirty years.

# Upload data set 
air_safe<-read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/airline-safety/airline-safety.csv", header = FALSE, stringsAsFactors = FALSE)

head(air_safe)
##                      V1                     V2              V3
## 1               airline avail_seat_km_per_week incidents_85_99
## 2            Aer Lingus              320906734               2
## 3             Aeroflot*             1197672318              76
## 4 Aerolineas Argentinas              385803648               6
## 5           Aeromexico*              596871813               3
## 6            Air Canada             1865253802               2
##                      V4               V5              V6                    V7
## 1 fatal_accidents_85_99 fatalities_85_99 incidents_00_14 fatal_accidents_00_14
## 2                     0                0               0                     0
## 3                    14              128               6                     1
## 4                     0                0               1                     0
## 5                     1               64               5                     0
## 6                     0                0               2                     0
##                 V8
## 1 fatalities_00_14
## 2                0
## 3               88
## 4                0
## 5                0
## 6                0
summary(air_safe)
##       V1                 V2                 V3                 V4           
##  Length:57          Length:57          Length:57          Length:57         
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##       V5                 V6                 V7                 V8           
##  Length:57          Length:57          Length:57          Length:57         
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character

As we can see here, we need to do some data cleaning before doing our analysis.

# Renaming row header
names(air_safe) <- air_safe[1,]
head(air_safe)
##                 airline avail_seat_km_per_week incidents_85_99
## 1               airline avail_seat_km_per_week incidents_85_99
## 2            Aer Lingus              320906734               2
## 3             Aeroflot*             1197672318              76
## 4 Aerolineas Argentinas              385803648               6
## 5           Aeromexico*              596871813               3
## 6            Air Canada             1865253802               2
##   fatal_accidents_85_99 fatalities_85_99 incidents_00_14 fatal_accidents_00_14
## 1 fatal_accidents_85_99 fatalities_85_99 incidents_00_14 fatal_accidents_00_14
## 2                     0                0               0                     0
## 3                    14              128               6                     1
## 4                     0                0               1                     0
## 5                     1               64               5                     0
## 6                     0                0               2                     0
##   fatalities_00_14
## 1 fatalities_00_14
## 2                0
## 3               88
## 4                0
## 5                0
## 6                0
#Remove row 1

air_safe <- air_safe[-c(1),]
head(air_safe)
##                 airline avail_seat_km_per_week incidents_85_99
## 2            Aer Lingus              320906734               2
## 3             Aeroflot*             1197672318              76
## 4 Aerolineas Argentinas              385803648               6
## 5           Aeromexico*              596871813               3
## 6            Air Canada             1865253802               2
## 7            Air France             3004002661              14
##   fatal_accidents_85_99 fatalities_85_99 incidents_00_14 fatal_accidents_00_14
## 2                     0                0               0                     0
## 3                    14              128               6                     1
## 4                     0                0               1                     0
## 5                     1               64               5                     0
## 6                     0                0               2                     0
## 7                     4               79               6                     2
##   fatalities_00_14
## 2                0
## 3               88
## 4                0
## 5                0
## 6                0
## 7              337

I noticed that in order for me to mutate and add additional variables, I have to convert the chr columns that are numeric into numeric.

# Convert from chr to numeric

air_safe$avail_seat_km_per_week <- as.numeric(as.character(air_safe$avail_seat_km_per_week))

air_safe$incidents_85_99<-as.numeric(as.character(air_safe$incidents_85_99))

air_safe$fatal_accidents_85_99<-as.numeric(as.character(air_safe$fatal_accidents_85_99))

air_safe$fatalities_85_99<-as.numeric(as.character(air_safe$fatalities_85_99))

air_safe$incidents_00_14<-as.numeric(as.character(air_safe$incidents_00_14))

air_safe$fatal_accidents_00_14<-as.numeric(as.character(air_safe$fatal_accidents_00_14))

air_safe$fatalities_00_14<-as.numeric(as.character(air_safe$fatalities_00_14))

head(air_safe)
##                 airline avail_seat_km_per_week incidents_85_99
## 2            Aer Lingus              320906734               2
## 3             Aeroflot*             1197672318              76
## 4 Aerolineas Argentinas              385803648               6
## 5           Aeromexico*              596871813               3
## 6            Air Canada             1865253802               2
## 7            Air France             3004002661              14
##   fatal_accidents_85_99 fatalities_85_99 incidents_00_14 fatal_accidents_00_14
## 2                     0                0               0                     0
## 3                    14              128               6                     1
## 4                     0                0               1                     0
## 5                     1               64               5                     0
## 6                     0                0               2                     0
## 7                     4               79               6                     2
##   fatalities_00_14
## 2                0
## 3               88
## 4                0
## 5                0
## 6                0
## 7              337

Let’s create a total tab that adds the total number of incidents, fatal accidents, and fatalities in 1985 to 1999:

air_safe <- air_safe %>%
    mutate(total_85_99 = incidents_85_99 + fatal_accidents_85_99 + fatalities_85_99, total_00_14 = incidents_00_14 + fatal_accidents_00_14 + fatalities_00_14)

head(air_safe)
##                 airline avail_seat_km_per_week incidents_85_99
## 1            Aer Lingus              320906734               2
## 2             Aeroflot*             1197672318              76
## 3 Aerolineas Argentinas              385803648               6
## 4           Aeromexico*              596871813               3
## 5            Air Canada             1865253802               2
## 6            Air France             3004002661              14
##   fatal_accidents_85_99 fatalities_85_99 incidents_00_14 fatal_accidents_00_14
## 1                     0                0               0                     0
## 2                    14              128               6                     1
## 3                     0                0               1                     0
## 4                     1               64               5                     0
## 5                     0                0               2                     0
## 6                     4               79               6                     2
##   fatalities_00_14 total_85_99 total_00_14
## 1                0           2           0
## 2               88         218          95
## 3                0           6           1
## 4                0          68           5
## 5                0           2           2
## 6              337          97         345

I wanted to create a new dataset that focused only on the airline and their totals from 85-99 and 00-14

air_total<-select(air_safe, airline, total_85_99, total_00_14)
head(air_total)
##                 airline total_85_99 total_00_14
## 1            Aer Lingus           2           0
## 2             Aeroflot*         218          95
## 3 Aerolineas Argentinas           6           1
## 4           Aeromexico*          68           5
## 5            Air Canada           2           2
## 6            Air France          97         345
summary(air_total)
##    airline           total_85_99      total_00_14   
##  Length:56          Min.   :  0.00   Min.   :  0.0  
##  Class :character   1st Qu.:  4.75   1st Qu.:  1.0  
##  Mode  :character   Median : 55.00   Median :  5.0  
##                     Mean   :121.77   Mean   : 60.3  
##                     3rd Qu.:221.50   3rd Qu.: 88.0  
##                     Max.   :553.00   Max.   :542.0

I used the spread function that I created for air_total to display the Airline numbers in a wide data set.

# Wide from 85 to 99
air_wide85 <-spread(air_total, airline, total_85_99)
head(air_wide85)
##   total_00_14 Aer Lingus Aeroflot* Aerolineas Argentinas Aeromexico* Air Canada
## 1           0          2        NA                    NA          NA         NA
## 2           1         NA        NA                     6          NA         NA
## 3           2         NA        NA                    NA          NA          2
## 4           3         NA        NA                    NA          NA         NA
## 5           4         NA        NA                    NA          NA         NA
## 6           5         NA        NA                    NA          68         NA
##   Air France Air India* Air New Zealand* Alaska Airlines* Alitalia
## 1         NA         NA               NA               NA       NA
## 2         NA         NA               NA               NA       NA
## 3         NA         NA               NA               NA       NA
## 4         NA         NA               NA               NA       NA
## 5         NA         NA               NA               NA       59
## 6         NA         NA               NA               NA       NA
##   All Nippon Airways American* Austrian Airlines Avianca British Airways*
## 1                 NA        NA                NA     331               NA
## 2                 NA        NA                 1      NA               NA
## 3                 NA        NA                NA      NA               NA
## 4                 NA        NA                NA      NA               NA
## 5                 NA        NA                NA      NA               NA
## 6                 NA        NA                NA      NA               NA
##   Cathay Pacific* China Airlines Condor COPA Delta / Northwest* Egyptair El Al
## 1              NA             NA     19   51                 NA       NA    NA
## 2              NA             NA     NA   NA                 NA       NA     6
## 3               0             NA     NA   NA                 NA       NA    NA
## 4              NA             NA     NA   NA                 NA       NA    NA
## 5              NA             NA     NA   NA                 NA       NA    NA
## 6              NA             NA     NA   NA                 NA       NA    NA
##   Ethiopian Airlines Finnair Garuda Indonesia Gulf Air Hawaiian Airlines Iberia
## 1                 NA       1               NA       NA                NA     NA
## 2                 NA      NA               NA       NA                 0     NA
## 3                 NA      NA               NA       NA                NA     NA
## 4                 NA      NA               NA       NA                NA     NA
## 5                 NA      NA               NA       NA                NA     NA
## 6                 NA      NA               NA       NA                NA    153
##   Japan Airlines Kenya Airways KLM* Korean Air LAN Airlines Lufthansa*
## 1            524            NA   NA         NA           26         NA
## 2             NA            NA   11        442           NA         NA
## 3             NA            NA   NA         NA           NA         NA
## 4             NA            NA   NA         NA           NA          9
## 5             NA            NA   NA         NA           NA         NA
## 6             NA            NA   NA         NA           NA         NA
##   Malaysia Airlines Pakistan International Philippine Airlines Qantas*
## 1                NA                     NA                  NA      NA
## 2                NA                     NA                  NA      NA
## 3                NA                     NA                  NA      NA
## 4                NA                     NA                  NA      NA
## 5                NA                     NA                  85      NA
## 6                NA                     NA                  NA       1
##   Royal Air Maroc SAS* Saudi Arabian Singapore Airlines South African
## 1              NA   NA            NA                 NA            NA
## 2              NA   NA            NA                 NA           162
## 3              NA   NA            NA                 NA            NA
## 4              59   NA            NA                 NA            NA
## 5              NA   NA            NA                 NA            NA
## 6              NA   NA            NA                 NA            NA
##   Southwest Airlines Sri Lankan / AirLanka SWISS* TACA TAM TAP - Air Portugal
## 1                 NA                    NA     NA   NA  NA                  0
## 2                 NA                    NA     NA   NA  NA                 NA
## 3                 NA                    NA     NA   NA  NA                 NA
## 4                 NA                    NA    232   NA  NA                 NA
## 5                 NA                    17     NA   NA  NA                 NA
## 6                 NA                    NA     NA    7  NA                 NA
##   Thai Airways Turkish Airlines United / Continental*
## 1           NA               NA                    NA
## 2           NA               NA                    NA
## 3           NA               NA                    NA
## 4           NA               NA                    NA
## 5          320               NA                    NA
## 6           NA               NA                    NA
##   US Airways / America West* Vietnam Airlines Virgin Atlantic Xiamen Airlines
## 1                         NA               NA               1              NA
## 2                         NA              181              NA              NA
## 3                         NA               NA              NA              92
## 4                         NA               NA              NA              NA
## 5                         NA               NA              NA              NA
## 6                         NA               NA              NA              NA
summary(air_wide85)
##   total_00_14       Aer Lingus   Aeroflot*   Aerolineas Argentinas  Aeromexico*
##  Min.   :  0.00   Min.   :2    Min.   :218   Min.   :6             Min.   :68  
##  1st Qu.:  7.25   1st Qu.:2    1st Qu.:218   1st Qu.:6             1st Qu.:68  
##  Median : 67.50   Median :2    Median :218   Median :6             Median :68  
##  Mean   :107.97   Mean   :2    Mean   :218   Mean   :6             Mean   :68  
##  3rd Qu.:141.50   3rd Qu.:2    3rd Qu.:218   3rd Qu.:6             3rd Qu.:68  
##  Max.   :542.00   Max.   :2    Max.   :218   Max.   :6             Max.   :68  
##                   NA's   :29   NA's   :29    NA's   :29            NA's   :29  
##    Air Canada   Air France   Air India*  Air New Zealand* Alaska Airlines*
##  Min.   :2    Min.   :97   Min.   :332   Min.   :3        Min.   :5       
##  1st Qu.:2    1st Qu.:97   1st Qu.:332   1st Qu.:3        1st Qu.:5       
##  Median :2    Median :97   Median :332   Median :3        Median :5       
##  Mean   :2    Mean   :97   Mean   :332   Mean   :3        Mean   :5       
##  3rd Qu.:2    3rd Qu.:97   3rd Qu.:332   3rd Qu.:3        3rd Qu.:5       
##  Max.   :2    Max.   :97   Max.   :332   Max.   :3        Max.   :5       
##  NA's   :29   NA's   :29   NA's   :29    NA's   :29       NA's   :29      
##     Alitalia  All Nippon Airways   American*   Austrian Airlines    Avianca   
##  Min.   :59   Min.   :5          Min.   :127   Min.   :1         Min.   :331  
##  1st Qu.:59   1st Qu.:5          1st Qu.:127   1st Qu.:1         1st Qu.:331  
##  Median :59   Median :5          Median :127   Median :1         Median :331  
##  Mean   :59   Mean   :5          Mean   :127   Mean   :1         Mean   :331  
##  3rd Qu.:59   3rd Qu.:5          3rd Qu.:127   3rd Qu.:1         3rd Qu.:331  
##  Max.   :59   Max.   :5          Max.   :127   Max.   :1         Max.   :331  
##  NA's   :29   NA's   :29         NA's   :29    NA's   :29        NA's   :29   
##  British Airways* Cathay Pacific* China Airlines     Condor        COPA   
##  Min.   :4        Min.   :0       Min.   :553    Min.   :19   Min.   :51  
##  1st Qu.:4        1st Qu.:0       1st Qu.:553    1st Qu.:19   1st Qu.:51  
##  Median :4        Median :0       Median :553    Median :19   Median :51  
##  Mean   :4        Mean   :0       Mean   :553    Mean   :19   Mean   :51  
##  3rd Qu.:4        3rd Qu.:0       3rd Qu.:553    3rd Qu.:19   3rd Qu.:51  
##  Max.   :4        Max.   :0       Max.   :553    Max.   :19   Max.   :51  
##  NA's   :29       NA's   :29      NA's   :29     NA's   :29   NA's   :29  
##  Delta / Northwest*    Egyptair       El Al    Ethiopian Airlines    Finnair  
##  Min.   :443        Min.   :293   Min.   :6    Min.   :197        Min.   :1   
##  1st Qu.:443        1st Qu.:293   1st Qu.:6    1st Qu.:197        1st Qu.:1   
##  Median :443        Median :293   Median :6    Median :197        Median :1   
##  Mean   :443        Mean   :293   Mean   :6    Mean   :197        Mean   :1   
##  3rd Qu.:443        3rd Qu.:293   3rd Qu.:6    3rd Qu.:197        3rd Qu.:1   
##  Max.   :443        Max.   :293   Max.   :6    Max.   :197        Max.   :1   
##  NA's   :29         NA's   :29    NA's   :29   NA's   :29         NA's   :29  
##  Garuda Indonesia    Gulf Air  Hawaiian Airlines     Iberia    Japan Airlines
##  Min.   :273      Min.   :1    Min.   :0         Min.   :153   Min.   :524   
##  1st Qu.:273      1st Qu.:1    1st Qu.:0         1st Qu.:153   1st Qu.:524   
##  Median :273      Median :1    Median :0         Median :153   Median :524   
##  Mean   :273      Mean   :1    Mean   :0         Mean   :153   Mean   :524   
##  3rd Qu.:273      3rd Qu.:1    3rd Qu.:0         3rd Qu.:153   3rd Qu.:524   
##  Max.   :273      Max.   :1    Max.   :0         Max.   :153   Max.   :524   
##  NA's   :29       NA's   :29   NA's   :29        NA's   :29    NA's   :29    
##  Kenya Airways      KLM*      Korean Air   LAN Airlines   Lufthansa*
##  Min.   :2     Min.   :11   Min.   :442   Min.   :26    Min.   :9   
##  1st Qu.:2     1st Qu.:11   1st Qu.:442   1st Qu.:26    1st Qu.:9   
##  Median :2     Median :11   Median :442   Median :26    Median :9   
##  Mean   :2     Mean   :11   Mean   :442   Mean   :26    Mean   :9   
##  3rd Qu.:2     3rd Qu.:11   3rd Qu.:442   3rd Qu.:26    3rd Qu.:9   
##  Max.   :2     Max.   :11   Max.   :442   Max.   :26    Max.   :9   
##  NA's   :29    NA's   :29   NA's   :29    NA's   :29    NA's   :29  
##  Malaysia Airlines Pakistan International Philippine Airlines    Qantas*  
##  Min.   :38        Min.   :245            Min.   :85          Min.   :1   
##  1st Qu.:38        1st Qu.:245            1st Qu.:85          1st Qu.:1   
##  Median :38        Median :245            Median :85          Median :1   
##  Mean   :38        Mean   :245            Mean   :85          Mean   :1   
##  3rd Qu.:38        3rd Qu.:245            3rd Qu.:85          3rd Qu.:1   
##  Max.   :38        Max.   :245            Max.   :85          Max.   :1   
##  NA's   :29        NA's   :29             NA's   :29          NA's   :29  
##  Royal Air Maroc      SAS*    Saudi Arabian Singapore Airlines South African
##  Min.   :59      Min.   :5    Min.   :322   Min.   :10         Min.   :162  
##  1st Qu.:59      1st Qu.:5    1st Qu.:322   1st Qu.:10         1st Qu.:162  
##  Median :59      Median :5    Median :322   Median :10         Median :162  
##  Mean   :59      Mean   :5    Mean   :322   Mean   :10         Mean   :162  
##  3rd Qu.:59      3rd Qu.:5    3rd Qu.:322   3rd Qu.:10         3rd Qu.:162  
##  Max.   :59      Max.   :5    Max.   :322   Max.   :10         Max.   :162  
##  NA's   :29      NA's   :29   NA's   :29    NA's   :29         NA's   :29   
##  Southwest Airlines Sri Lankan / AirLanka     SWISS*         TACA   
##  Min.   :1          Min.   :17            Min.   :232   Min.   :7   
##  1st Qu.:1          1st Qu.:17            1st Qu.:232   1st Qu.:7   
##  Median :1          Median :17            Median :232   Median :7   
##  Mean   :1          Mean   :17            Mean   :232   Mean   :7   
##  3rd Qu.:1          3rd Qu.:17            3rd Qu.:232   3rd Qu.:7   
##  Max.   :1          Max.   :17            Max.   :232   Max.   :7   
##  NA's   :29         NA's   :29            NA's   :29    NA's   :29  
##       TAM      TAP - Air Portugal  Thai Airways Turkish Airlines
##  Min.   :109   Min.   :0          Min.   :320   Min.   :75      
##  1st Qu.:109   1st Qu.:0          1st Qu.:320   1st Qu.:75      
##  Median :109   Median :0          Median :320   Median :75      
##  Mean   :109   Mean   :0          Mean   :320   Mean   :75      
##  3rd Qu.:109   3rd Qu.:0          3rd Qu.:320   3rd Qu.:75      
##  Max.   :109   Max.   :0          Max.   :320   Max.   :75      
##  NA's   :29    NA's   :29         NA's   :29    NA's   :29      
##  United / Continental* US Airways / America West* Vietnam Airlines
##  Min.   :346           Min.   :247                Min.   :181     
##  1st Qu.:346           1st Qu.:247                1st Qu.:181     
##  Median :346           Median :247                Median :181     
##  Mean   :346           Mean   :247                Mean   :181     
##  3rd Qu.:346           3rd Qu.:247                3rd Qu.:181     
##  Max.   :346           Max.   :247                Max.   :181     
##  NA's   :29            NA's   :29                 NA's   :29      
##  Virgin Atlantic Xiamen Airlines
##  Min.   :1       Min.   :92     
##  1st Qu.:1       1st Qu.:92     
##  Median :1       Median :92     
##  Mean   :1       Mean   :92     
##  3rd Qu.:1       3rd Qu.:92     
##  Max.   :1       Max.   :92     
##  NA's   :29      NA's   :29
# Wide from 85 to 99
air_wide00 <-spread(air_total, airline, total_00_14)
head(air_wide00)
##   total_85_99 Aer Lingus Aeroflot* Aerolineas Argentinas Aeromexico* Air Canada
## 1           0         NA        NA                    NA          NA         NA
## 2           1         NA        NA                    NA          NA         NA
## 3           2          0        NA                    NA          NA          2
## 4           3         NA        NA                    NA          NA         NA
## 5           4         NA        NA                    NA          NA         NA
## 6           5         NA        NA                    NA          NA         NA
##   Air France Air India* Air New Zealand* Alaska Airlines* Alitalia
## 1         NA         NA               NA               NA       NA
## 2         NA         NA               NA               NA       NA
## 3         NA         NA               NA               NA       NA
## 4         NA         NA               13               NA       NA
## 5         NA         NA               NA               NA       NA
## 6         NA         NA               NA               94       NA
##   All Nippon Airways American* Austrian Airlines Avianca British Airways*
## 1                 NA        NA                NA      NA               NA
## 2                 NA        NA                 1      NA               NA
## 3                 NA        NA                NA      NA               NA
## 4                 NA        NA                NA      NA               NA
## 5                 NA        NA                NA      NA                6
## 6                  7        NA                NA      NA               NA
##   Cathay Pacific* China Airlines Condor COPA Delta / Northwest* Egyptair El Al
## 1               2             NA     NA   NA                 NA       NA    NA
## 2              NA             NA     NA   NA                 NA       NA    NA
## 3              NA             NA     NA   NA                 NA       NA    NA
## 4              NA             NA     NA   NA                 NA       NA    NA
## 5              NA             NA     NA   NA                 NA       NA    NA
## 6              NA             NA     NA   NA                 NA       NA    NA
##   Ethiopian Airlines Finnair Garuda Indonesia Gulf Air Hawaiian Airlines Iberia
## 1                 NA      NA               NA       NA                 1     NA
## 2                 NA       0               NA      147                NA     NA
## 3                 NA      NA               NA       NA                NA     NA
## 4                 NA      NA               NA       NA                NA     NA
## 5                 NA      NA               NA       NA                NA     NA
## 6                 NA      NA               NA       NA                NA     NA
##   Japan Airlines Kenya Airways KLM* Korean Air LAN Airlines Lufthansa*
## 1             NA            NA   NA         NA           NA         NA
## 2             NA            NA   NA         NA           NA         NA
## 3             NA           287   NA         NA           NA         NA
## 4             NA            NA   NA         NA           NA         NA
## 5             NA            NA   NA         NA           NA         NA
## 6             NA            NA   NA         NA           NA         NA
##   Malaysia Airlines Pakistan International Philippine Airlines Qantas*
## 1                NA                     NA                  NA      NA
## 2                NA                     NA                  NA       5
## 3                NA                     NA                  NA      NA
## 4                NA                     NA                  NA      NA
## 5                NA                     NA                  NA      NA
## 6                NA                     NA                  NA      NA
##   Royal Air Maroc SAS* Saudi Arabian Singapore Airlines South African
## 1              NA   NA            NA                 NA            NA
## 2              NA   NA            NA                 NA            NA
## 3              NA   NA            NA                 NA            NA
## 4              NA   NA            NA                 NA            NA
## 5              NA   NA            NA                 NA            NA
## 6              NA  117            NA                 NA            NA
##   Southwest Airlines Sri Lankan / AirLanka SWISS* TACA TAM TAP - Air Portugal
## 1                 NA                    NA     NA   NA  NA                  0
## 2                  8                    NA     NA   NA  NA                 NA
## 3                 NA                    NA     NA   NA  NA                 NA
## 4                 NA                    NA     NA   NA  NA                 NA
## 5                 NA                    NA     NA   NA  NA                 NA
## 6                 NA                    NA     NA   NA  NA                 NA
##   Thai Airways Turkish Airlines United / Continental*
## 1           NA               NA                    NA
## 2           NA               NA                    NA
## 3           NA               NA                    NA
## 4           NA               NA                    NA
## 5           NA               NA                    NA
## 6           NA               NA                    NA
##   US Airways / America West* Vietnam Airlines Virgin Atlantic Xiamen Airlines
## 1                         NA               NA              NA              NA
## 2                         NA               NA               0              NA
## 3                         NA               NA              NA              NA
## 4                         NA               NA              NA              NA
## 5                         NA               NA              NA              NA
## 6                         NA               NA              NA              NA
summary(air_wide00)
##   total_85_99      Aer Lingus   Aeroflot*  Aerolineas Argentinas  Aeromexico*
##  Min.   :  0.0   Min.   :0    Min.   :95   Min.   :1             Min.   :5   
##  1st Qu.: 14.0   1st Qu.:0    1st Qu.:95   1st Qu.:1             1st Qu.:5   
##  Median : 97.0   Median :0    Median :95   Median :1             Median :5   
##  Mean   :156.6   Mean   :0    Mean   :95   Mean   :1             Mean   :5   
##  3rd Qu.:260.0   3rd Qu.:0    3rd Qu.:95   3rd Qu.:1             3rd Qu.:5   
##  Max.   :553.0   Max.   :0    Max.   :95   Max.   :1             Max.   :5   
##                  NA's   :42   NA's   :42   NA's   :42            NA's   :42  
##    Air Canada   Air France    Air India*  Air New Zealand* Alaska Airlines*
##  Min.   :2    Min.   :345   Min.   :163   Min.   :13       Min.   :94      
##  1st Qu.:2    1st Qu.:345   1st Qu.:163   1st Qu.:13       1st Qu.:94      
##  Median :2    Median :345   Median :163   Median :13       Median :94      
##  Mean   :2    Mean   :345   Mean   :163   Mean   :13       Mean   :94      
##  3rd Qu.:2    3rd Qu.:345   3rd Qu.:163   3rd Qu.:13       3rd Qu.:94      
##  Max.   :2    Max.   :345   Max.   :163   Max.   :13       Max.   :94      
##  NA's   :42   NA's   :42    NA's   :42    NA's   :42       NA's   :42      
##     Alitalia  All Nippon Airways   American*   Austrian Airlines    Avianca  
##  Min.   :4    Min.   :7          Min.   :436   Min.   :1         Min.   :0   
##  1st Qu.:4    1st Qu.:7          1st Qu.:436   1st Qu.:1         1st Qu.:0   
##  Median :4    Median :7          Median :436   Median :1         Median :0   
##  Mean   :4    Mean   :7          Mean   :436   Mean   :1         Mean   :0   
##  3rd Qu.:4    3rd Qu.:7          3rd Qu.:436   3rd Qu.:1         3rd Qu.:0   
##  Max.   :4    Max.   :7          Max.   :436   Max.   :1         Max.   :0   
##  NA's   :42   NA's   :42         NA's   :42    NA's   :42        NA's   :42  
##  British Airways* Cathay Pacific* China Airlines     Condor        COPA   
##  Min.   :6        Min.   :2       Min.   :228    Min.   :0    Min.   :0   
##  1st Qu.:6        1st Qu.:2       1st Qu.:228    1st Qu.:0    1st Qu.:0   
##  Median :6        Median :2       Median :228    Median :0    Median :0   
##  Mean   :6        Mean   :2       Mean   :228    Mean   :0    Mean   :0   
##  3rd Qu.:6        3rd Qu.:2       3rd Qu.:228    3rd Qu.:0    3rd Qu.:0   
##  Max.   :6        Max.   :2       Max.   :228    Max.   :0    Max.   :0   
##  NA's   :42       NA's   :42      NA's   :42     NA's   :42   NA's   :42  
##  Delta / Northwest*    Egyptair      El Al    Ethiopian Airlines    Finnair  
##  Min.   :77         Min.   :19   Min.   :1    Min.   :99         Min.   :0   
##  1st Qu.:77         1st Qu.:19   1st Qu.:1    1st Qu.:99         1st Qu.:0   
##  Median :77         Median :19   Median :1    Median :99         Median :0   
##  Mean   :77         Mean   :19   Mean   :1    Mean   :99         Mean   :0   
##  3rd Qu.:77         3rd Qu.:19   3rd Qu.:1    3rd Qu.:99         3rd Qu.:0   
##  Max.   :77         Max.   :19   Max.   :1    Max.   :99         Max.   :0   
##  NA's   :42         NA's   :42   NA's   :42   NA's   :42         NA's   :42  
##  Garuda Indonesia    Gulf Air   Hawaiian Airlines     Iberia   Japan Airlines
##  Min.   :28       Min.   :147   Min.   :1         Min.   :5    Min.   :0     
##  1st Qu.:28       1st Qu.:147   1st Qu.:1         1st Qu.:5    1st Qu.:0     
##  Median :28       Median :147   Median :1         Median :5    Median :0     
##  Mean   :28       Mean   :147   Mean   :1         Mean   :5    Mean   :0     
##  3rd Qu.:28       3rd Qu.:147   3rd Qu.:1         3rd Qu.:5    3rd Qu.:0     
##  Max.   :28       Max.   :147   Max.   :1         Max.   :5    Max.   :0     
##  NA's   :42       NA's   :42    NA's   :42        NA's   :42   NA's   :42    
##  Kenya Airways      KLM*      Korean Air  LAN Airlines   Lufthansa*
##  Min.   :287   Min.   :1    Min.   :1    Min.   :0     Min.   :3   
##  1st Qu.:287   1st Qu.:1    1st Qu.:1    1st Qu.:0     1st Qu.:3   
##  Median :287   Median :1    Median :1    Median :0     Median :3   
##  Mean   :287   Mean   :1    Mean   :1    Mean   :0     Mean   :3   
##  3rd Qu.:287   3rd Qu.:1    3rd Qu.:1    3rd Qu.:0     3rd Qu.:3   
##  Max.   :287   Max.   :1    Max.   :1    Max.   :0     Max.   :3   
##  NA's   :42    NA's   :42   NA's   :42   NA's   :42    NA's   :42  
##  Malaysia Airlines Pakistan International Philippine Airlines    Qantas*  
##  Min.   :542       Min.   :58             Min.   :4           Min.   :5   
##  1st Qu.:542       1st Qu.:58             1st Qu.:4           1st Qu.:5   
##  Median :542       Median :58             Median :4           Median :5   
##  Mean   :542       Mean   :58             Mean   :4           Mean   :5   
##  3rd Qu.:542       3rd Qu.:58             3rd Qu.:4           3rd Qu.:5   
##  Max.   :542       Max.   :58             Max.   :4           Max.   :5   
##  NA's   :42        NA's   :42             NA's   :42          NA's   :42  
##  Royal Air Maroc      SAS*     Saudi Arabian Singapore Airlines South African
##  Min.   :3       Min.   :117   Min.   :11    Min.   :86         Min.   :1    
##  1st Qu.:3       1st Qu.:117   1st Qu.:11    1st Qu.:86         1st Qu.:1    
##  Median :3       Median :117   Median :11    Median :86         Median :1    
##  Mean   :3       Mean   :117   Mean   :11    Mean   :86         Mean   :1    
##  3rd Qu.:3       3rd Qu.:117   3rd Qu.:11    3rd Qu.:86         3rd Qu.:1    
##  Max.   :3       Max.   :117   Max.   :11    Max.   :86         Max.   :1    
##  NA's   :42      NA's   :42    NA's   :42    NA's   :42         NA's   :42   
##  Southwest Airlines Sri Lankan / AirLanka     SWISS*        TACA   
##  Min.   :8          Min.   :4             Min.   :3    Min.   :5   
##  1st Qu.:8          1st Qu.:4             1st Qu.:3    1st Qu.:5   
##  Median :8          Median :4             Median :3    Median :5   
##  Mean   :8          Mean   :4             Mean   :3    Mean   :5   
##  3rd Qu.:8          3rd Qu.:4             3rd Qu.:3    3rd Qu.:5   
##  Max.   :8          Max.   :4             Max.   :3    Max.   :5   
##  NA's   :42         NA's   :42            NA's   :42   NA's   :42  
##       TAM      TAP - Air Portugal  Thai Airways Turkish Airlines
##  Min.   :197   Min.   :0          Min.   :4     Min.   :94      
##  1st Qu.:197   1st Qu.:0          1st Qu.:4     1st Qu.:94      
##  Median :197   Median :0          Median :4     Median :94      
##  Mean   :197   Mean   :0          Mean   :4     Mean   :94      
##  3rd Qu.:197   3rd Qu.:0          3rd Qu.:4     3rd Qu.:94      
##  Max.   :197   Max.   :0          Max.   :4     Max.   :94      
##  NA's   :42    NA's   :42         NA's   :42    NA's   :42      
##  United / Continental* US Airways / America West* Vietnam Airlines
##  Min.   :125           Min.   :36                 Min.   :1       
##  1st Qu.:125           1st Qu.:36                 1st Qu.:1       
##  Median :125           Median :36                 Median :1       
##  Mean   :125           Mean   :36                 Mean   :1       
##  3rd Qu.:125           3rd Qu.:36                 3rd Qu.:1       
##  Max.   :125           Max.   :36                 Max.   :1       
##  NA's   :42            NA's   :42                 NA's   :42      
##  Virgin Atlantic Xiamen Airlines
##  Min.   :0       Min.   :2      
##  1st Qu.:0       1st Qu.:2      
##  Median :0       Median :2      
##  Mean   :0       Mean   :2      
##  3rd Qu.:0       3rd Qu.:2      
##  Max.   :0       Max.   :2      
##  NA's   :42      NA's   :42

I wanted to sort who had the most incidents, fatal accidents, and fatalities between 1985 and 1999.

air_max85<-air_total[order(air_total$total_85_99, decreasing = TRUE),]
head(air_max85)
##                  airline total_85_99 total_00_14
## 17        China Airlines         553         228
## 29        Japan Airlines         524           0
## 20    Delta / Northwest*         443          77
## 32            Korean Air         442           1
## 52 United / Continental*         346         125
## 7             Air India*         332         163
summary(air_max85)
##    airline           total_85_99      total_00_14   
##  Length:56          Min.   :  0.00   Min.   :  0.0  
##  Class :character   1st Qu.:  4.75   1st Qu.:  1.0  
##  Mode  :character   Median : 55.00   Median :  5.0  
##                     Mean   :121.77   Mean   : 60.3  
##                     3rd Qu.:221.50   3rd Qu.: 88.0  
##                     Max.   :553.00   Max.   :542.0

According to the data, China Airlines had the most combined incidents, fatal incidents, and fatalities from 1985 to 1999.

air_max00<-air_total[order(air_total$total_00_14, decreasing = TRUE),]
head(air_max00)
##              airline total_85_99 total_00_14
## 35 Malaysia Airlines          38         542
## 12         American*         127         436
## 6         Air France          97         345
## 30     Kenya Airways           2         287
## 17    China Airlines         553         228
## 48               TAM         109         197

On the other hand, Malaysia Airlines had the most combined incidents, fatal incidents, and fatalities from 2000 to 2014. China Airlines, however is in the top 5. This shows very little improvement over the span of 30 years.

ggplot(air_total, aes(x = "", y = total_85_99, fill = airline)) +  
          geom_bar(stat="identity",position="dodge") + 
          xlab("Airlines") + ylab("Incidents, Fatal Incidents, & Fatalities 1985 to 1999") 

ggplot(air_total, aes(x = "", y = total_00_14, fill = airline)) +  
          geom_bar(stat="identity",position="dodge") + 
          xlab("Airlines") + ylab("Incidents, Fatal Incidents, & Fatalities 2000 to 2014") 

I highly recommend looking and introducing flight times because the time of the day plays a major role in incidents, fatal incidents, and fatalities. In addition to this, I also recommend including the types of disasters that have occurred.

Dataset 2

I chose Magnus Skonberg’s dataset that he retrieved from the World Happiness Report on kaggle.com. I saved the csv file on my github page. We will compare the Happiness Score and GDP per capita for the Top 20 countries and see if there is a correlation between the two.

#Upload data set

happy<- read.csv("https://raw.githubusercontent.com/sagreen131/DATA-607-Project-2/main/2019.csv", header = TRUE, stringsAsFactors = FALSE)

head(happy)
##   Overall.rank Country.or.region Score GDP.per.capita Social.support
## 1            1           Finland 7.769          1.340          1.587
## 2            2           Denmark 7.600          1.383          1.573
## 3            3            Norway 7.554          1.488          1.582
## 4            4           Iceland 7.494          1.380          1.624
## 5            5       Netherlands 7.488          1.396          1.522
## 6            6       Switzerland 7.480          1.452          1.526
##   Healthy.life.expectancy Freedom.to.make.life.choices Generosity
## 1                   0.986                        0.596      0.153
## 2                   0.996                        0.592      0.252
## 3                   1.028                        0.603      0.271
## 4                   1.026                        0.591      0.354
## 5                   0.999                        0.557      0.322
## 6                   1.052                        0.572      0.263
##   Perceptions.of.corruption
## 1                     0.393
## 2                     0.410
## 3                     0.341
## 4                     0.118
## 5                     0.298
## 6                     0.343
summary(happy)
##   Overall.rank    Country.or.region      Score       GDP.per.capita  
##  Min.   :  1.00   Length:156         Min.   :2.853   Min.   :0.0000  
##  1st Qu.: 39.75   Class :character   1st Qu.:4.545   1st Qu.:0.6028  
##  Median : 78.50   Mode  :character   Median :5.380   Median :0.9600  
##  Mean   : 78.50                      Mean   :5.407   Mean   :0.9051  
##  3rd Qu.:117.25                      3rd Qu.:6.184   3rd Qu.:1.2325  
##  Max.   :156.00                      Max.   :7.769   Max.   :1.6840  
##  Social.support  Healthy.life.expectancy Freedom.to.make.life.choices
##  Min.   :0.000   Min.   :0.0000          Min.   :0.0000              
##  1st Qu.:1.056   1st Qu.:0.5477          1st Qu.:0.3080              
##  Median :1.272   Median :0.7890          Median :0.4170              
##  Mean   :1.209   Mean   :0.7252          Mean   :0.3926              
##  3rd Qu.:1.452   3rd Qu.:0.8818          3rd Qu.:0.5072              
##  Max.   :1.624   Max.   :1.1410          Max.   :0.6310              
##    Generosity     Perceptions.of.corruption
##  Min.   :0.0000   Min.   :0.0000           
##  1st Qu.:0.1087   1st Qu.:0.0470           
##  Median :0.1775   Median :0.0855           
##  Mean   :0.1848   Mean   :0.1106           
##  3rd Qu.:0.2482   3rd Qu.:0.1412           
##  Max.   :0.5660   Max.   :0.4530

I will create a new data set and focus on the top 20 countries by overall rank.

happy20 <-happy[1:20,]

happy20
##    Overall.rank Country.or.region Score GDP.per.capita Social.support
## 1             1           Finland 7.769          1.340          1.587
## 2             2           Denmark 7.600          1.383          1.573
## 3             3            Norway 7.554          1.488          1.582
## 4             4           Iceland 7.494          1.380          1.624
## 5             5       Netherlands 7.488          1.396          1.522
## 6             6       Switzerland 7.480          1.452          1.526
## 7             7            Sweden 7.343          1.387          1.487
## 8             8       New Zealand 7.307          1.303          1.557
## 9             9            Canada 7.278          1.365          1.505
## 10           10           Austria 7.246          1.376          1.475
## 11           11         Australia 7.228          1.372          1.548
## 12           12        Costa Rica 7.167          1.034          1.441
## 13           13            Israel 7.139          1.276          1.455
## 14           14        Luxembourg 7.090          1.609          1.479
## 15           15    United Kingdom 7.054          1.333          1.538
## 16           16           Ireland 7.021          1.499          1.553
## 17           17           Germany 6.985          1.373          1.454
## 18           18           Belgium 6.923          1.356          1.504
## 19           19     United States 6.892          1.433          1.457
## 20           20    Czech Republic 6.852          1.269          1.487
##    Healthy.life.expectancy Freedom.to.make.life.choices Generosity
## 1                    0.986                        0.596      0.153
## 2                    0.996                        0.592      0.252
## 3                    1.028                        0.603      0.271
## 4                    1.026                        0.591      0.354
## 5                    0.999                        0.557      0.322
## 6                    1.052                        0.572      0.263
## 7                    1.009                        0.574      0.267
## 8                    1.026                        0.585      0.330
## 9                    1.039                        0.584      0.285
## 10                   1.016                        0.532      0.244
## 11                   1.036                        0.557      0.332
## 12                   0.963                        0.558      0.144
## 13                   1.029                        0.371      0.261
## 14                   1.012                        0.526      0.194
## 15                   0.996                        0.450      0.348
## 16                   0.999                        0.516      0.298
## 17                   0.987                        0.495      0.261
## 18                   0.986                        0.473      0.160
## 19                   0.874                        0.454      0.280
## 20                   0.920                        0.457      0.046
##    Perceptions.of.corruption
## 1                      0.393
## 2                      0.410
## 3                      0.341
## 4                      0.118
## 5                      0.298
## 6                      0.343
## 7                      0.373
## 8                      0.380
## 9                      0.308
## 10                     0.226
## 11                     0.290
## 12                     0.093
## 13                     0.082
## 14                     0.316
## 15                     0.278
## 16                     0.310
## 17                     0.265
## 18                     0.210
## 19                     0.128
## 20                     0.036
summary(happy20)
##   Overall.rank   Country.or.region      Score       GDP.per.capita 
##  Min.   : 1.00   Length:20          Min.   :6.852   Min.   :1.034  
##  1st Qu.: 5.75   Class :character   1st Qu.:7.046   1st Qu.:1.338  
##  Median :10.50   Mode  :character   Median :7.237   Median :1.375  
##  Mean   :10.50                      Mean   :7.245   Mean   :1.371  
##  3rd Qu.:15.25                      3rd Qu.:7.482   3rd Qu.:1.405  
##  Max.   :20.00                      Max.   :7.769   Max.   :1.609  
##  Social.support  Healthy.life.expectancy Freedom.to.make.life.choices
##  Min.   :1.441   Min.   :0.8740          Min.   :0.3710              
##  1st Qu.:1.478   1st Qu.:0.9868          1st Qu.:0.4895              
##  Median :1.514   Median :1.0040          Median :0.5570              
##  Mean   :1.518   Mean   :0.9990          Mean   :0.5322              
##  3rd Qu.:1.554   3rd Qu.:1.0265          3rd Qu.:0.5843              
##  Max.   :1.624   Max.   :1.0520          Max.   :0.6030              
##    Generosity     Perceptions.of.corruption
##  Min.   :0.0460   Min.   :0.0360           
##  1st Qu.:0.2315   1st Qu.:0.1895           
##  Median :0.2650   Median :0.2940           
##  Mean   :0.2532   Mean   :0.2599           
##  3rd Qu.:0.3040   3rd Qu.:0.3415           
##  Max.   :0.3540   Max.   :0.4100

Upon reviewing the new data set, I wanted to first compare some of the variables in the data set.

freedom<-happy20[order(happy20$Freedom.to.make.life.choices, decreasing = TRUE),]
head(freedom)
##   Overall.rank Country.or.region Score GDP.per.capita Social.support
## 3            3            Norway 7.554          1.488          1.582
## 1            1           Finland 7.769          1.340          1.587
## 2            2           Denmark 7.600          1.383          1.573
## 4            4           Iceland 7.494          1.380          1.624
## 8            8       New Zealand 7.307          1.303          1.557
## 9            9            Canada 7.278          1.365          1.505
##   Healthy.life.expectancy Freedom.to.make.life.choices Generosity
## 3                   1.028                        0.603      0.271
## 1                   0.986                        0.596      0.153
## 2                   0.996                        0.592      0.252
## 4                   1.026                        0.591      0.354
## 8                   1.026                        0.585      0.330
## 9                   1.039                        0.584      0.285
##   Perceptions.of.corruption
## 3                     0.341
## 1                     0.393
## 2                     0.410
## 4                     0.118
## 8                     0.380
## 9                     0.308
corruption<-happy20[order(happy20$Perceptions.of.corruption, decreasing = TRUE),]
head(corruption)
##   Overall.rank Country.or.region Score GDP.per.capita Social.support
## 2            2           Denmark 7.600          1.383          1.573
## 1            1           Finland 7.769          1.340          1.587
## 8            8       New Zealand 7.307          1.303          1.557
## 7            7            Sweden 7.343          1.387          1.487
## 6            6       Switzerland 7.480          1.452          1.526
## 3            3            Norway 7.554          1.488          1.582
##   Healthy.life.expectancy Freedom.to.make.life.choices Generosity
## 2                   0.996                        0.592      0.252
## 1                   0.986                        0.596      0.153
## 8                   1.026                        0.585      0.330
## 7                   1.009                        0.574      0.267
## 6                   1.052                        0.572      0.263
## 3                   1.028                        0.603      0.271
##   Perceptions.of.corruption
## 2                     0.410
## 1                     0.393
## 8                     0.380
## 7                     0.373
## 6                     0.343
## 3                     0.341
healthy<-happy20[order(happy20$Healthy.life.expectancy, decreasing = TRUE),]
head(healthy)
##    Overall.rank Country.or.region Score GDP.per.capita Social.support
## 6             6       Switzerland 7.480          1.452          1.526
## 9             9            Canada 7.278          1.365          1.505
## 11           11         Australia 7.228          1.372          1.548
## 13           13            Israel 7.139          1.276          1.455
## 3             3            Norway 7.554          1.488          1.582
## 4             4           Iceland 7.494          1.380          1.624
##    Healthy.life.expectancy Freedom.to.make.life.choices Generosity
## 6                    1.052                        0.572      0.263
## 9                    1.039                        0.584      0.285
## 11                   1.036                        0.557      0.332
## 13                   1.029                        0.371      0.261
## 3                    1.028                        0.603      0.271
## 4                    1.026                        0.591      0.354
##    Perceptions.of.corruption
## 6                      0.343
## 9                      0.308
## 11                     0.290
## 13                     0.082
## 3                      0.341
## 4                      0.118
generous<-happy20[order(happy20$Generosity, decreasing = TRUE),]
head(generous)
##    Overall.rank Country.or.region Score GDP.per.capita Social.support
## 4             4           Iceland 7.494          1.380          1.624
## 15           15    United Kingdom 7.054          1.333          1.538
## 11           11         Australia 7.228          1.372          1.548
## 8             8       New Zealand 7.307          1.303          1.557
## 5             5       Netherlands 7.488          1.396          1.522
## 16           16           Ireland 7.021          1.499          1.553
##    Healthy.life.expectancy Freedom.to.make.life.choices Generosity
## 4                    1.026                        0.591      0.354
## 15                   0.996                        0.450      0.348
## 11                   1.036                        0.557      0.332
## 8                    1.026                        0.585      0.330
## 5                    0.999                        0.557      0.322
## 16                   0.999                        0.516      0.298
##    Perceptions.of.corruption
## 4                      0.118
## 15                     0.278
## 11                     0.290
## 8                      0.380
## 5                      0.298
## 16                     0.310

Upon reviewing the 4 variables, I noticed that Norway is in the top 10 in each category. Finland is listed in the top 10 in 2 out of 4 variables. Sweden is in 3 out of the 4 variables. United States is listed in only 1 of the 4 variables.

Let’s add up the column variable by creating a new column variable and then compare them to the Score and GDP per capita.

happy20 <- happy20 %>%
    mutate(Total = Social.support + Healthy.life.expectancy + Freedom.to.make.life.choices + Generosity + Perceptions.of.corruption)
colnames(happy20)
##  [1] "Overall.rank"                 "Country.or.region"           
##  [3] "Score"                        "GDP.per.capita"              
##  [5] "Social.support"               "Healthy.life.expectancy"     
##  [7] "Freedom.to.make.life.choices" "Generosity"                  
##  [9] "Perceptions.of.corruption"    "Total"
happy20_total<-select(happy20, Country.or.region, Score, GDP.per.capita, Total)
head(happy20_total)
##   Country.or.region Score GDP.per.capita Total
## 1           Finland 7.769          1.340 3.715
## 2           Denmark 7.600          1.383 3.823
## 3            Norway 7.554          1.488 3.825
## 4           Iceland 7.494          1.380 3.713
## 5       Netherlands 7.488          1.396 3.698
## 6       Switzerland 7.480          1.452 3.756

I noticed that although Finland is at the top of the list overall, the total variables column I created does not dictate that Finland has the highest overall score.

# Sort by Total
happy20_total[order(happy20_total$Total, decreasing = TRUE),]
##    Country.or.region Score GDP.per.capita Total
## 8        New Zealand 7.307          1.303 3.878
## 3             Norway 7.554          1.488 3.825
## 2            Denmark 7.600          1.383 3.823
## 11         Australia 7.228          1.372 3.763
## 6        Switzerland 7.480          1.452 3.756
## 9             Canada 7.278          1.365 3.721
## 1            Finland 7.769          1.340 3.715
## 4            Iceland 7.494          1.380 3.713
## 7             Sweden 7.343          1.387 3.710
## 5        Netherlands 7.488          1.396 3.698
## 16           Ireland 7.021          1.499 3.676
## 15    United Kingdom 7.054          1.333 3.610
## 14        Luxembourg 7.090          1.609 3.527
## 10           Austria 7.246          1.376 3.493
## 17           Germany 6.985          1.373 3.462
## 18           Belgium 6.923          1.356 3.333
## 12        Costa Rica 7.167          1.034 3.199
## 13            Israel 7.139          1.276 3.198
## 19     United States 6.892          1.433 3.193
## 20    Czech Republic 6.852          1.269 2.946
# Sort by GDP
happy20_total[order(happy20_total$GDP.per.capita, decreasing = TRUE),]
##    Country.or.region Score GDP.per.capita Total
## 14        Luxembourg 7.090          1.609 3.527
## 16           Ireland 7.021          1.499 3.676
## 3             Norway 7.554          1.488 3.825
## 6        Switzerland 7.480          1.452 3.756
## 19     United States 6.892          1.433 3.193
## 5        Netherlands 7.488          1.396 3.698
## 7             Sweden 7.343          1.387 3.710
## 2            Denmark 7.600          1.383 3.823
## 4            Iceland 7.494          1.380 3.713
## 10           Austria 7.246          1.376 3.493
## 17           Germany 6.985          1.373 3.462
## 11         Australia 7.228          1.372 3.763
## 9             Canada 7.278          1.365 3.721
## 18           Belgium 6.923          1.356 3.333
## 1            Finland 7.769          1.340 3.715
## 15    United Kingdom 7.054          1.333 3.610
## 8        New Zealand 7.307          1.303 3.878
## 13            Israel 7.139          1.276 3.198
## 20    Czech Republic 6.852          1.269 2.946
## 12        Costa Rica 7.167          1.034 3.199

After sorting out the data sets, Norway is in the top 3 for each category. Luxembourg, however tops the list with highest number of GDP per capita.

# Scores
ggplot(happy20_total, aes(x = "", y = Score, fill = Country.or.region)) +  geom_bar(stat="identity",position="dodge") + 
          xlab("Country or Region") + ylab("Scores")

# GDP
ggplot(happy20_total, aes(x = "", y = GDP.per.capita, fill = Country.or.region)) +  geom_bar(stat="identity",position="dodge") + 
          xlab("Country or Region") + ylab("GDP Per Capita")

# Total
ggplot(happy20_total, aes(x = "", y = Total, fill = Country.or.region)) +  
          geom_bar(stat="identity",position="dodge") + 
          xlab("Country or Region") + ylab("Total")

There is no correlation between the Happiness Score and the GDP per capita. Using the GDP per capita analysis on this data set is ineffective. In order for this data to become more accurate, we would need to consider one important variable: The classes of people. GDP per capita measures the quality of life, but we must take into consideration the classes of people (upper, lower, and middle class) and income levels.

Dataset 3

I used Jered Ataky’s dataset that he retrieved from an article called Students Performance in Exams on kaggle.com. I saved the csv file on my github page. We will check to see if there is a correlation between students performance and parental level of education.

# Upload data set

students<-read.csv("https://raw.githubusercontent.com/sagreen131/DATA-607-Project-2/main/StudentsPerformance.csv", header=TRUE, stringsAsFactors=FALSE)

head(students)
##   gender race.ethnicity parental.level.of.education        lunch
## 1 female        group B           bachelor's degree     standard
## 2 female        group C                some college     standard
## 3 female        group B             master's degree     standard
## 4   male        group A          associate's degree free/reduced
## 5   male        group C                some college     standard
## 6 female        group B          associate's degree     standard
##   test.preparation.course math.score reading.score writing.score
## 1                    none         72            72            74
## 2               completed         69            90            88
## 3                    none         90            95            93
## 4                    none         47            57            44
## 5                    none         76            78            75
## 6                    none         71            83            78

Let’s clean up and rename the columns for neatness!

names(students)[1] <- "Gender"
names(students)[2] <- "Race.Ethnicity"
names(students)[3] <- "Parental.LOE"
names(students)[4] <- "Lunch"
names(students)[5]<- "Test.Prep"
names(students)[6] <- "Math"
names(students)[7] <- "Reading"
names(students)[8] <- "Writing"

head(students)
##   Gender Race.Ethnicity       Parental.LOE        Lunch Test.Prep Math Reading
## 1 female        group B  bachelor's degree     standard      none   72      72
## 2 female        group C       some college     standard completed   69      90
## 3 female        group B    master's degree     standard      none   90      95
## 4   male        group A associate's degree free/reduced      none   47      57
## 5   male        group C       some college     standard      none   76      78
## 6 female        group B associate's degree     standard      none   71      83
##   Writing
## 1      74
## 2      88
## 3      93
## 4      44
## 5      75
## 6      78

Since there are one thousand observations with only 8 variables, it’s safe to say we can spread the data set some more.

students_wide<-spread(students, Parental.LOE, Test.Prep)
head(students_wide)
##   Gender Race.Ethnicity        Lunch Math Reading Writing associate's degree
## 1 female        group A free/reduced   34      48      41               <NA>
## 2 female        group A free/reduced   37      57      56               none
## 3 female        group A free/reduced   38      43      43               <NA>
## 4 female        group A free/reduced   41      51      48               none
## 5 female        group A free/reduced   44      45      45               <NA>
## 6 female        group A free/reduced   44      64      58               <NA>
##   bachelor's degree high school master's degree some college some high school
## 1              <NA>   completed            <NA>         <NA>             <NA>
## 2              <NA>        <NA>            <NA>         <NA>             <NA>
## 3              <NA>        <NA>            <NA>         <NA>             none
## 4              <NA>        <NA>            <NA>         <NA>             <NA>
## 5              <NA>        <NA>            <NA>         <NA>             none
## 6              <NA>        <NA>            <NA>         <NA>             none

Let’s sort out the scores for Math, Reading and Writing:

# Math 
students_M<-students[order(students$Math, decreasing = TRUE),]
head(students_M)
##     Gender Race.Ethnicity       Parental.LOE        Lunch Test.Prep Math
## 150   male        group E associate's degree free/reduced completed  100
## 452 female        group E       some college     standard      none  100
## 459 female        group E  bachelor's degree     standard      none  100
## 624   male        group A       some college     standard completed  100
## 626   male        group D       some college     standard completed  100
## 917   male        group E  bachelor's degree     standard completed  100
##     Reading Writing
## 150     100      93
## 452      92      97
## 459     100     100
## 624      96      86
## 626      97      99
## 917     100     100

By sorting out the data for Math, group E students have taken the top three spots here. Of the three, two are female. If we take a look at the 6th indivdual listed here, he has received 100 percent in all three categories!

# Reading
students_R<-students[order(students$Reading, decreasing = TRUE),]
head(students_R)
##     Gender Race.Ethnicity       Parental.LOE        Lunch Test.Prep Math
## 107 female        group D    master's degree     standard      none   87
## 115 female        group E  bachelor's degree     standard completed   99
## 150   male        group E associate's degree free/reduced completed  100
## 166 female        group C  bachelor's degree     standard completed   96
## 180 female        group D   some high school     standard completed   97
## 382   male        group C associate's degree     standard completed   87
##     Reading Writing
## 107     100     100
## 115     100     100
## 150     100      93
## 166     100     100
## 180     100     100
## 382     100      95

By sorting out the data for Reading, it is split between group C,D, and E. More females have taken the spots in this scenario. The student who is at the top of the list has a parent who has a Master’s Degree. This is the second case I came across when comparing all three subjects.

# Writing 
students_W<-students[order(students$Writing, decreasing = TRUE),]
head(students_W)
##     Gender Race.Ethnicity      Parental.LOE        Lunch Test.Prep Math Reading
## 107 female        group D   master's degree     standard      none   87     100
## 115 female        group E bachelor's degree     standard completed   99     100
## 166 female        group C bachelor's degree     standard completed   96     100
## 180 female        group D  some high school     standard completed   97     100
## 378 female        group D   master's degree free/reduced completed   85      95
## 404 female        group D       high school     standard completed   88      99
##     Writing
## 107     100
## 115     100
## 166     100
## 180     100
## 378     100
## 404     100

By sorting out the data for Writing, the females have dominated the first six spots. In comparison to Reading, the top student also has a parent who has obtained a Master’s Degree. In addition to this, this is the second student who did not have any Test Prep.

I will create a new column to to average the three subjects and attach it to the data frame.

students <- students %>%
    mutate(MRW.avg= (Math + Reading + Writing)/3)

head(students)
##   Gender Race.Ethnicity       Parental.LOE        Lunch Test.Prep Math Reading
## 1 female        group B  bachelor's degree     standard      none   72      72
## 2 female        group C       some college     standard completed   69      90
## 3 female        group B    master's degree     standard      none   90      95
## 4   male        group A associate's degree free/reduced      none   47      57
## 5   male        group C       some college     standard      none   76      78
## 6 female        group B associate's degree     standard      none   71      83
##   Writing  MRW.avg
## 1      74 72.66667
## 2      88 82.33333
## 3      93 92.66667
## 4      44 49.33333
## 5      75 76.33333
## 6      78 77.33333
students_avg<-select(students, Gender, Parental.LOE, Math, Reading, Writing, MRW.avg)
head(students_avg)
##   Gender       Parental.LOE Math Reading Writing  MRW.avg
## 1 female  bachelor's degree   72      72      74 72.66667
## 2 female       some college   69      90      88 82.33333
## 3 female    master's degree   90      95      93 92.66667
## 4   male associate's degree   47      57      44 49.33333
## 5   male       some college   76      78      75 76.33333
## 6 female associate's degree   71      83      78 77.33333
students_avg<-students_avg[order(students_avg$MRW.avg, decreasing = TRUE),]
head(students_avg)
##     Gender       Parental.LOE Math Reading Writing   MRW.avg
## 459 female  bachelor's degree  100     100     100 100.00000
## 917   male  bachelor's degree  100     100     100 100.00000
## 963 female associate's degree  100     100     100 100.00000
## 115 female  bachelor's degree   99     100     100  99.66667
## 180 female   some high school   97     100     100  99.00000
## 713 female       some college   98     100      99  99.00000
ggplot(students_avg, aes(x = "", y = MRW.avg, fill = Parental.LOE)) +  geom_bar(stat="identity",position="dodge") + 
          xlab("Parental Level of Education") + ylab("Average Scores")

After completing the analysis, there is correlation between student’s performance and parental education. There are several factors that may need to take place for a more thorough analysis.

  1. Demographics
  2. Single family vs. Double Family?
  3. One Parent vs. Two Parent’s Education levels

Githubhere

RPubs here