The goal of this assignment is to give you practice in preparing different datasets for downstream analysis work.
Your task is to:
Create a .CSV file (or optionally, a MySQL database!) that includes all of the information included in the dataset. You’re encouraged to use a “wide” structure similar to how the informatIon appears in the discussion item, so that you can practice tidying and transformations as described below.
Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data. [Most of your grade will be based on this step!]
Perform the analysis requested in the discussion item.
Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions.
The URL to the .Rmd file in your GitHub repository, and
The URL for your rpubs.com web page.
library(knitr)
library(stringr)
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
For the first dataset, I wanted to use the original file from FiveThirtyEight article Should Travelers Avoid Flying Airplanes That Have Had Crashes in the Past?. I chose this as my discussion post and I wanted to compare the number of incidents, fatal incidents, and fatalities. I wanted to see if there was a correlation between the time frame of airline safety over a span of thirty years.
# Upload data set
air_safe<-read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/airline-safety/airline-safety.csv", header = FALSE, stringsAsFactors = FALSE)
head(air_safe)
## V1 V2 V3
## 1 airline avail_seat_km_per_week incidents_85_99
## 2 Aer Lingus 320906734 2
## 3 Aeroflot* 1197672318 76
## 4 Aerolineas Argentinas 385803648 6
## 5 Aeromexico* 596871813 3
## 6 Air Canada 1865253802 2
## V4 V5 V6 V7
## 1 fatal_accidents_85_99 fatalities_85_99 incidents_00_14 fatal_accidents_00_14
## 2 0 0 0 0
## 3 14 128 6 1
## 4 0 0 1 0
## 5 1 64 5 0
## 6 0 0 2 0
## V8
## 1 fatalities_00_14
## 2 0
## 3 88
## 4 0
## 5 0
## 6 0
summary(air_safe)
## V1 V2 V3 V4
## Length:57 Length:57 Length:57 Length:57
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## V5 V6 V7 V8
## Length:57 Length:57 Length:57 Length:57
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
As we can see here, we need to do some data cleaning before doing our analysis.
# Renaming row header
names(air_safe) <- air_safe[1,]
head(air_safe)
## airline avail_seat_km_per_week incidents_85_99
## 1 airline avail_seat_km_per_week incidents_85_99
## 2 Aer Lingus 320906734 2
## 3 Aeroflot* 1197672318 76
## 4 Aerolineas Argentinas 385803648 6
## 5 Aeromexico* 596871813 3
## 6 Air Canada 1865253802 2
## fatal_accidents_85_99 fatalities_85_99 incidents_00_14 fatal_accidents_00_14
## 1 fatal_accidents_85_99 fatalities_85_99 incidents_00_14 fatal_accidents_00_14
## 2 0 0 0 0
## 3 14 128 6 1
## 4 0 0 1 0
## 5 1 64 5 0
## 6 0 0 2 0
## fatalities_00_14
## 1 fatalities_00_14
## 2 0
## 3 88
## 4 0
## 5 0
## 6 0
#Remove row 1
air_safe <- air_safe[-c(1),]
head(air_safe)
## airline avail_seat_km_per_week incidents_85_99
## 2 Aer Lingus 320906734 2
## 3 Aeroflot* 1197672318 76
## 4 Aerolineas Argentinas 385803648 6
## 5 Aeromexico* 596871813 3
## 6 Air Canada 1865253802 2
## 7 Air France 3004002661 14
## fatal_accidents_85_99 fatalities_85_99 incidents_00_14 fatal_accidents_00_14
## 2 0 0 0 0
## 3 14 128 6 1
## 4 0 0 1 0
## 5 1 64 5 0
## 6 0 0 2 0
## 7 4 79 6 2
## fatalities_00_14
## 2 0
## 3 88
## 4 0
## 5 0
## 6 0
## 7 337
I noticed that in order for me to mutate and add additional variables, I have to convert the chr columns that are numeric into numeric.
# Convert from chr to numeric
air_safe$avail_seat_km_per_week <- as.numeric(as.character(air_safe$avail_seat_km_per_week))
air_safe$incidents_85_99<-as.numeric(as.character(air_safe$incidents_85_99))
air_safe$fatal_accidents_85_99<-as.numeric(as.character(air_safe$fatal_accidents_85_99))
air_safe$fatalities_85_99<-as.numeric(as.character(air_safe$fatalities_85_99))
air_safe$incidents_00_14<-as.numeric(as.character(air_safe$incidents_00_14))
air_safe$fatal_accidents_00_14<-as.numeric(as.character(air_safe$fatal_accidents_00_14))
air_safe$fatalities_00_14<-as.numeric(as.character(air_safe$fatalities_00_14))
head(air_safe)
## airline avail_seat_km_per_week incidents_85_99
## 2 Aer Lingus 320906734 2
## 3 Aeroflot* 1197672318 76
## 4 Aerolineas Argentinas 385803648 6
## 5 Aeromexico* 596871813 3
## 6 Air Canada 1865253802 2
## 7 Air France 3004002661 14
## fatal_accidents_85_99 fatalities_85_99 incidents_00_14 fatal_accidents_00_14
## 2 0 0 0 0
## 3 14 128 6 1
## 4 0 0 1 0
## 5 1 64 5 0
## 6 0 0 2 0
## 7 4 79 6 2
## fatalities_00_14
## 2 0
## 3 88
## 4 0
## 5 0
## 6 0
## 7 337
Let’s create a total tab that adds the total number of incidents, fatal accidents, and fatalities in 1985 to 1999:
air_safe <- air_safe %>%
mutate(total_85_99 = incidents_85_99 + fatal_accidents_85_99 + fatalities_85_99, total_00_14 = incidents_00_14 + fatal_accidents_00_14 + fatalities_00_14)
head(air_safe)
## airline avail_seat_km_per_week incidents_85_99
## 1 Aer Lingus 320906734 2
## 2 Aeroflot* 1197672318 76
## 3 Aerolineas Argentinas 385803648 6
## 4 Aeromexico* 596871813 3
## 5 Air Canada 1865253802 2
## 6 Air France 3004002661 14
## fatal_accidents_85_99 fatalities_85_99 incidents_00_14 fatal_accidents_00_14
## 1 0 0 0 0
## 2 14 128 6 1
## 3 0 0 1 0
## 4 1 64 5 0
## 5 0 0 2 0
## 6 4 79 6 2
## fatalities_00_14 total_85_99 total_00_14
## 1 0 2 0
## 2 88 218 95
## 3 0 6 1
## 4 0 68 5
## 5 0 2 2
## 6 337 97 345
I wanted to create a new dataset that focused only on the airline and their totals from 85-99 and 00-14
air_total<-select(air_safe, airline, total_85_99, total_00_14)
head(air_total)
## airline total_85_99 total_00_14
## 1 Aer Lingus 2 0
## 2 Aeroflot* 218 95
## 3 Aerolineas Argentinas 6 1
## 4 Aeromexico* 68 5
## 5 Air Canada 2 2
## 6 Air France 97 345
summary(air_total)
## airline total_85_99 total_00_14
## Length:56 Min. : 0.00 Min. : 0.0
## Class :character 1st Qu.: 4.75 1st Qu.: 1.0
## Mode :character Median : 55.00 Median : 5.0
## Mean :121.77 Mean : 60.3
## 3rd Qu.:221.50 3rd Qu.: 88.0
## Max. :553.00 Max. :542.0
I used the spread function that I created for air_total to display the Airline numbers in a wide data set.
# Wide from 85 to 99
air_wide85 <-spread(air_total, airline, total_85_99)
head(air_wide85)
## total_00_14 Aer Lingus Aeroflot* Aerolineas Argentinas Aeromexico* Air Canada
## 1 0 2 NA NA NA NA
## 2 1 NA NA 6 NA NA
## 3 2 NA NA NA NA 2
## 4 3 NA NA NA NA NA
## 5 4 NA NA NA NA NA
## 6 5 NA NA NA 68 NA
## Air France Air India* Air New Zealand* Alaska Airlines* Alitalia
## 1 NA NA NA NA NA
## 2 NA NA NA NA NA
## 3 NA NA NA NA NA
## 4 NA NA NA NA NA
## 5 NA NA NA NA 59
## 6 NA NA NA NA NA
## All Nippon Airways American* Austrian Airlines Avianca British Airways*
## 1 NA NA NA 331 NA
## 2 NA NA 1 NA NA
## 3 NA NA NA NA NA
## 4 NA NA NA NA NA
## 5 NA NA NA NA NA
## 6 NA NA NA NA NA
## Cathay Pacific* China Airlines Condor COPA Delta / Northwest* Egyptair El Al
## 1 NA NA 19 51 NA NA NA
## 2 NA NA NA NA NA NA 6
## 3 0 NA NA NA NA NA NA
## 4 NA NA NA NA NA NA NA
## 5 NA NA NA NA NA NA NA
## 6 NA NA NA NA NA NA NA
## Ethiopian Airlines Finnair Garuda Indonesia Gulf Air Hawaiian Airlines Iberia
## 1 NA 1 NA NA NA NA
## 2 NA NA NA NA 0 NA
## 3 NA NA NA NA NA NA
## 4 NA NA NA NA NA NA
## 5 NA NA NA NA NA NA
## 6 NA NA NA NA NA 153
## Japan Airlines Kenya Airways KLM* Korean Air LAN Airlines Lufthansa*
## 1 524 NA NA NA 26 NA
## 2 NA NA 11 442 NA NA
## 3 NA NA NA NA NA NA
## 4 NA NA NA NA NA 9
## 5 NA NA NA NA NA NA
## 6 NA NA NA NA NA NA
## Malaysia Airlines Pakistan International Philippine Airlines Qantas*
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA 85 NA
## 6 NA NA NA 1
## Royal Air Maroc SAS* Saudi Arabian Singapore Airlines South African
## 1 NA NA NA NA NA
## 2 NA NA NA NA 162
## 3 NA NA NA NA NA
## 4 59 NA NA NA NA
## 5 NA NA NA NA NA
## 6 NA NA NA NA NA
## Southwest Airlines Sri Lankan / AirLanka SWISS* TACA TAM TAP - Air Portugal
## 1 NA NA NA NA NA 0
## 2 NA NA NA NA NA NA
## 3 NA NA NA NA NA NA
## 4 NA NA 232 NA NA NA
## 5 NA 17 NA NA NA NA
## 6 NA NA NA 7 NA NA
## Thai Airways Turkish Airlines United / Continental*
## 1 NA NA NA
## 2 NA NA NA
## 3 NA NA NA
## 4 NA NA NA
## 5 320 NA NA
## 6 NA NA NA
## US Airways / America West* Vietnam Airlines Virgin Atlantic Xiamen Airlines
## 1 NA NA 1 NA
## 2 NA 181 NA NA
## 3 NA NA NA 92
## 4 NA NA NA NA
## 5 NA NA NA NA
## 6 NA NA NA NA
summary(air_wide85)
## total_00_14 Aer Lingus Aeroflot* Aerolineas Argentinas Aeromexico*
## Min. : 0.00 Min. :2 Min. :218 Min. :6 Min. :68
## 1st Qu.: 7.25 1st Qu.:2 1st Qu.:218 1st Qu.:6 1st Qu.:68
## Median : 67.50 Median :2 Median :218 Median :6 Median :68
## Mean :107.97 Mean :2 Mean :218 Mean :6 Mean :68
## 3rd Qu.:141.50 3rd Qu.:2 3rd Qu.:218 3rd Qu.:6 3rd Qu.:68
## Max. :542.00 Max. :2 Max. :218 Max. :6 Max. :68
## NA's :29 NA's :29 NA's :29 NA's :29
## Air Canada Air France Air India* Air New Zealand* Alaska Airlines*
## Min. :2 Min. :97 Min. :332 Min. :3 Min. :5
## 1st Qu.:2 1st Qu.:97 1st Qu.:332 1st Qu.:3 1st Qu.:5
## Median :2 Median :97 Median :332 Median :3 Median :5
## Mean :2 Mean :97 Mean :332 Mean :3 Mean :5
## 3rd Qu.:2 3rd Qu.:97 3rd Qu.:332 3rd Qu.:3 3rd Qu.:5
## Max. :2 Max. :97 Max. :332 Max. :3 Max. :5
## NA's :29 NA's :29 NA's :29 NA's :29 NA's :29
## Alitalia All Nippon Airways American* Austrian Airlines Avianca
## Min. :59 Min. :5 Min. :127 Min. :1 Min. :331
## 1st Qu.:59 1st Qu.:5 1st Qu.:127 1st Qu.:1 1st Qu.:331
## Median :59 Median :5 Median :127 Median :1 Median :331
## Mean :59 Mean :5 Mean :127 Mean :1 Mean :331
## 3rd Qu.:59 3rd Qu.:5 3rd Qu.:127 3rd Qu.:1 3rd Qu.:331
## Max. :59 Max. :5 Max. :127 Max. :1 Max. :331
## NA's :29 NA's :29 NA's :29 NA's :29 NA's :29
## British Airways* Cathay Pacific* China Airlines Condor COPA
## Min. :4 Min. :0 Min. :553 Min. :19 Min. :51
## 1st Qu.:4 1st Qu.:0 1st Qu.:553 1st Qu.:19 1st Qu.:51
## Median :4 Median :0 Median :553 Median :19 Median :51
## Mean :4 Mean :0 Mean :553 Mean :19 Mean :51
## 3rd Qu.:4 3rd Qu.:0 3rd Qu.:553 3rd Qu.:19 3rd Qu.:51
## Max. :4 Max. :0 Max. :553 Max. :19 Max. :51
## NA's :29 NA's :29 NA's :29 NA's :29 NA's :29
## Delta / Northwest* Egyptair El Al Ethiopian Airlines Finnair
## Min. :443 Min. :293 Min. :6 Min. :197 Min. :1
## 1st Qu.:443 1st Qu.:293 1st Qu.:6 1st Qu.:197 1st Qu.:1
## Median :443 Median :293 Median :6 Median :197 Median :1
## Mean :443 Mean :293 Mean :6 Mean :197 Mean :1
## 3rd Qu.:443 3rd Qu.:293 3rd Qu.:6 3rd Qu.:197 3rd Qu.:1
## Max. :443 Max. :293 Max. :6 Max. :197 Max. :1
## NA's :29 NA's :29 NA's :29 NA's :29 NA's :29
## Garuda Indonesia Gulf Air Hawaiian Airlines Iberia Japan Airlines
## Min. :273 Min. :1 Min. :0 Min. :153 Min. :524
## 1st Qu.:273 1st Qu.:1 1st Qu.:0 1st Qu.:153 1st Qu.:524
## Median :273 Median :1 Median :0 Median :153 Median :524
## Mean :273 Mean :1 Mean :0 Mean :153 Mean :524
## 3rd Qu.:273 3rd Qu.:1 3rd Qu.:0 3rd Qu.:153 3rd Qu.:524
## Max. :273 Max. :1 Max. :0 Max. :153 Max. :524
## NA's :29 NA's :29 NA's :29 NA's :29 NA's :29
## Kenya Airways KLM* Korean Air LAN Airlines Lufthansa*
## Min. :2 Min. :11 Min. :442 Min. :26 Min. :9
## 1st Qu.:2 1st Qu.:11 1st Qu.:442 1st Qu.:26 1st Qu.:9
## Median :2 Median :11 Median :442 Median :26 Median :9
## Mean :2 Mean :11 Mean :442 Mean :26 Mean :9
## 3rd Qu.:2 3rd Qu.:11 3rd Qu.:442 3rd Qu.:26 3rd Qu.:9
## Max. :2 Max. :11 Max. :442 Max. :26 Max. :9
## NA's :29 NA's :29 NA's :29 NA's :29 NA's :29
## Malaysia Airlines Pakistan International Philippine Airlines Qantas*
## Min. :38 Min. :245 Min. :85 Min. :1
## 1st Qu.:38 1st Qu.:245 1st Qu.:85 1st Qu.:1
## Median :38 Median :245 Median :85 Median :1
## Mean :38 Mean :245 Mean :85 Mean :1
## 3rd Qu.:38 3rd Qu.:245 3rd Qu.:85 3rd Qu.:1
## Max. :38 Max. :245 Max. :85 Max. :1
## NA's :29 NA's :29 NA's :29 NA's :29
## Royal Air Maroc SAS* Saudi Arabian Singapore Airlines South African
## Min. :59 Min. :5 Min. :322 Min. :10 Min. :162
## 1st Qu.:59 1st Qu.:5 1st Qu.:322 1st Qu.:10 1st Qu.:162
## Median :59 Median :5 Median :322 Median :10 Median :162
## Mean :59 Mean :5 Mean :322 Mean :10 Mean :162
## 3rd Qu.:59 3rd Qu.:5 3rd Qu.:322 3rd Qu.:10 3rd Qu.:162
## Max. :59 Max. :5 Max. :322 Max. :10 Max. :162
## NA's :29 NA's :29 NA's :29 NA's :29 NA's :29
## Southwest Airlines Sri Lankan / AirLanka SWISS* TACA
## Min. :1 Min. :17 Min. :232 Min. :7
## 1st Qu.:1 1st Qu.:17 1st Qu.:232 1st Qu.:7
## Median :1 Median :17 Median :232 Median :7
## Mean :1 Mean :17 Mean :232 Mean :7
## 3rd Qu.:1 3rd Qu.:17 3rd Qu.:232 3rd Qu.:7
## Max. :1 Max. :17 Max. :232 Max. :7
## NA's :29 NA's :29 NA's :29 NA's :29
## TAM TAP - Air Portugal Thai Airways Turkish Airlines
## Min. :109 Min. :0 Min. :320 Min. :75
## 1st Qu.:109 1st Qu.:0 1st Qu.:320 1st Qu.:75
## Median :109 Median :0 Median :320 Median :75
## Mean :109 Mean :0 Mean :320 Mean :75
## 3rd Qu.:109 3rd Qu.:0 3rd Qu.:320 3rd Qu.:75
## Max. :109 Max. :0 Max. :320 Max. :75
## NA's :29 NA's :29 NA's :29 NA's :29
## United / Continental* US Airways / America West* Vietnam Airlines
## Min. :346 Min. :247 Min. :181
## 1st Qu.:346 1st Qu.:247 1st Qu.:181
## Median :346 Median :247 Median :181
## Mean :346 Mean :247 Mean :181
## 3rd Qu.:346 3rd Qu.:247 3rd Qu.:181
## Max. :346 Max. :247 Max. :181
## NA's :29 NA's :29 NA's :29
## Virgin Atlantic Xiamen Airlines
## Min. :1 Min. :92
## 1st Qu.:1 1st Qu.:92
## Median :1 Median :92
## Mean :1 Mean :92
## 3rd Qu.:1 3rd Qu.:92
## Max. :1 Max. :92
## NA's :29 NA's :29
# Wide from 85 to 99
air_wide00 <-spread(air_total, airline, total_00_14)
head(air_wide00)
## total_85_99 Aer Lingus Aeroflot* Aerolineas Argentinas Aeromexico* Air Canada
## 1 0 NA NA NA NA NA
## 2 1 NA NA NA NA NA
## 3 2 0 NA NA NA 2
## 4 3 NA NA NA NA NA
## 5 4 NA NA NA NA NA
## 6 5 NA NA NA NA NA
## Air France Air India* Air New Zealand* Alaska Airlines* Alitalia
## 1 NA NA NA NA NA
## 2 NA NA NA NA NA
## 3 NA NA NA NA NA
## 4 NA NA 13 NA NA
## 5 NA NA NA NA NA
## 6 NA NA NA 94 NA
## All Nippon Airways American* Austrian Airlines Avianca British Airways*
## 1 NA NA NA NA NA
## 2 NA NA 1 NA NA
## 3 NA NA NA NA NA
## 4 NA NA NA NA NA
## 5 NA NA NA NA 6
## 6 7 NA NA NA NA
## Cathay Pacific* China Airlines Condor COPA Delta / Northwest* Egyptair El Al
## 1 2 NA NA NA NA NA NA
## 2 NA NA NA NA NA NA NA
## 3 NA NA NA NA NA NA NA
## 4 NA NA NA NA NA NA NA
## 5 NA NA NA NA NA NA NA
## 6 NA NA NA NA NA NA NA
## Ethiopian Airlines Finnair Garuda Indonesia Gulf Air Hawaiian Airlines Iberia
## 1 NA NA NA NA 1 NA
## 2 NA 0 NA 147 NA NA
## 3 NA NA NA NA NA NA
## 4 NA NA NA NA NA NA
## 5 NA NA NA NA NA NA
## 6 NA NA NA NA NA NA
## Japan Airlines Kenya Airways KLM* Korean Air LAN Airlines Lufthansa*
## 1 NA NA NA NA NA NA
## 2 NA NA NA NA NA NA
## 3 NA 287 NA NA NA NA
## 4 NA NA NA NA NA NA
## 5 NA NA NA NA NA NA
## 6 NA NA NA NA NA NA
## Malaysia Airlines Pakistan International Philippine Airlines Qantas*
## 1 NA NA NA NA
## 2 NA NA NA 5
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA NA NA
## 6 NA NA NA NA
## Royal Air Maroc SAS* Saudi Arabian Singapore Airlines South African
## 1 NA NA NA NA NA
## 2 NA NA NA NA NA
## 3 NA NA NA NA NA
## 4 NA NA NA NA NA
## 5 NA NA NA NA NA
## 6 NA 117 NA NA NA
## Southwest Airlines Sri Lankan / AirLanka SWISS* TACA TAM TAP - Air Portugal
## 1 NA NA NA NA NA 0
## 2 8 NA NA NA NA NA
## 3 NA NA NA NA NA NA
## 4 NA NA NA NA NA NA
## 5 NA NA NA NA NA NA
## 6 NA NA NA NA NA NA
## Thai Airways Turkish Airlines United / Continental*
## 1 NA NA NA
## 2 NA NA NA
## 3 NA NA NA
## 4 NA NA NA
## 5 NA NA NA
## 6 NA NA NA
## US Airways / America West* Vietnam Airlines Virgin Atlantic Xiamen Airlines
## 1 NA NA NA NA
## 2 NA NA 0 NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA NA NA
## 6 NA NA NA NA
summary(air_wide00)
## total_85_99 Aer Lingus Aeroflot* Aerolineas Argentinas Aeromexico*
## Min. : 0.0 Min. :0 Min. :95 Min. :1 Min. :5
## 1st Qu.: 14.0 1st Qu.:0 1st Qu.:95 1st Qu.:1 1st Qu.:5
## Median : 97.0 Median :0 Median :95 Median :1 Median :5
## Mean :156.6 Mean :0 Mean :95 Mean :1 Mean :5
## 3rd Qu.:260.0 3rd Qu.:0 3rd Qu.:95 3rd Qu.:1 3rd Qu.:5
## Max. :553.0 Max. :0 Max. :95 Max. :1 Max. :5
## NA's :42 NA's :42 NA's :42 NA's :42
## Air Canada Air France Air India* Air New Zealand* Alaska Airlines*
## Min. :2 Min. :345 Min. :163 Min. :13 Min. :94
## 1st Qu.:2 1st Qu.:345 1st Qu.:163 1st Qu.:13 1st Qu.:94
## Median :2 Median :345 Median :163 Median :13 Median :94
## Mean :2 Mean :345 Mean :163 Mean :13 Mean :94
## 3rd Qu.:2 3rd Qu.:345 3rd Qu.:163 3rd Qu.:13 3rd Qu.:94
## Max. :2 Max. :345 Max. :163 Max. :13 Max. :94
## NA's :42 NA's :42 NA's :42 NA's :42 NA's :42
## Alitalia All Nippon Airways American* Austrian Airlines Avianca
## Min. :4 Min. :7 Min. :436 Min. :1 Min. :0
## 1st Qu.:4 1st Qu.:7 1st Qu.:436 1st Qu.:1 1st Qu.:0
## Median :4 Median :7 Median :436 Median :1 Median :0
## Mean :4 Mean :7 Mean :436 Mean :1 Mean :0
## 3rd Qu.:4 3rd Qu.:7 3rd Qu.:436 3rd Qu.:1 3rd Qu.:0
## Max. :4 Max. :7 Max. :436 Max. :1 Max. :0
## NA's :42 NA's :42 NA's :42 NA's :42 NA's :42
## British Airways* Cathay Pacific* China Airlines Condor COPA
## Min. :6 Min. :2 Min. :228 Min. :0 Min. :0
## 1st Qu.:6 1st Qu.:2 1st Qu.:228 1st Qu.:0 1st Qu.:0
## Median :6 Median :2 Median :228 Median :0 Median :0
## Mean :6 Mean :2 Mean :228 Mean :0 Mean :0
## 3rd Qu.:6 3rd Qu.:2 3rd Qu.:228 3rd Qu.:0 3rd Qu.:0
## Max. :6 Max. :2 Max. :228 Max. :0 Max. :0
## NA's :42 NA's :42 NA's :42 NA's :42 NA's :42
## Delta / Northwest* Egyptair El Al Ethiopian Airlines Finnair
## Min. :77 Min. :19 Min. :1 Min. :99 Min. :0
## 1st Qu.:77 1st Qu.:19 1st Qu.:1 1st Qu.:99 1st Qu.:0
## Median :77 Median :19 Median :1 Median :99 Median :0
## Mean :77 Mean :19 Mean :1 Mean :99 Mean :0
## 3rd Qu.:77 3rd Qu.:19 3rd Qu.:1 3rd Qu.:99 3rd Qu.:0
## Max. :77 Max. :19 Max. :1 Max. :99 Max. :0
## NA's :42 NA's :42 NA's :42 NA's :42 NA's :42
## Garuda Indonesia Gulf Air Hawaiian Airlines Iberia Japan Airlines
## Min. :28 Min. :147 Min. :1 Min. :5 Min. :0
## 1st Qu.:28 1st Qu.:147 1st Qu.:1 1st Qu.:5 1st Qu.:0
## Median :28 Median :147 Median :1 Median :5 Median :0
## Mean :28 Mean :147 Mean :1 Mean :5 Mean :0
## 3rd Qu.:28 3rd Qu.:147 3rd Qu.:1 3rd Qu.:5 3rd Qu.:0
## Max. :28 Max. :147 Max. :1 Max. :5 Max. :0
## NA's :42 NA's :42 NA's :42 NA's :42 NA's :42
## Kenya Airways KLM* Korean Air LAN Airlines Lufthansa*
## Min. :287 Min. :1 Min. :1 Min. :0 Min. :3
## 1st Qu.:287 1st Qu.:1 1st Qu.:1 1st Qu.:0 1st Qu.:3
## Median :287 Median :1 Median :1 Median :0 Median :3
## Mean :287 Mean :1 Mean :1 Mean :0 Mean :3
## 3rd Qu.:287 3rd Qu.:1 3rd Qu.:1 3rd Qu.:0 3rd Qu.:3
## Max. :287 Max. :1 Max. :1 Max. :0 Max. :3
## NA's :42 NA's :42 NA's :42 NA's :42 NA's :42
## Malaysia Airlines Pakistan International Philippine Airlines Qantas*
## Min. :542 Min. :58 Min. :4 Min. :5
## 1st Qu.:542 1st Qu.:58 1st Qu.:4 1st Qu.:5
## Median :542 Median :58 Median :4 Median :5
## Mean :542 Mean :58 Mean :4 Mean :5
## 3rd Qu.:542 3rd Qu.:58 3rd Qu.:4 3rd Qu.:5
## Max. :542 Max. :58 Max. :4 Max. :5
## NA's :42 NA's :42 NA's :42 NA's :42
## Royal Air Maroc SAS* Saudi Arabian Singapore Airlines South African
## Min. :3 Min. :117 Min. :11 Min. :86 Min. :1
## 1st Qu.:3 1st Qu.:117 1st Qu.:11 1st Qu.:86 1st Qu.:1
## Median :3 Median :117 Median :11 Median :86 Median :1
## Mean :3 Mean :117 Mean :11 Mean :86 Mean :1
## 3rd Qu.:3 3rd Qu.:117 3rd Qu.:11 3rd Qu.:86 3rd Qu.:1
## Max. :3 Max. :117 Max. :11 Max. :86 Max. :1
## NA's :42 NA's :42 NA's :42 NA's :42 NA's :42
## Southwest Airlines Sri Lankan / AirLanka SWISS* TACA
## Min. :8 Min. :4 Min. :3 Min. :5
## 1st Qu.:8 1st Qu.:4 1st Qu.:3 1st Qu.:5
## Median :8 Median :4 Median :3 Median :5
## Mean :8 Mean :4 Mean :3 Mean :5
## 3rd Qu.:8 3rd Qu.:4 3rd Qu.:3 3rd Qu.:5
## Max. :8 Max. :4 Max. :3 Max. :5
## NA's :42 NA's :42 NA's :42 NA's :42
## TAM TAP - Air Portugal Thai Airways Turkish Airlines
## Min. :197 Min. :0 Min. :4 Min. :94
## 1st Qu.:197 1st Qu.:0 1st Qu.:4 1st Qu.:94
## Median :197 Median :0 Median :4 Median :94
## Mean :197 Mean :0 Mean :4 Mean :94
## 3rd Qu.:197 3rd Qu.:0 3rd Qu.:4 3rd Qu.:94
## Max. :197 Max. :0 Max. :4 Max. :94
## NA's :42 NA's :42 NA's :42 NA's :42
## United / Continental* US Airways / America West* Vietnam Airlines
## Min. :125 Min. :36 Min. :1
## 1st Qu.:125 1st Qu.:36 1st Qu.:1
## Median :125 Median :36 Median :1
## Mean :125 Mean :36 Mean :1
## 3rd Qu.:125 3rd Qu.:36 3rd Qu.:1
## Max. :125 Max. :36 Max. :1
## NA's :42 NA's :42 NA's :42
## Virgin Atlantic Xiamen Airlines
## Min. :0 Min. :2
## 1st Qu.:0 1st Qu.:2
## Median :0 Median :2
## Mean :0 Mean :2
## 3rd Qu.:0 3rd Qu.:2
## Max. :0 Max. :2
## NA's :42 NA's :42
I wanted to sort who had the most incidents, fatal accidents, and fatalities between 1985 and 1999.
air_max85<-air_total[order(air_total$total_85_99, decreasing = TRUE),]
head(air_max85)
## airline total_85_99 total_00_14
## 17 China Airlines 553 228
## 29 Japan Airlines 524 0
## 20 Delta / Northwest* 443 77
## 32 Korean Air 442 1
## 52 United / Continental* 346 125
## 7 Air India* 332 163
summary(air_max85)
## airline total_85_99 total_00_14
## Length:56 Min. : 0.00 Min. : 0.0
## Class :character 1st Qu.: 4.75 1st Qu.: 1.0
## Mode :character Median : 55.00 Median : 5.0
## Mean :121.77 Mean : 60.3
## 3rd Qu.:221.50 3rd Qu.: 88.0
## Max. :553.00 Max. :542.0
According to the data, China Airlines had the most combined incidents, fatal incidents, and fatalities from 1985 to 1999.
air_max00<-air_total[order(air_total$total_00_14, decreasing = TRUE),]
head(air_max00)
## airline total_85_99 total_00_14
## 35 Malaysia Airlines 38 542
## 12 American* 127 436
## 6 Air France 97 345
## 30 Kenya Airways 2 287
## 17 China Airlines 553 228
## 48 TAM 109 197
On the other hand, Malaysia Airlines had the most combined incidents, fatal incidents, and fatalities from 2000 to 2014. China Airlines, however is in the top 5. This shows very little improvement over the span of 30 years.
ggplot(air_total, aes(x = "", y = total_85_99, fill = airline)) +
geom_bar(stat="identity",position="dodge") +
xlab("Airlines") + ylab("Incidents, Fatal Incidents, & Fatalities 1985 to 1999")
ggplot(air_total, aes(x = "", y = total_00_14, fill = airline)) +
geom_bar(stat="identity",position="dodge") +
xlab("Airlines") + ylab("Incidents, Fatal Incidents, & Fatalities 2000 to 2014")
I highly recommend looking and introducing flight times because the time of the day plays a major role in incidents, fatal incidents, and fatalities. In addition to this, I also recommend including the types of disasters that have occurred.
I chose Magnus Skonberg’s dataset that he retrieved from the World Happiness Report on kaggle.com. I saved the csv file on my github page. We will compare the Happiness Score and GDP per capita for the Top 20 countries and see if there is a correlation between the two.
#Upload data set
happy<- read.csv("https://raw.githubusercontent.com/sagreen131/DATA-607-Project-2/main/2019.csv", header = TRUE, stringsAsFactors = FALSE)
head(happy)
## Overall.rank Country.or.region Score GDP.per.capita Social.support
## 1 1 Finland 7.769 1.340 1.587
## 2 2 Denmark 7.600 1.383 1.573
## 3 3 Norway 7.554 1.488 1.582
## 4 4 Iceland 7.494 1.380 1.624
## 5 5 Netherlands 7.488 1.396 1.522
## 6 6 Switzerland 7.480 1.452 1.526
## Healthy.life.expectancy Freedom.to.make.life.choices Generosity
## 1 0.986 0.596 0.153
## 2 0.996 0.592 0.252
## 3 1.028 0.603 0.271
## 4 1.026 0.591 0.354
## 5 0.999 0.557 0.322
## 6 1.052 0.572 0.263
## Perceptions.of.corruption
## 1 0.393
## 2 0.410
## 3 0.341
## 4 0.118
## 5 0.298
## 6 0.343
summary(happy)
## Overall.rank Country.or.region Score GDP.per.capita
## Min. : 1.00 Length:156 Min. :2.853 Min. :0.0000
## 1st Qu.: 39.75 Class :character 1st Qu.:4.545 1st Qu.:0.6028
## Median : 78.50 Mode :character Median :5.380 Median :0.9600
## Mean : 78.50 Mean :5.407 Mean :0.9051
## 3rd Qu.:117.25 3rd Qu.:6.184 3rd Qu.:1.2325
## Max. :156.00 Max. :7.769 Max. :1.6840
## Social.support Healthy.life.expectancy Freedom.to.make.life.choices
## Min. :0.000 Min. :0.0000 Min. :0.0000
## 1st Qu.:1.056 1st Qu.:0.5477 1st Qu.:0.3080
## Median :1.272 Median :0.7890 Median :0.4170
## Mean :1.209 Mean :0.7252 Mean :0.3926
## 3rd Qu.:1.452 3rd Qu.:0.8818 3rd Qu.:0.5072
## Max. :1.624 Max. :1.1410 Max. :0.6310
## Generosity Perceptions.of.corruption
## Min. :0.0000 Min. :0.0000
## 1st Qu.:0.1087 1st Qu.:0.0470
## Median :0.1775 Median :0.0855
## Mean :0.1848 Mean :0.1106
## 3rd Qu.:0.2482 3rd Qu.:0.1412
## Max. :0.5660 Max. :0.4530
I will create a new data set and focus on the top 20 countries by overall rank.
happy20 <-happy[1:20,]
happy20
## Overall.rank Country.or.region Score GDP.per.capita Social.support
## 1 1 Finland 7.769 1.340 1.587
## 2 2 Denmark 7.600 1.383 1.573
## 3 3 Norway 7.554 1.488 1.582
## 4 4 Iceland 7.494 1.380 1.624
## 5 5 Netherlands 7.488 1.396 1.522
## 6 6 Switzerland 7.480 1.452 1.526
## 7 7 Sweden 7.343 1.387 1.487
## 8 8 New Zealand 7.307 1.303 1.557
## 9 9 Canada 7.278 1.365 1.505
## 10 10 Austria 7.246 1.376 1.475
## 11 11 Australia 7.228 1.372 1.548
## 12 12 Costa Rica 7.167 1.034 1.441
## 13 13 Israel 7.139 1.276 1.455
## 14 14 Luxembourg 7.090 1.609 1.479
## 15 15 United Kingdom 7.054 1.333 1.538
## 16 16 Ireland 7.021 1.499 1.553
## 17 17 Germany 6.985 1.373 1.454
## 18 18 Belgium 6.923 1.356 1.504
## 19 19 United States 6.892 1.433 1.457
## 20 20 Czech Republic 6.852 1.269 1.487
## Healthy.life.expectancy Freedom.to.make.life.choices Generosity
## 1 0.986 0.596 0.153
## 2 0.996 0.592 0.252
## 3 1.028 0.603 0.271
## 4 1.026 0.591 0.354
## 5 0.999 0.557 0.322
## 6 1.052 0.572 0.263
## 7 1.009 0.574 0.267
## 8 1.026 0.585 0.330
## 9 1.039 0.584 0.285
## 10 1.016 0.532 0.244
## 11 1.036 0.557 0.332
## 12 0.963 0.558 0.144
## 13 1.029 0.371 0.261
## 14 1.012 0.526 0.194
## 15 0.996 0.450 0.348
## 16 0.999 0.516 0.298
## 17 0.987 0.495 0.261
## 18 0.986 0.473 0.160
## 19 0.874 0.454 0.280
## 20 0.920 0.457 0.046
## Perceptions.of.corruption
## 1 0.393
## 2 0.410
## 3 0.341
## 4 0.118
## 5 0.298
## 6 0.343
## 7 0.373
## 8 0.380
## 9 0.308
## 10 0.226
## 11 0.290
## 12 0.093
## 13 0.082
## 14 0.316
## 15 0.278
## 16 0.310
## 17 0.265
## 18 0.210
## 19 0.128
## 20 0.036
summary(happy20)
## Overall.rank Country.or.region Score GDP.per.capita
## Min. : 1.00 Length:20 Min. :6.852 Min. :1.034
## 1st Qu.: 5.75 Class :character 1st Qu.:7.046 1st Qu.:1.338
## Median :10.50 Mode :character Median :7.237 Median :1.375
## Mean :10.50 Mean :7.245 Mean :1.371
## 3rd Qu.:15.25 3rd Qu.:7.482 3rd Qu.:1.405
## Max. :20.00 Max. :7.769 Max. :1.609
## Social.support Healthy.life.expectancy Freedom.to.make.life.choices
## Min. :1.441 Min. :0.8740 Min. :0.3710
## 1st Qu.:1.478 1st Qu.:0.9868 1st Qu.:0.4895
## Median :1.514 Median :1.0040 Median :0.5570
## Mean :1.518 Mean :0.9990 Mean :0.5322
## 3rd Qu.:1.554 3rd Qu.:1.0265 3rd Qu.:0.5843
## Max. :1.624 Max. :1.0520 Max. :0.6030
## Generosity Perceptions.of.corruption
## Min. :0.0460 Min. :0.0360
## 1st Qu.:0.2315 1st Qu.:0.1895
## Median :0.2650 Median :0.2940
## Mean :0.2532 Mean :0.2599
## 3rd Qu.:0.3040 3rd Qu.:0.3415
## Max. :0.3540 Max. :0.4100
Upon reviewing the new data set, I wanted to first compare some of the variables in the data set.
freedom<-happy20[order(happy20$Freedom.to.make.life.choices, decreasing = TRUE),]
head(freedom)
## Overall.rank Country.or.region Score GDP.per.capita Social.support
## 3 3 Norway 7.554 1.488 1.582
## 1 1 Finland 7.769 1.340 1.587
## 2 2 Denmark 7.600 1.383 1.573
## 4 4 Iceland 7.494 1.380 1.624
## 8 8 New Zealand 7.307 1.303 1.557
## 9 9 Canada 7.278 1.365 1.505
## Healthy.life.expectancy Freedom.to.make.life.choices Generosity
## 3 1.028 0.603 0.271
## 1 0.986 0.596 0.153
## 2 0.996 0.592 0.252
## 4 1.026 0.591 0.354
## 8 1.026 0.585 0.330
## 9 1.039 0.584 0.285
## Perceptions.of.corruption
## 3 0.341
## 1 0.393
## 2 0.410
## 4 0.118
## 8 0.380
## 9 0.308
corruption<-happy20[order(happy20$Perceptions.of.corruption, decreasing = TRUE),]
head(corruption)
## Overall.rank Country.or.region Score GDP.per.capita Social.support
## 2 2 Denmark 7.600 1.383 1.573
## 1 1 Finland 7.769 1.340 1.587
## 8 8 New Zealand 7.307 1.303 1.557
## 7 7 Sweden 7.343 1.387 1.487
## 6 6 Switzerland 7.480 1.452 1.526
## 3 3 Norway 7.554 1.488 1.582
## Healthy.life.expectancy Freedom.to.make.life.choices Generosity
## 2 0.996 0.592 0.252
## 1 0.986 0.596 0.153
## 8 1.026 0.585 0.330
## 7 1.009 0.574 0.267
## 6 1.052 0.572 0.263
## 3 1.028 0.603 0.271
## Perceptions.of.corruption
## 2 0.410
## 1 0.393
## 8 0.380
## 7 0.373
## 6 0.343
## 3 0.341
healthy<-happy20[order(happy20$Healthy.life.expectancy, decreasing = TRUE),]
head(healthy)
## Overall.rank Country.or.region Score GDP.per.capita Social.support
## 6 6 Switzerland 7.480 1.452 1.526
## 9 9 Canada 7.278 1.365 1.505
## 11 11 Australia 7.228 1.372 1.548
## 13 13 Israel 7.139 1.276 1.455
## 3 3 Norway 7.554 1.488 1.582
## 4 4 Iceland 7.494 1.380 1.624
## Healthy.life.expectancy Freedom.to.make.life.choices Generosity
## 6 1.052 0.572 0.263
## 9 1.039 0.584 0.285
## 11 1.036 0.557 0.332
## 13 1.029 0.371 0.261
## 3 1.028 0.603 0.271
## 4 1.026 0.591 0.354
## Perceptions.of.corruption
## 6 0.343
## 9 0.308
## 11 0.290
## 13 0.082
## 3 0.341
## 4 0.118
generous<-happy20[order(happy20$Generosity, decreasing = TRUE),]
head(generous)
## Overall.rank Country.or.region Score GDP.per.capita Social.support
## 4 4 Iceland 7.494 1.380 1.624
## 15 15 United Kingdom 7.054 1.333 1.538
## 11 11 Australia 7.228 1.372 1.548
## 8 8 New Zealand 7.307 1.303 1.557
## 5 5 Netherlands 7.488 1.396 1.522
## 16 16 Ireland 7.021 1.499 1.553
## Healthy.life.expectancy Freedom.to.make.life.choices Generosity
## 4 1.026 0.591 0.354
## 15 0.996 0.450 0.348
## 11 1.036 0.557 0.332
## 8 1.026 0.585 0.330
## 5 0.999 0.557 0.322
## 16 0.999 0.516 0.298
## Perceptions.of.corruption
## 4 0.118
## 15 0.278
## 11 0.290
## 8 0.380
## 5 0.298
## 16 0.310
Upon reviewing the 4 variables, I noticed that Norway is in the top 10 in each category. Finland is listed in the top 10 in 2 out of 4 variables. Sweden is in 3 out of the 4 variables. United States is listed in only 1 of the 4 variables.
Let’s add up the column variable by creating a new column variable and then compare them to the Score and GDP per capita.
happy20 <- happy20 %>%
mutate(Total = Social.support + Healthy.life.expectancy + Freedom.to.make.life.choices + Generosity + Perceptions.of.corruption)
colnames(happy20)
## [1] "Overall.rank" "Country.or.region"
## [3] "Score" "GDP.per.capita"
## [5] "Social.support" "Healthy.life.expectancy"
## [7] "Freedom.to.make.life.choices" "Generosity"
## [9] "Perceptions.of.corruption" "Total"
happy20_total<-select(happy20, Country.or.region, Score, GDP.per.capita, Total)
head(happy20_total)
## Country.or.region Score GDP.per.capita Total
## 1 Finland 7.769 1.340 3.715
## 2 Denmark 7.600 1.383 3.823
## 3 Norway 7.554 1.488 3.825
## 4 Iceland 7.494 1.380 3.713
## 5 Netherlands 7.488 1.396 3.698
## 6 Switzerland 7.480 1.452 3.756
I noticed that although Finland is at the top of the list overall, the total variables column I created does not dictate that Finland has the highest overall score.
# Sort by Total
happy20_total[order(happy20_total$Total, decreasing = TRUE),]
## Country.or.region Score GDP.per.capita Total
## 8 New Zealand 7.307 1.303 3.878
## 3 Norway 7.554 1.488 3.825
## 2 Denmark 7.600 1.383 3.823
## 11 Australia 7.228 1.372 3.763
## 6 Switzerland 7.480 1.452 3.756
## 9 Canada 7.278 1.365 3.721
## 1 Finland 7.769 1.340 3.715
## 4 Iceland 7.494 1.380 3.713
## 7 Sweden 7.343 1.387 3.710
## 5 Netherlands 7.488 1.396 3.698
## 16 Ireland 7.021 1.499 3.676
## 15 United Kingdom 7.054 1.333 3.610
## 14 Luxembourg 7.090 1.609 3.527
## 10 Austria 7.246 1.376 3.493
## 17 Germany 6.985 1.373 3.462
## 18 Belgium 6.923 1.356 3.333
## 12 Costa Rica 7.167 1.034 3.199
## 13 Israel 7.139 1.276 3.198
## 19 United States 6.892 1.433 3.193
## 20 Czech Republic 6.852 1.269 2.946
# Sort by GDP
happy20_total[order(happy20_total$GDP.per.capita, decreasing = TRUE),]
## Country.or.region Score GDP.per.capita Total
## 14 Luxembourg 7.090 1.609 3.527
## 16 Ireland 7.021 1.499 3.676
## 3 Norway 7.554 1.488 3.825
## 6 Switzerland 7.480 1.452 3.756
## 19 United States 6.892 1.433 3.193
## 5 Netherlands 7.488 1.396 3.698
## 7 Sweden 7.343 1.387 3.710
## 2 Denmark 7.600 1.383 3.823
## 4 Iceland 7.494 1.380 3.713
## 10 Austria 7.246 1.376 3.493
## 17 Germany 6.985 1.373 3.462
## 11 Australia 7.228 1.372 3.763
## 9 Canada 7.278 1.365 3.721
## 18 Belgium 6.923 1.356 3.333
## 1 Finland 7.769 1.340 3.715
## 15 United Kingdom 7.054 1.333 3.610
## 8 New Zealand 7.307 1.303 3.878
## 13 Israel 7.139 1.276 3.198
## 20 Czech Republic 6.852 1.269 2.946
## 12 Costa Rica 7.167 1.034 3.199
After sorting out the data sets, Norway is in the top 3 for each category. Luxembourg, however tops the list with highest number of GDP per capita.
# Scores
ggplot(happy20_total, aes(x = "", y = Score, fill = Country.or.region)) + geom_bar(stat="identity",position="dodge") +
xlab("Country or Region") + ylab("Scores")
# GDP
ggplot(happy20_total, aes(x = "", y = GDP.per.capita, fill = Country.or.region)) + geom_bar(stat="identity",position="dodge") +
xlab("Country or Region") + ylab("GDP Per Capita")
# Total
ggplot(happy20_total, aes(x = "", y = Total, fill = Country.or.region)) +
geom_bar(stat="identity",position="dodge") +
xlab("Country or Region") + ylab("Total")
There is no correlation between the Happiness Score and the GDP per capita. Using the GDP per capita analysis on this data set is ineffective. In order for this data to become more accurate, we would need to consider one important variable: The classes of people. GDP per capita measures the quality of life, but we must take into consideration the classes of people (upper, lower, and middle class) and income levels.
I used Jered Ataky’s dataset that he retrieved from an article called Students Performance in Exams on kaggle.com. I saved the csv file on my github page. We will check to see if there is a correlation between students performance and parental level of education.
# Upload data set
students<-read.csv("https://raw.githubusercontent.com/sagreen131/DATA-607-Project-2/main/StudentsPerformance.csv", header=TRUE, stringsAsFactors=FALSE)
head(students)
## gender race.ethnicity parental.level.of.education lunch
## 1 female group B bachelor's degree standard
## 2 female group C some college standard
## 3 female group B master's degree standard
## 4 male group A associate's degree free/reduced
## 5 male group C some college standard
## 6 female group B associate's degree standard
## test.preparation.course math.score reading.score writing.score
## 1 none 72 72 74
## 2 completed 69 90 88
## 3 none 90 95 93
## 4 none 47 57 44
## 5 none 76 78 75
## 6 none 71 83 78
Let’s clean up and rename the columns for neatness!
names(students)[1] <- "Gender"
names(students)[2] <- "Race.Ethnicity"
names(students)[3] <- "Parental.LOE"
names(students)[4] <- "Lunch"
names(students)[5]<- "Test.Prep"
names(students)[6] <- "Math"
names(students)[7] <- "Reading"
names(students)[8] <- "Writing"
head(students)
## Gender Race.Ethnicity Parental.LOE Lunch Test.Prep Math Reading
## 1 female group B bachelor's degree standard none 72 72
## 2 female group C some college standard completed 69 90
## 3 female group B master's degree standard none 90 95
## 4 male group A associate's degree free/reduced none 47 57
## 5 male group C some college standard none 76 78
## 6 female group B associate's degree standard none 71 83
## Writing
## 1 74
## 2 88
## 3 93
## 4 44
## 5 75
## 6 78
Since there are one thousand observations with only 8 variables, it’s safe to say we can spread the data set some more.
students_wide<-spread(students, Parental.LOE, Test.Prep)
head(students_wide)
## Gender Race.Ethnicity Lunch Math Reading Writing associate's degree
## 1 female group A free/reduced 34 48 41 <NA>
## 2 female group A free/reduced 37 57 56 none
## 3 female group A free/reduced 38 43 43 <NA>
## 4 female group A free/reduced 41 51 48 none
## 5 female group A free/reduced 44 45 45 <NA>
## 6 female group A free/reduced 44 64 58 <NA>
## bachelor's degree high school master's degree some college some high school
## 1 <NA> completed <NA> <NA> <NA>
## 2 <NA> <NA> <NA> <NA> <NA>
## 3 <NA> <NA> <NA> <NA> none
## 4 <NA> <NA> <NA> <NA> <NA>
## 5 <NA> <NA> <NA> <NA> none
## 6 <NA> <NA> <NA> <NA> none
Let’s sort out the scores for Math, Reading and Writing:
# Math
students_M<-students[order(students$Math, decreasing = TRUE),]
head(students_M)
## Gender Race.Ethnicity Parental.LOE Lunch Test.Prep Math
## 150 male group E associate's degree free/reduced completed 100
## 452 female group E some college standard none 100
## 459 female group E bachelor's degree standard none 100
## 624 male group A some college standard completed 100
## 626 male group D some college standard completed 100
## 917 male group E bachelor's degree standard completed 100
## Reading Writing
## 150 100 93
## 452 92 97
## 459 100 100
## 624 96 86
## 626 97 99
## 917 100 100
By sorting out the data for Math, group E students have taken the top three spots here. Of the three, two are female. If we take a look at the 6th indivdual listed here, he has received 100 percent in all three categories!
# Reading
students_R<-students[order(students$Reading, decreasing = TRUE),]
head(students_R)
## Gender Race.Ethnicity Parental.LOE Lunch Test.Prep Math
## 107 female group D master's degree standard none 87
## 115 female group E bachelor's degree standard completed 99
## 150 male group E associate's degree free/reduced completed 100
## 166 female group C bachelor's degree standard completed 96
## 180 female group D some high school standard completed 97
## 382 male group C associate's degree standard completed 87
## Reading Writing
## 107 100 100
## 115 100 100
## 150 100 93
## 166 100 100
## 180 100 100
## 382 100 95
By sorting out the data for Reading, it is split between group C,D, and E. More females have taken the spots in this scenario. The student who is at the top of the list has a parent who has a Master’s Degree. This is the second case I came across when comparing all three subjects.
# Writing
students_W<-students[order(students$Writing, decreasing = TRUE),]
head(students_W)
## Gender Race.Ethnicity Parental.LOE Lunch Test.Prep Math Reading
## 107 female group D master's degree standard none 87 100
## 115 female group E bachelor's degree standard completed 99 100
## 166 female group C bachelor's degree standard completed 96 100
## 180 female group D some high school standard completed 97 100
## 378 female group D master's degree free/reduced completed 85 95
## 404 female group D high school standard completed 88 99
## Writing
## 107 100
## 115 100
## 166 100
## 180 100
## 378 100
## 404 100
By sorting out the data for Writing, the females have dominated the first six spots. In comparison to Reading, the top student also has a parent who has obtained a Master’s Degree. In addition to this, this is the second student who did not have any Test Prep.
I will create a new column to to average the three subjects and attach it to the data frame.
students <- students %>%
mutate(MRW.avg= (Math + Reading + Writing)/3)
head(students)
## Gender Race.Ethnicity Parental.LOE Lunch Test.Prep Math Reading
## 1 female group B bachelor's degree standard none 72 72
## 2 female group C some college standard completed 69 90
## 3 female group B master's degree standard none 90 95
## 4 male group A associate's degree free/reduced none 47 57
## 5 male group C some college standard none 76 78
## 6 female group B associate's degree standard none 71 83
## Writing MRW.avg
## 1 74 72.66667
## 2 88 82.33333
## 3 93 92.66667
## 4 44 49.33333
## 5 75 76.33333
## 6 78 77.33333
students_avg<-select(students, Gender, Parental.LOE, Math, Reading, Writing, MRW.avg)
head(students_avg)
## Gender Parental.LOE Math Reading Writing MRW.avg
## 1 female bachelor's degree 72 72 74 72.66667
## 2 female some college 69 90 88 82.33333
## 3 female master's degree 90 95 93 92.66667
## 4 male associate's degree 47 57 44 49.33333
## 5 male some college 76 78 75 76.33333
## 6 female associate's degree 71 83 78 77.33333
students_avg<-students_avg[order(students_avg$MRW.avg, decreasing = TRUE),]
head(students_avg)
## Gender Parental.LOE Math Reading Writing MRW.avg
## 459 female bachelor's degree 100 100 100 100.00000
## 917 male bachelor's degree 100 100 100 100.00000
## 963 female associate's degree 100 100 100 100.00000
## 115 female bachelor's degree 99 100 100 99.66667
## 180 female some high school 97 100 100 99.00000
## 713 female some college 98 100 99 99.00000
ggplot(students_avg, aes(x = "", y = MRW.avg, fill = Parental.LOE)) + geom_bar(stat="identity",position="dodge") +
xlab("Parental Level of Education") + ylab("Average Scores")
After completing the analysis, there is correlation between student’s performance and parental education. There are several factors that may need to take place for a more thorough analysis.
Githubhere
RPubs here