I used dplyr and tidyr, since they are required for this assignment, and ggplot2 to create a basic bar plot summarising the NBA win data.
require(dplyr)
require(tidyr)
require(ggplot2)
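A side note on loading packages: require() only returns FALSE with a warning when a package is missing, while library() stops with an error. If a hard stop is preferred, a minimal guard (a sketch, not part of the original workflow) would be:
# stop early if any required package is missing (sketch)
for (pkg in c("dplyr", "tidyr", "ggplot2")) {
  if (!requireNamespace(pkg, quietly = TRUE)) stop("missing package: ", pkg)
}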
The first data set that I chose to tidy and analyze was the “Basketball Wins” data provided by Daniel Brooks. For all three data sets, I used either the supplied .csv, or copy/pasted the data into Excel and then saved it as a .csv (I only used this method to save the file; all tidying was done in R). Each file was then loaded into my GitHub repository, and I read the data into R from the raw URL, as shown below.
# open file
path <- ("https://raw.githubusercontent.com/Logan213/DATA607_Project2/master/NBA_Wins.csv")
con <- file(path, open="r")
nba <- read.csv(con, header = TRUE, stringsAsFactors = FALSE)
# close file
close(con)
head(nba[1:20], 10)
## Rk Season Lg ATL BOS BRK CHI CHO CLE DAL DEN DET GSW HOU IND LAC LAL
## 1 1 2015-16 NBA 37 39 18 32 37 46 33 28 34 59 33 35 42 14
## 2 2 2014-15 NBA 60 40 38 50 33 53 50 30 32 67 56 38 56 21
## 3 3 2013-14 NBA 38 25 44 48 43 33 49 36 29 51 54 56 57 27
## 4 4 2012-13 NBA 44 41 49 45 21 24 41 57 29 47 45 49 56 45
## 5 5 2011-12 NBA 40 39 22 50 7 21 36 38 25 23 34 42 40 41
## 6 6 2010-11 NBA 44 56 24 62 34 19 57 50 30 36 43 37 32 57
## 7 7 2009-10 NBA 53 50 12 41 44 61 55 53 27 26 42 32 29 57
## 8 8 2008-09 NBA 47 62 34 41 35 66 50 54 39 29 53 36 19 65
## 9 9 2007-08 NBA 37 66 34 33 32 45 51 50 59 48 55 36 23 57
## 10 10 2006-07 NBA 30 24 41 49 33 50 67 45 53 42 52 35 40 42
## MEM MIA MIL
## 1 39 38 28
## 2 55 37 41
## 3 50 54 15
## 4 56 66 38
## 5 41 46 31
## 6 46 58 35
## 7 40 47 46
## 8 24 43 34
## 9 22 15 26
## 10 22 44 28
The NBA Wins data contained rows that were basically a repeat of the header, as well as a “rank” column that was really just the ordering of the NBA/ABA seasons. I removed these with the slice and select functions, respectively.
There was also a “Total” row for each team; since I was going to use dplyr to calculate any totals, I removed the rows containing this term. Lastly, I pivoted the data using the gather function to go from the wide format, with the teams spread across columns, to a nice, tidy long format. Since I would be doing calculations on the number of wins, I converted that column to numeric.
# remove repeated header rows and the "rank" column, then reshape
nba <- nba %>%
  slice(c(-21, -42, -63)) %>%     # drop the repeated header rows
  select(Season:WAS) %>%          # drop the "Rk" column
  filter(Season != "Total") %>%   # drop the "Total" rows
  gather(Team, Wins, ATL:WAS) %>% # pivot from wide to tidy long format
  filter(Wins != "")              # drop seasons where a team did not exist
# convert wins to numeric for later calculations
nba$Wins <- as.numeric(nba$Wins)
# all seasons, sorted by single-season wins
all_time <- arrange(nba, desc(Wins))
head(nba, 10)
## Season Lg Team Wins
## 1 2015-16 NBA ATL 37
## 2 2014-15 NBA ATL 60
## 3 2013-14 NBA ATL 38
## 4 2012-13 NBA ATL 44
## 5 2011-12 NBA ATL 40
## 6 2010-11 NBA ATL 44
## 7 2009-10 NBA ATL 53
## 8 2008-09 NBA ATL 47
## 9 2007-08 NBA ATL 37
## 10 2006-07 NBA ATL 30
For the analysis of the NBA data, I used “piping” (%>%) to make the code more readable. I also found that this method allows for quicker adjustments when tidying or arranging the data, since the rest of the code can be kept while one or two parameters are changed.
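For comparison, the tot_wins summary shown below, written without pipes, would nest the calls inside-out; a quick sketch of the equivalent:
# nested equivalent of the piped tot_wins summary (sketch):
# arrange(summarise(group_by(nba, Team), Total = sum(Wins)), desc(Total))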
Since we’re mainly dealing with teams, seasons, and the wins in each season, it was fairly simple to group the data by Team and then summarise the Wins.
tot_wins <- nba %>%
  group_by(Team) %>%
  summarise(Total = sum(Wins)) %>% # total franchise wins across all seasons
  arrange(desc(Total))
I then arranged by the total wins (a calculated column), just to show which teams have won the most games in NBA history:
head(tot_wins, 10)
## Source: local data frame [10 x 2]
##
## Team Total
## (chr) (dbl)
## 1 LAL 3232
## 2 BOS 3212
## 3 PHI 2725
## 4 NYK 2696
## 5 ATL 2639
## 6 DET 2606
## 7 GSW 2576
## 8 SAC 2456
## 9 OKC 2128
## 10 PHO 2116
The list above is not an average, but a running total, so older teams have more wins than the more recent league expansion teams. Still, we can see a lot of familiar teams that have had consistent success over the years.
avg_wins <- nba %>%
  group_by(Team) %>%
  summarise(Avg_Wins = mean(Wins)) %>% # average wins per season played
  arrange(desc(Avg_Wins))
Just out of curiosity, I changed the summarise function to compute an average, so we can see the teams that average the most wins, regardless of how many seasons they have been in existence:
head(avg_wins, 10)
## Source: local data frame [10 x 2]
##
## Team Avg_Wins
## (chr) (dbl)
## 1 SAS 49.87500
## 2 LAL 47.52941
## 3 BOS 45.88571
## 4 PHO 44.08333
## 5 OKC 43.42857
## 6 POR 43.08696
## 7 UTA 42.97619
## 8 CHI 42.10000
## 9 HOU 41.71429
## 10 MIA 41.42857
Looks like if the San Antonio Spurs had been in existence a little longer, they would probably overtake the LA Lakers in total wins.
Again, I used the group_by function to group by team, and then simply selected the top number from the “Wins” column to get the highest number of wins by a team in a single season (not including post-season play):
team_max_wins <- nba %>%
  group_by(Team) %>%
  top_n(1, Wins) # each team's best season (top_n keeps ties)
Below is an alphabetical list of each NBA team, the highest number of wins achieved, and the season in which this feat was accomplished.
head(team_max_wins, 10)
## Source: local data frame [10 x 4]
## Groups: Team [10]
##
## Season Lg Team Wins
## (chr) (chr) (chr) (dbl)
## 1 2014-15 NBA ATL 60
## 2 1972-73 NBA BOS 68
## 3 2001-02 NBA BRK 52
## 4 1995-96 NBA CHI 72
## 5 1996-97 NBA CHO 54
## 6 2008-09 NBA CLE 66
## 7 2006-07 NBA DAL 67
## 8 2012-13 NBA DEN 57
## 9 2005-06 NBA DET 64
## 10 2014-15 NBA GSW 67
wins_by_yr <- nba %>%
  group_by(Season) %>%
  top_n(1, Wins) %>%    # team(s) with the most wins each season
  distinct() %>%        # drop any duplicated rows
  arrange(desc(Season))
Below I have created a plot of the winningest team in each of the last 20 seasons in the data, just to illustrate the “bar” that most NBA teams shoot for each season (60 wins). Interestingly, many of the teams with the most wins were also the team that won the championship that year; this does not always hold true in other major professional sports.
plot_wins <- tail(wins_by_yr, 20)
# plot of each season's team with most wins
ggplot(plot_wins, aes(Season, Wins)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = Team), vjust = 1.5, color = "white") +
  labs(title = "Most Wins Each Season")
The next data set I chose to tidy and analyze was the “Leading Causes of Death in NYC”, provided by Armenoush Aslanian-persico.
# open file
path1 <- ("https://raw.githubusercontent.com/Logan213/DATA607_Project2/master/New_York_City_Leading_Causes_of_Death.csv")
con1 <- file(path1, open="r")
nyc_deaths <- read.csv(con1, header = TRUE, stringsAsFactors = FALSE)
# close file
close(con1)
head(nyc_deaths, 10)
## Year Ethnicity Sex
## 1 2010 NON-HISPANIC BLACK MALE
## 2 2010 NON-HISPANIC BLACK MALE
## 3 2010 NON-HISPANIC BLACK MALE
## 4 2010 NON-HISPANIC BLACK MALE
## 5 2010 NON-HISPANIC BLACK MALE
## 6 2010 NON-HISPANIC BLACK MALE
## 7 2010 NON-HISPANIC BLACK MALE
## 8 2010 NON-HISPANIC BLACK MALE
## 9 2010 NON-HISPANIC BLACK MALE
## 10 2010 NON-HISPANIC BLACK MALE
## Cause.of.Death Count Percent
## 1 HUMAN IMMUNODEFICIENCY VIRUS DISEASE 297 5
## 2 INFLUENZA AND PNEUMONIA 201 3
## 3 INTENTIONAL SELF-HARM (SUICIDE) 64 1
## 4 MALIGNANT NEOPLASMS 1540 23
## 5 MENTAL DISORDERS DUE TO USE OF ALCOHOL 50 1
## 6 NEPHRITIS, NEPHROTIC SYNDROME AND NEPHROSIS 70 1
## 7 PEPTIC ULCER 13 0
## 8 PSYCH. SUBSTANCE USE & ACCIDENTAL DRUG POISONING 111 2
## 9 SEPTICEMIA 36 1
## 10 SHORT GESTATION/LBW 35 1
Because the data were already provided in a fairly structured format, they did not require a lot of tidying in the sense of removing rows, renaming headers, etc. I did notice there were some repeated rows, so I simply added the distinct function while “piping” through my other functions.
I sorted the data using group_by, and then found out about the tally function, which makes for a quick and easy way to sum up the column that is passed into it. Any grouped data is maintained, so the combination of these two functions quickly gave me the total number for each cause of death. Using top_n, I selected the highest of each of these values, and then arranged the data from earliest year to most recent.
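As a minimal sketch of that equivalence (my illustration, using the same grouping as below), tally with a weight column is shorthand for summarise with a sum:
# these two pipelines produce the same grouped totals (sketch):
# nyc_deaths %>% group_by(Year, Sex, Cause.of.Death) %>% tally(Count)
# nyc_deaths %>% group_by(Year, Sex, Cause.of.Death) %>% summarise(n = sum(Count))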
gender_cause <- nyc_deaths %>%
  distinct() %>%                          # remove repeated rows
  group_by(Year, Sex, Cause.of.Death) %>%
  tally(Count, sort = TRUE) %>%           # sum Count within each group
  top_n(1, n) %>%                         # top cause for each Year/Sex group
  arrange(Year)
For each year, Diseases of Heart is the clear winner, for both males and females:
gender_cause
## Source: local data frame [10 x 4]
## Groups: Year, Sex [10]
##
## Year Sex Cause.of.Death n
## (int) (chr) (chr) (int)
## 1 2007 FEMALE DISEASES OF HEART 11618
## 2 2007 MALE DISEASES OF HEART 9577
## 3 2008 FEMALE DISEASES OF HEART 11462
## 4 2008 MALE DISEASES OF HEART 9456
## 5 2009 FEMALE DISEASES OF HEART 10630
## 6 2009 MALE DISEASES OF HEART 9173
## 7 2010 FEMALE DISEASES OF HEART 9343
## 8 2010 MALE DISEASES OF HEART 8344
## 9 2011 FEMALE DISEASES OF HEART 9009
## 10 2011 MALE DISEASES OF HEART 7713
Similar to the question above, I used a combination of group_by, tally, and top_n to arrange and sum the data to get the required answer. However, this time I used “Ethnicity” instead of gender so that I could see the top cause for each ethnic group.
ethnic_cause <- nyc_deaths %>%
  distinct() %>%
  group_by(Year, Ethnicity, Cause.of.Death) %>%
  tally(Count, sort = TRUE) %>% # sum Count within each group
  top_n(1, n)                   # top cause for each Year/Ethnicity group
Below we can see that, as expected, “Diseases of Heart” appears the most. The only other cause of death that shows up, oddly enough, is “Malignant Neoplasms”, which appears for Asian & Pacific Islanders in 2010 and 2011:
ethnic_cause
## Source: local data frame [20 x 4]
## Groups: Year, Ethnicity [20]
##
## Year Ethnicity Cause.of.Death n
## (int) (chr) (chr) (int)
## 1 2007 ASIAN & PACIFIC ISLANDER DISEASES OF HEART 925
## 2 2007 HISPANIC DISEASES OF HEART 2745
## 3 2007 NON-HISPANIC BLACK DISEASES OF HEART 4843
## 4 2007 NON-HISPANIC WHITE DISEASES OF HEART 12682
## 5 2008 ASIAN & PACIFIC ISLANDER DISEASES OF HEART 1002
## 6 2008 HISPANIC DISEASES OF HEART 2775
## 7 2008 NON-HISPANIC BLACK DISEASES OF HEART 4802
## 8 2008 NON-HISPANIC WHITE DISEASES OF HEART 12339
## 9 2009 ASIAN & PACIFIC ISLANDER DISEASES OF HEART 1004
## 10 2009 HISPANIC DISEASES OF HEART 2731
## 11 2009 NON-HISPANIC BLACK DISEASES OF HEART 4603
## 12 2009 NON-HISPANIC WHITE DISEASES OF HEART 11465
## 13 2010 ASIAN & PACIFIC ISLANDER MALIGNANT NEOPLASMS 943
## 14 2010 HISPANIC DISEASES OF HEART 2671
## 15 2010 NON-HISPANIC BLACK DISEASES OF HEART 4297
## 16 2010 NON-HISPANIC WHITE DISEASES OF HEART 9846
## 17 2011 ASIAN & PACIFIC ISLANDER MALIGNANT NEOPLASMS 1004
## 18 2011 HISPANIC DISEASES OF HEART 2549
## 19 2011 NON-HISPANIC BLACK DISEASES OF HEART 4083
## 20 2011 NON-HISPANIC WHITE DISEASES OF HEART 9236
cause_change <- nyc_deaths %>%
  distinct() %>%
  filter(Year == 2007 | Year == 2011) %>% # keep only the first and last years
  group_by(Year, Cause.of.Death) %>%
  tally(Count) %>%                        # total deaths per cause, per year
  spread(Year, n) %>%                     # one column per year
  mutate(Pct_Change = ((`2011` - `2007`) / `2007`) * 100)
most <- arrange(cause_change, desc(Pct_Change))
least <- arrange(cause_change, Pct_Change)
After filtering the data and creating the calculated column, the same object was used to create the most and least objects, each one basically showing the data sorted a different way.
There has been a large increase in both Alzheimer's and Parkinson's diseases; this may be due to better detection methods. Interestingly enough, the number one cause of death across all years (diseases of heart) is actually in the group with the largest decline.
# Increased the most
head(most, 10)
## Source: local data frame [10 x 4]
##
## Cause.of.Death 2007 2011 Pct_Change
## (chr) (int) (int) (dbl)
## 1 ALZHEIMERS DISEASE 276 619 124.27536
## 2 PARKINSONS DISEASE 100 185 85.00000
## 3 ANEMIAS 37 67 81.08108
## 4 PEPTIC ULCER 71 94 32.39437
## 5 CHRONIC LOWER RESPIRATORY DISEASES 1411 1758 24.59249
## 6 ESSENTIAL HYPERTENSION AND RENAL DISEASES 782 955 22.12276
## 7 CHRONIC LIVER DISEASE AND CIRRHOSIS 445 542 21.79775
## 8 DIABETES MELLITUS 1535 1735 13.02932
## 9 CEREBROVASCULAR DISEASE 1547 1740 12.47576
## 10 INFLUENZA AND PNEUMONIA 2219 2463 10.99594
# Decreased the most
head(least, 10)
## Source: local data frame [10 x 4]
##
## Cause.of.Death 2007 2011 Pct_Change
## (chr) (int) (int) (dbl)
## 1 CARDIOVASCULAR DISORDERS IN PERINATAL PERIOD 76 24 -68.421053
## 2 HUMAN IMMUNODEFICIENCY VIRUS DISEASE 1095 747 -31.780822
## 3 AORTIC ANEURYSM AND DISSECTION 222 157 -29.279279
## 4 PREGNANCY, CHILDBIRTH AND THE PUERPERIUM 33 25 -24.242424
## 5 CHOLELITHIASIS AND DISORDERS OF GALLBLADDER 27 21 -22.222222
## 6 DISEASES OF HEART 21195 16722 -21.104034
## 7 ATHEROSCLEROSIS 171 143 -16.374269
## 8 PSYCH. SUBSTANCE USE & ACCIDENTAL DRUG POISONING 834 748 -10.311751
## 9 MENTAL DISORDERS DUE TO USE OF ALCOHOL 211 200 -5.213270
## 10 SEPTICEMIA 387 375 -3.100775
For this question, I again used the cause_change object that I created, and simply kept anything within a 10% change in rate, up or down, from 2007 to 2011. Short of using more scientific methods to calculate stability, I took the causes of death whose numbers remained the same, or close to it, from 2007 to 2011. Note that this does not calculate a year-over-year average or similar measure.
What is probably most interesting about this data is that many of these causes can be attributed to human behavior (suicide, homicide, alcohol use). I guess humans are creatures of habit…
stable <- cause_change %>%
  filter(Pct_Change >= -10 & Pct_Change <= 10) %>% # within +/- 10% of 2007
  arrange(Pct_Change)
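# note: dplyr's between() helper gives an equivalent, slightly tidier filter (sketch):
#   filter(between(Pct_Change, -10, 10))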
stable
## Source: local data frame [11 x 4]
##
## Cause.of.Death 2007 2011 Pct_Change
## (chr) (int) (int) (dbl)
## 1 MENTAL DISORDERS DUE TO USE OF ALCOHOL 211 200 -5.213270
## 2 SEPTICEMIA 387 375 -3.100775
## 3 ACCIDENTS EXCEPT DRUG POISONING 1016 1002 -1.377953
## 4 CONGENITAL MALFORMATIONS,DEFORMATIONS 232 232 0.000000
## 5 SHORT GESTATION/LBW 92 92 0.000000
## 6 VIRAL HEPATITIS 357 363 1.680672
## 7 MALIGNANT NEOPLASMS 13104 13336 1.770452
## 8 INTENTIONAL SELF-HARM (SUICIDE) 470 485 3.191489
## 9 NEPHRITIS, NEPHROTIC SYNDROME AND NEPHROSIS 430 445 3.488372
## 10 ASSAULT (HOMICIDE) 503 521 3.578529
## 11 BENIGN AND UNCERTAIN NEOPLASMS 242 263 8.677686
The last data set I chose was the “Generator Capacity Prices”, provided by Daniel Smilowitz. Because of the double header rows, I read in the whole .csv with the header parameter set to FALSE, then captured the first two rows and pasted them together, renamed the first column header, and used the result to rename the headers of the generator data. After doing this, re-arranging and tidying the data became much easier.
# open file
path2 <- ("https://raw.githubusercontent.com/Logan213/DATA607_Project2/master/UCAP.csv")
con2 <- file(path2, open="r")
generate <- read.csv(con2, header = FALSE, stringsAsFactors = FALSE)
# close file
close(con2)
# build a single header by pasting the two header rows together
row1 <- generate[1, ]
row2 <- generate[2, ]
gen_header <- paste(row1, row2, sep = " ")
gen_header[1] <- "Date" # the first column holds the date
# drop the two header rows and apply the combined names
generate <- slice(generate, c(-1, -2))
names(generate) <- gen_header
generate <- generate %>%
  gather("Type", "Amount", 2:13) %>%       # wide to long
  separate(Date, c("Month", "Year")) %>%   # split dates like "Nov 03"
  separate(Type, c("Auction", "Location")) # split names like "Monthly NYC"
# strip dollar signs and convert the prices to numeric
generate$Amount <- as.numeric(sub("\\$", "", generate$Amount))
## Warning: NAs introduced by coercion
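The warning is expected: blank price entries become NA during coercion. As a quick sanity check (my addition, not part of the original analysis), one could count the values that failed to parse:
# number of prices that failed numeric conversion (sketch)
sum(is.na(generate$Amount))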
head(generate, 5)
## Month Year Auction Location Amount
## 1 Nov 03 Monthly NYC 6.67
## 2 Dec 03 Monthly NYC 6.67
## 3 Jan 04 Monthly NYC 6.67
## 4 Feb 04 Monthly NYC 6.95
## 5 Mar 04 Monthly NYC 6.25
It was a little more difficult to arrange the data to answer these questions, but I believe I retrieved the necessary results.
For this question, I used an average across all years provided. Using a combination of group_by and summarise, I was able to quickly filter and arrange the generator data to display the average price by month and location. Then, using top_n, I selected the month with the highest average price for each region.
high_price_mo <- generate %>%
  filter(!is.na(Amount)) %>%        # drop the unparsed/missing prices
  group_by(Location, Month) %>%
  summarise(avg = mean(Amount)) %>% # average price per month, per region
  top_n(1, avg)                     # month with the highest average, per region
Below is the summarized data - looks like September is a very expensive month across the board for all regions. Of course, the NYC region is the most expensive by far.
high_price_mo
## Source: local data frame [4 x 3]
## Groups: Location [4]
##
## Location Month avg
## (chr) (chr) (dbl)
## 1 LHV Sep 3.476667
## 2 LI Sep 4.790278
## 3 NYC Sep 11.892500
## 4 ROS Sep 2.580833
To find the average price difference between the NYC and “rest of state” regions, I first filtered the data to remove the missing values, and then selected only the regions relevant to the question. Using a combination of the group_by and spread functions, I arranged the data so that I could add a calculated column showing the difference in the price paid to generators in the two regions. Using tally, I summed the differences within each month and year, and then used summarise to come up with the average difference for each month.
nyc_vs_ros <- generate %>%
  filter(!is.na(Amount)) %>% # drop the unparsed/missing prices
  filter(Location == "NYC" | Location == "ROS") %>%
  group_by(Month, Year) %>%
  spread(Location, Amount) %>%       # one price column per region
  mutate(Price_diff = NYC - ROS) %>% # NYC premium over rest of state
  tally(Price_diff) %>%              # sum of differences per Month/Year
  summarise(Avg_dif = mean(n)) %>%   # average difference per month, across years
  arrange(desc(Avg_dif))
We can see the results in the data below; surprisingly, the coldest months of the year have the lowest differential in price:
nyc_vs_ros
## Source: local data frame [12 x 2]
##
## Month Avg_dif
## (chr) (dbl)
## 1 May 29.32833
## 2 Jun 28.77750
## 3 Oct 28.36250
## 4 Aug 28.12667
## 5 Jul 28.11750
## 6 Sep 27.93500
## 7 Mar 13.21692
## 8 Apr 12.64231
## 9 Nov 12.52538
## 10 Feb 12.23000
## 11 Dec 12.22923
## 12 Jan 11.50846
To find the answer to this question, I grouped the data by Year and Location, and then used summarise to “compress” the months of each year into an average price for that year. I then spread this data across four columns, and used mutate to add a calculated column that adds up the four regions and divides by 4 to get the average. I had to use this method because using mean again was causing some issues, most likely because mean() takes a single vector, so a call like mean(LHV, LI, NYC, ROS) would not average across the four columns.
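If mean were wanted here after all, a rowwise construction should work; a sketch, assuming dplyr's rowwise():
# hypothetical alternative to the (LHV + LI + NYC + ROS) / 4 approach:
# ... %>% rowwise() %>% mutate(ALL_Regions = mean(c(LHV, LI, NYC, ROS)))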
high_by_region <- generate %>%
  filter(!is.na(Amount)) %>%              # drop the unparsed/missing prices
  group_by(Year, Location) %>%
  summarise(Avg_Price = mean(Amount)) %>% # average price per year, per region
  spread(Location, Avg_Price) %>%         # one column per region
  mutate(ALL_Regions = (LHV + LI + NYC + ROS) / 4) # average across the regions
Below we can see the average price paid to generators across each of the four regions, with the ALL_Regions column added. This column is an average of the four regions, and we can see that 2014 had the highest average price paid to generators.
high_by_region
## Source: local data frame [14 x 6]
## Groups: Year [14]
##
## Year LHV LI NYC ROS ALL_Regions
## (chr) (dbl) (dbl) (dbl) (dbl) (dbl)
## 1 03 1.4650000 4.9766667 6.733333 1.4650000 3.660000
## 2 04 1.2280556 7.0438889 9.044444 1.2280556 4.636111
## 3 05 0.8172222 6.8991667 9.228889 0.8172222 4.440625
## 4 06 1.8297222 5.4408333 9.251667 1.8297222 4.587986
## 5 07 2.4558333 4.6327273 9.098056 2.4558333 4.660612
## 6 08 2.2494444 2.6559375 4.825556 2.2494444 2.995095
## 7 09 2.3513889 2.4313889 4.934167 2.3513889 3.017083
## 8 10 1.6077778 1.6736111 9.055278 1.6077778 3.486111
## 9 11 0.3433333 0.3438889 7.096111 0.3433333 2.031667
## 10 12 1.1208333 1.4747222 7.958056 1.1208333 2.918611
## 11 13 3.6122222 4.7125000 10.846389 3.6122222 5.695833
## 12 14 7.6408333 4.9508333 13.245000 4.2941667 7.532708
## 13 15 6.5630556 4.0811111 11.573889 2.7991667 6.254306
## 14 16 3.4170000 1.5960000 6.256000 1.1960000 3.116250