I used dplyr and tidyr, since they are required for this assignment, and ggplot2 to create a basic bar plot summarising the NBA win data.
require(dplyr)
require(tidyr)
require(ggplot2)
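A side note on loading packages: require() only returns FALSE with a warning when a package is missing, while library() stops with an error. If a hard stop is preferred, a minimal guard (a sketch, not part of the original workflow) would be:
# stop early if any required package is missing (sketch)
for (pkg in c("dplyr", "tidyr", "ggplot2")) {
  if (!requireNamespace(pkg, quietly = TRUE)) stop("missing package: ", pkg)
}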
The first data set that I chose to tidy and analyze was the “Basketball Wins” data provided by Daniel Brooks. For all three data sets, I used either the supplied .csv, or copy/pasted the data into Excel and then saved it as a .csv (I only used this method to save the file; all tidying was done in R). Each file was then loaded into my GitHub repository, and I read the data into R from the raw URL, as shown below.
# open file
path <- ("https://raw.githubusercontent.com/Logan213/DATA607_Project2/master/NBA_Wins.csv")
con <- file(path, open="r")
nba <- read.csv(con, header = TRUE, stringsAsFactors = FALSE)
# close file
close(con)
head(nba[1:20], 10)
## Rk Season Lg ATL BOS BRK CHI CHO CLE DAL DEN DET GSW HOU IND LAC LAL
## 1 1 2015-16 NBA 37 39 18 32 37 46 33 28 34 59 33 35 42 14
## 2 2 2014-15 NBA 60 40 38 50 33 53 50 30 32 67 56 38 56 21
## 3 3 2013-14 NBA 38 25 44 48 43 33 49 36 29 51 54 56 57 27
## 4 4 2012-13 NBA 44 41 49 45 21 24 41 57 29 47 45 49 56 45
## 5 5 2011-12 NBA 40 39 22 50 7 21 36 38 25 23 34 42 40 41
## 6 6 2010-11 NBA 44 56 24 62 34 19 57 50 30 36 43 37 32 57
## 7 7 2009-10 NBA 53 50 12 41 44 61 55 53 27 26 42 32 29 57
## 8 8 2008-09 NBA 47 62 34 41 35 66 50 54 39 29 53 36 19 65
## 9 9 2007-08 NBA 37 66 34 33 32 45 51 50 59 48 55 36 23 57
## 10 10 2006-07 NBA 30 24 41 49 33 50 67 45 53 42 52 35 40 42
## MEM MIA MIL
## 1 39 38 28
## 2 55 37 41
## 3 50 54 15
## 4 56 66 38
## 5 41 46 31
## 6 46 58 35
## 7 40 47 46
## 8 24 43 34
## 9 22 15 26
## 10 22 44 28
The NBA Wins data contained rows that were basically a repeat of the header, as well as a “rank” column that was really just the ordering of the NBA/ABA seasons. I removed these with the slice and select functions, respectively.
There was also a “Total” row for each team; since I was going to use dplyr to calculate any totals, I removed the rows containing this term. Lastly, I pivoted the data using the gather function to go from the wide format, with the teams spread across columns, to a nice, tidy long format. Since I would be doing calculations on the number of wins, I converted that column to numeric.
# remove repeated header rows and the "rank" column, then reshape
nba <- nba %>%
  slice(c(-21, -42, -63)) %>%     # drop the repeated header rows
  select(Season:WAS) %>%          # drop the "Rk" column
  filter(Season != "Total") %>%   # drop the "Total" rows
  gather(Team, Wins, ATL:WAS) %>% # pivot from wide to tidy long format
  filter(Wins != "")              # drop seasons where a team did not exist
# convert wins to numeric for later calculations
nba$Wins <- as.numeric(nba$Wins)
# all seasons, sorted by single-season wins
all_time <- arrange(nba, desc(Wins))
head(nba, 10)
## Season Lg Team Wins
## 1 2015-16 NBA ATL 37
## 2 2014-15 NBA ATL 60
## 3 2013-14 NBA ATL 38
## 4 2012-13 NBA ATL 44
## 5 2011-12 NBA ATL 40
## 6 2010-11 NBA ATL 44
## 7 2009-10 NBA ATL 53
## 8 2008-09 NBA ATL 47
## 9 2007-08 NBA ATL 37
## 10 2006-07 NBA ATL 30
For the analysis of the NBA data, I used “piping” (%>%) to make the code more readable. I also found that this method allows for quicker adjustments when tidying or arranging the data, since the rest of the code can be kept while one or two parameters are changed.
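For comparison, the tot_wins summary shown below, written without pipes, would nest the calls inside-out; a quick sketch of the equivalent:
# nested equivalent of the piped tot_wins summary (sketch):
# arrange(summarise(group_by(nba, Team), Total = sum(Wins)), desc(Total))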
Since we’re mainly dealing with teams, seasons, and the wins in each season, it was fairly simple to group the data by Team and then summarise the Wins.
tot_wins <- nba %>%
  group_by(Team) %>%
  summarise(Total = sum(Wins)) %>% # total franchise wins across all seasons
  arrange(desc(Total))
I then arranged by the total wins (a calculated column), just to show which teams have won the most games in NBA history:
head(tot_wins, 10)
## Source: local data frame [10 x 2]
##
## Team Total
## (chr) (dbl)
## 1 LAL 3232
## 2 BOS 3212
## 3 PHI 2725
## 4 NYK 2696
## 5 ATL 2639
## 6 DET 2606
## 7 GSW 2576
## 8 SAC 2456
## 9 OKC 2128
## 10 PHO 2116
The list above is not an average, but a running total, so older teams have more wins than the more recent league expansion teams. Still, we can see a lot of familiar teams that have had consistent success over the years.
avg_wins <- nba %>%
  group_by(Team) %>%
  summarise(Avg_Wins = mean(Wins)) %>% # average wins per season played
  arrange(desc(Avg_Wins))
Just out of curiosity, I changed the summarise function to compute an average, so we can see the teams that average the most wins, regardless of how many seasons they have been in existence:
head(avg_wins, 10)
## Source: local data frame [10 x 2]
##
## Team Avg_Wins
## (chr) (dbl)
## 1 SAS 49.87500
## 2 LAL 47.52941
## 3 BOS 45.88571
## 4 PHO 44.08333
## 5 OKC 43.42857
## 6 POR 43.08696
## 7 UTA 42.97619
## 8 CHI 42.10000
## 9 HOU 41.71429
## 10 MIA 41.42857
Looks like if the San Antonio Spurs had been in existence a little longer, they would probably overtake the LA Lakers in total wins.
Again, I used the group_by function to group by team, and then simply selected the top number from the “Wins” column to get the highest number of wins by a team in a single season (not including post-season play):
team_max_wins <- nba %>%
  group_by(Team) %>%
  top_n(1, Wins) # each team's best season (top_n keeps ties)
Below is an alphabetical list of each NBA team, the highest number of wins achieved, and the season in which this feat was accomplished.
head(team_max_wins, 10)
## Source: local data frame [10 x 4]
## Groups: Team [10]
##
## Season Lg Team Wins
## (chr) (chr) (chr) (dbl)
## 1 2014-15 NBA ATL 60
## 2 1972-73 NBA BOS 68
## 3 2001-02 NBA BRK 52
## 4 1995-96 NBA CHI 72
## 5 1996-97 NBA CHO 54
## 6 2008-09 NBA CLE 66
## 7 2006-07 NBA DAL 67
## 8 2012-13 NBA DEN 57
## 9 2005-06 NBA DET 64
## 10 2014-15 NBA GSW 67
wins_by_yr <- nba %>%
  group_by(Season) %>%
  top_n(1, Wins) %>%    # team(s) with the most wins each season
  distinct() %>%        # drop any duplicated rows
  arrange(desc(Season))
Below I have created a plot of the winningest team in each of the last 20 seasons in the data, just to illustrate the “bar” that most NBA teams shoot for each season (60 wins). Interestingly, many of the teams with the most wins were also the team that won the championship that year; this does not always hold true in other major professional sports.
plot_wins <- tail(wins_by_yr, 20)
# plot of each season's team with most wins
ggplot(plot_wins, aes(Season, Wins)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = Team), vjust = 1.5, color = "white") +
  labs(title = "Most Wins Each Season")
The next data set I chose to tidy and analyze was the “Leading Causes of Death in NYC”, provided by Armenoush Aslanian-persico.
# open file
path1 <- ("https://raw.githubusercontent.com/Logan213/DATA607_Project2/master/New_York_City_Leading_Causes_of_Death.csv")
con1 <- file(path1, open="r")
nyc_deaths <- read.csv(con1, header = TRUE, stringsAsFactors = FALSE)
# close file
close(con1)
head(nyc_deaths, 10)
## Year Ethnicity Sex
## 1 2010 NON-HISPANIC BLACK MALE
## 2 2010 NON-HISPANIC BLACK MALE
## 3 2010 NON-HISPANIC BLACK MALE
## 4 2010 NON-HISPANIC BLACK MALE
## 5 2010 NON-HISPANIC BLACK MALE
## 6 2010 NON-HISPANIC BLACK MALE
## 7 2010 NON-HISPANIC BLACK MALE
## 8 2010 NON-HISPANIC BLACK MALE
## 9 2010 NON-HISPANIC BLACK MALE
## 10 2010 NON-HISPANIC BLACK MALE
## Cause.of.Death Count Percent
## 1 HUMAN IMMUNODEFICIENCY VIRUS DISEASE 297 5
## 2 INFLUENZA AND PNEUMONIA 201 3
## 3 INTENTIONAL SELF-HARM (SUICIDE) 64 1
## 4 MALIGNANT NEOPLASMS 1540 23
## 5 MENTAL DISORDERS DUE TO USE OF ALCOHOL 50 1
## 6 NEPHRITIS, NEPHROTIC SYNDROME AND NEPHROSIS 70 1
## 7 PEPTIC ULCER 13 0
## 8 PSYCH. SUBSTANCE USE & ACCIDENTAL DRUG POISONING 111 2
## 9 SEPTICEMIA 36 1
## 10 SHORT GESTATION/LBW 35 1
Because the data were already provided in a fairly structured format, they did not require a lot of tidying in the sense of removing rows, renaming headers, etc. I did notice there were some repeated rows, so I simply added the distinct function while “piping” through my other functions.
I sorted the data using group_by, and then found out about the tally function, which makes for a quick and easy way to sum up the column that is passed into it. Any grouped data is maintained, so the combination of these two functions quickly gave me the total number for each cause of death. Using top_n, I selected the highest of each of these values, and then arranged the data from earliest year to most recent.
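As a minimal sketch of that equivalence (my illustration, using the same grouping as below), tally with a weight column is shorthand for summarise with a sum:
# these two pipelines produce the same grouped totals (sketch):
# nyc_deaths %>% group_by(Year, Sex, Cause.of.Death) %>% tally(Count)
# nyc_deaths %>% group_by(Year, Sex, Cause.of.Death) %>% summarise(n = sum(Count))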
gender_cause <- nyc_deaths %>%
  distinct() %>%                          # remove repeated rows
  group_by(Year, Sex, Cause.of.Death) %>%
  tally(Count, sort = TRUE) %>%           # sum Count within each group
  top_n(1, n) %>%                         # top cause for each Year/Sex group
  arrange(Year)
For each year, Diseases of Heart is the clear winner, for both males and females:
gender_cause
## Source: local data frame [10 x 4]
## Groups: Year, Sex [10]
##
## Year Sex Cause.of.Death n
## (int) (chr) (chr) (int)
## 1 2007 FEMALE DISEASES OF HEART 11618
## 2 2007 MALE DISEASES OF HEART 9577
## 3 2008 FEMALE DISEASES OF HEART 11462
## 4 2008 MALE DISEASES OF HEART 9456
## 5 2009 FEMALE DISEASES OF HEART 10630
## 6 2009 MALE DISEASES OF HEART 9173
## 7 2010 FEMALE DISEASES OF HEART 9343
## 8 2010 MALE DISEASES OF HEART 8344
## 9 2011 FEMALE DISEASES OF HEART 9009
## 10 2011 MALE DISEASES OF HEART 7713
Similar to the question above, I used a combination of group_by, tally, and top_n to arrange and sum the data to get the required answer. However, this time I used “Ethnicity” instead of gender so that I could see the top cause for each ethnic group.
ethnic_cause <- nyc_deaths %>%
  distinct() %>%
  group_by(Year, Ethnicity, Cause.of.Death) %>%
  tally(Count, sort = TRUE) %>% # sum Count within each group
  top_n(1, n)                   # top cause for each Year/Ethnicity group
Below we can see that, as expected, “Diseases of Heart” appears the most. The only other cause of death that shows up, oddly enough, is “Malignant Neoplasms”, which appears for Asian & Pacific Islanders in 2010 and 2011:
ethnic_cause
## Source: local data frame [20 x 4]
## Groups: Year, Ethnicity [20]
##
## Year Ethnicity Cause.of.Death n
## (int) (chr) (chr) (int)
## 1 2007 ASIAN & PACIFIC ISLANDER DISEASES OF HEART 925
## 2 2007 HISPANIC DISEASES OF HEART 2745
## 3 2007 NON-HISPANIC BLACK DISEASES OF HEART 4843
## 4 2007 NON-HISPANIC WHITE DISEASES OF HEART 12682
## 5 2008 ASIAN & PACIFIC ISLANDER DISEASES OF HEART 1002
## 6 2008 HISPANIC DISEASES OF HEART 2775
## 7 2008 NON-HISPANIC BLACK DISEASES OF HEART 4802
## 8 2008 NON-HISPANIC WHITE DISEASES OF HEART 12339
## 9 2009 ASIAN & PACIFIC ISLANDER DISEASES OF HEART 1004
## 10 2009 HISPANIC DISEASES OF HEART 2731
## 11 2009 NON-HISPANIC BLACK DISEASES OF HEART 4603
## 12 2009 NON-HISPANIC WHITE DISEASES OF HEART 11465
## 13 2010 ASIAN & PACIFIC ISLANDER MALIGNANT NEOPLASMS 943
## 14 2010 HISPANIC DISEASES OF HEART 2671
## 15 2010 NON-HISPANIC BLACK DISEASES OF HEART 4297
## 16 2010 NON-HISPANIC WHITE DISEASES OF HEART 9846
## 17 2011 ASIAN & PACIFIC ISLANDER MALIGNANT NEOPLASMS 1004
## 18 2011 HISPANIC DISEASES OF HEART 2549
## 19 2011 NON-HISPANIC BLACK DISEASES OF HEART 4083
## 20 2011 NON-HISPANIC WHITE DISEASES OF HEART 9236
cause_change <- nyc_deaths %>%
  distinct() %>%
  filter(Year == 2007 | Year == 2011) %>% # keep only the first and last years
  group_by(Year, Cause.of.Death) %>%
  tally(Count) %>%                        # total deaths per cause, per year
  spread(Year, n) %>%                     # one column per year
  mutate(Pct_Change = ((`2011` - `2007`) / `2007`) * 100)
most <- arrange(cause_change, desc(Pct_Change))
least <- arrange(cause_change, Pct_Change)
After filtering the data and creating the calculated column, the same object was used to create the most and least objects, each one basically showing the data sorted a different way.
There has been a large increase in both Alzheimer's and Parkinson's diseases; this may be due to better detection methods. Interestingly enough, the number one cause of death across all years (diseases of heart) is actually in the group with the largest decline.
# Increased the most
head(most, 10)
## Source: local data frame [10 x 4]
##
## Cause.of.Death 2007 2011 Pct_Change
## (chr) (int) (int) (dbl)
## 1 ALZHEIMERS DISEASE 276 619 124.27536
## 2 PARKINSONS DISEASE 100 185 85.00000
## 3 ANEMIAS 37 67 81.08108
## 4 PEPTIC ULCER 71 94 32.39437
## 5 CHRONIC LOWER RESPIRATORY DISEASES 1411 1758 24.59249
## 6 ESSENTIAL HYPERTENSION AND RENAL DISEASES 782 955 22.12276
## 7 CHRONIC LIVER DISEASE AND CIRRHOSIS 445 542 21.79775
## 8 DIABETES MELLITUS 1535 1735 13.02932
## 9 CEREBROVASCULAR DISEASE 1547 1740 12.47576
## 10 INFLUENZA AND PNEUMONIA 2219 2463 10.99594
# Decreased the most
head(least, 10)
## Source: local data frame [10 x 4]
##
## Cause.of.Death 2007 2011 Pct_Change
## (chr) (int) (int) (dbl)
## 1 CARDIOVASCULAR DISORDERS IN PERINATAL PERIOD 76 24 -68.421053
## 2 HUMAN IMMUNODEFICIENCY VIRUS DISEASE 1095 747 -31.780822
## 3 AORTIC ANEURYSM AND DISSECTION 222 157 -29.279279
## 4 PREGNANCY, CHILDBIRTH AND THE PUERPERIUM 33 25 -24.242424
## 5 CHOLELITHIASIS AND DISORDERS OF GALLBLADDER 27 21 -22.222222
## 6 DISEASES OF HEART 21195 16722 -21.104034
## 7 ATHEROSCLEROSIS 171 143 -16.374269
## 8 PSYCH. SUBSTANCE USE & ACCIDENTAL DRUG POISONING 834 748 -10.311751
## 9 MENTAL DISORDERS DUE TO USE OF ALCOHOL 211 200 -5.213270
## 10 SEPTICEMIA 387 375 -3.100775
For this question, I again used the cause_change object that I created, and simply kept anything within a 10% change in rate, up or down, from 2007 to 2011. Short of using more scientific methods to calculate stability, I took the causes of death whose numbers remained the same, or close to it, from 2007 to 2011. Note that this does not calculate a year-over-year average or similar measure.
What is probably most interesting about this data is that many of these causes can be attributed to human behavior (suicide, homicide, alcohol use). I guess humans are creatures of habit…
stable <- cause_change %>%
  filter(Pct_Change >= -10 & Pct_Change <= 10) %>% # within +/- 10% of 2007
  arrange(Pct_Change)
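# note: dplyr's between() helper gives an equivalent, slightly tidier filter (sketch):
#   filter(between(Pct_Change, -10, 10))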
stable
## Source: local data frame [11 x 4]
##
## Cause.of.Death 2007 2011 Pct_Change
## (chr) (int) (int) (dbl)
## 1 MENTAL DISORDERS DUE TO USE OF ALCOHOL 211 200 -5.213270
## 2 SEPTICEMIA 387 375 -3.100775
## 3 ACCIDENTS EXCEPT DRUG POISONING 1016 1002 -1.377953
## 4 CONGENITAL MALFORMATIONS,DEFORMATIONS 232 232 0.000000
## 5 SHORT GESTATION/LBW 92 92 0.000000
## 6 VIRAL HEPATITIS 357 363 1.680672
## 7 MALIGNANT NEOPLASMS 13104 13336 1.770452
## 8 INTENTIONAL SELF-HARM (SUICIDE) 470 485 3.191489
## 9 NEPHRITIS, NEPHROTIC SYNDROME AND NEPHROSIS 430 445 3.488372
## 10 ASSAULT (HOMICIDE) 503 521 3.578529
## 11 BENIGN AND UNCERTAIN NEOPLASMS 242 263 8.677686
The last data set I chose was the “Generator Capacity Prices”, provided by Daniel Smilowitz. Because of the double header rows, I read in the whole .csv with the header parameter set to FALSE, then captured the first two rows and pasted them together, renamed the first column header, and used the result to rename the headers of the generator data. After doing this, re-arranging and tidying the data became much easier.
# open file
path2 <- ("https://raw.githubusercontent.com/Logan213/DATA607_Project2/master/UCAP.csv")
con2 <- file(path2, open="r")
generate <- read.csv(con2, header = FALSE, stringsAsFactors = FALSE)
# close file
close(con2)
# build a single header by pasting the two header rows together
row1 <- generate[1, ]
row2 <- generate[2, ]
gen_header <- paste(row1, row2, sep = " ")
gen_header[1] <- "Date" # the first column holds the date
# drop the two header rows and apply the combined names
generate <- slice(generate, c(-1, -2))
names(generate) <- gen_header
generate <- generate %>%
  gather("Type", "Amount", 2:13) %>%       # wide to long
  separate(Date, c("Month", "Year")) %>%   # split dates like "Nov 03"
  separate(Type, c("Auction", "Location")) # split names like "Monthly NYC"
# strip dollar signs and convert the prices to numeric
generate$Amount <- as.numeric(sub("\\$", "", generate$Amount))
## Warning: NAs introduced by coercion
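The warning is expected: blank price entries become NA during coercion. As a quick sanity check (my addition, not part of the original analysis), one could count the values that failed to parse:
# number of prices that failed numeric conversion (sketch)
sum(is.na(generate$Amount))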
head(generate, 5)
## Month Year Auction Location Amount
## 1 Nov 03 Monthly NYC 6.67
## 2 Dec 03 Monthly NYC 6.67
## 3 Jan 04 Monthly NYC 6.67
## 4 Feb 04 Monthly NYC 6.95
## 5 Mar 04 Monthly NYC 6.25
It was a little more difficult to arrange the data to answer these questions, but I believe I retrieved the necessary results.
For this question, I used an average across all years provided. Using a combination of group_by and summarise, I was able to quickly filter and arrange the generator data to display the average price by month and location. Then, using top_n, I selected the month with the highest average price for each region.
high_price_mo <- generate %>%
  filter(!is.na(Amount)) %>%        # drop the unparsed/missing prices
  group_by(Location, Month) %>%
  summarise(avg = mean(Amount)) %>% # average price per month, per region
  top_n(1, avg)                     # month with the highest average, per region
Below is the summarized data - looks like September is a very expensive month across the board for all regions. Of course, the NYC region is the most expensive by far.
high_price_mo
## Source: local data frame [4 x 3]
## Groups: Location [4]
##
## Location Month avg
## (chr) (chr) (dbl)
## 1 LHV Sep 3.476667
## 2 LI Sep 4.790278
## 3 NYC Sep 11.892500
## 4 ROS Sep 2.580833
To find the average price difference between the NYC and “rest of state” regions, I first filtered the data to remove the missing values, and then selected only the regions relevant to the question. Using a combination of the group_by and spread functions, I arranged the data so that I could add a calculated column showing the difference in the price paid to generators in the two regions. Using tally, I summed the differences within each month and year, and then used summarise to come up with the average difference for each month.
nyc_vs_ros <- generate %>%
  filter(!is.na(Amount)) %>% # drop the unparsed/missing prices
  filter(Location == "NYC" | Location == "ROS") %>%
  group_by(Month, Year) %>%
  spread(Location, Amount) %>%       # one price column per region
  mutate(Price_diff = NYC - ROS) %>% # NYC premium over rest of state
  tally(Price_diff) %>%              # sum of differences per Month/Year
  summarise(Avg_dif = mean(n)) %>%   # average difference per month, across years
  arrange(desc(Avg_dif))
We can see the results in the data below; surprisingly, the coldest months of the year have the lowest differential in price:
nyc_vs_ros
## Source: local data frame [12 x 2]
##
## Month Avg_dif
## (chr) (dbl)
## 1 May 29.32833
## 2 Jun 28.77750
## 3 Oct 28.36250
## 4 Aug 28.12667
## 5 Jul 28.11750
## 6 Sep 27.93500
## 7 Mar 13.21692
## 8 Apr 12.64231
## 9 Nov 12.52538
## 10 Feb 12.23000
## 11 Dec 12.22923
## 12 Jan 11.50846
To find the answer to this question, I grouped the data by Year and Location, and then used summarise to “compress” the months of each year into an average price for that year. I then spread this data across four columns, and used mutate to add a calculated column that adds up the four regions and divides by 4 to get the average. I had to use this method because using mean again was causing some issues, most likely because mean() takes a single vector, so a call like mean(LHV, LI, NYC, ROS) would not average across the four columns.
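If mean were wanted here after all, a rowwise construction should work; a sketch, assuming dplyr's rowwise():
# hypothetical alternative to the (LHV + LI + NYC + ROS) / 4 approach:
# ... %>% rowwise() %>% mutate(ALL_Regions = mean(c(LHV, LI, NYC, ROS)))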
high_by_region <- generate %>%
  filter(!is.na(Amount)) %>%              # drop the unparsed/missing prices
  group_by(Year, Location) %>%
  summarise(Avg_Price = mean(Amount)) %>% # average price per year, per region
  spread(Location, Avg_Price) %>%         # one column per region
  mutate(ALL_Regions = (LHV + LI + NYC + ROS) / 4) # average across the regions
Below we can see the average price paid to generators across each of the four regions, with the ALL_Regions column added. This column is an average of the four regions, and we can see that 2014 had the highest average price paid to generators.
high_by_region
## Source: local data frame [14 x 6]
## Groups: Year [14]
##
## Year LHV LI NYC ROS ALL_Regions
## (chr) (dbl) (dbl) (dbl) (dbl) (dbl)
## 1 03 1.4650000 4.9766667 6.733333 1.4650000 3.660000
## 2 04 1.2280556 7.0438889 9.044444 1.2280556 4.636111
## 3 05 0.8172222 6.8991667 9.228889 0.8172222 4.440625
## 4 06 1.8297222 5.4408333 9.251667 1.8297222 4.587986
## 5 07 2.4558333 4.6327273 9.098056 2.4558333 4.660612
## 6 08 2.2494444 2.6559375 4.825556 2.2494444 2.995095
## 7 09 2.3513889 2.4313889 4.934167 2.3513889 3.017083
## 8 10 1.6077778 1.6736111 9.055278 1.6077778 3.486111
## 9 11 0.3433333 0.3438889 7.096111 0.3433333 2.031667
## 10 12 1.1208333 1.4747222 7.958056 1.1208333 2.918611
## 11 13 3.6122222 4.7125000 10.846389 3.6122222 5.695833
## 12 14 7.6408333 4.9508333 13.245000 4.2941667 7.532708
## 13 15 6.5630556 4.0811111 11.573889 2.7991667 6.254306
## 14 16 3.4170000 1.5960000 6.256000 1.1960000 3.116250