Tidy data and variable types

Setting up the environment: Normally hidden, but displayed here.

#Load library to import from Excel
library(xlsx)

Untidy Data

A Hypothetica dataset was created using NBA players and thier fictinal Points Per Game (PPG) averages over the past four seasons. Note: you can tell it’s fictional because NBA players rarely play for the same team four years in a row. Also note: one of the players, Karl Anthony-Townes has only been in the NBA for 2 years.

#Import Untidy data table
UntidyNBA <- read.xlsx("Untidy_tidy.xlsx", sheetName = "Untidy" )
#Print it out
options(width = 100)
UntidyNBA
                  Player          Team X2017.PPG X2016.PPG X2015.PPG X2014.PPG
1      Russell Westbrook Oklahoma City      31.6      30.2      31.2      30.0
2           James Harden       Houston      29.1      28.0      29.0      29.7
3           LeBron James     Cleveland      26.4      27.7      28.7      28.0
4          Anthony Davis   New Orleans      28.0      26.0      27.0      27.7
5           Kevin Durant  Goldes State      25.1      25.7      26.7      27.0
6       Demarcus Cousins   New Orleans      27.0      25.0      26.0      23.0
7     Karl-Anthony Towns     Minnesota      25.1      24.0        NA        NA
8  Giannis Antetokuonmpo     Milwaukee      27.0      22.0      23.0      25.3
9              John Wall    Washington      25.1      23.3      24.3      22.1
10         Osaiah Thomas     Cleveland      22.9      20.1      21.1      22.0

Tidy up the Data

The same data, but in tidy format. Each row is an observation and each column is a variable.

#Import Tidy version of the same data table
TidyNBA <- read.xlsx("Untidy_tidy.xlsx", sheetName = "Tidy" )
#Print it out
TidyNBA
                  Player          Team Year  PPG
1          Anthony Davis   New Orleans 2017 28.0
2          Anthony Davis   New Orleans 2016 26.0
3          Anthony Davis   New Orleans 2015 27.0
4          Anthony Davis   New Orleans 2014 27.7
5       Demarcus Cousins   New Orleans 2017 27.0
6       Demarcus Cousins   New Orleans 2016 25.0
7       Demarcus Cousins   New Orleans 2015 26.0
8       Demarcus Cousins   New Orleans 2014 23.0
9  Giannis Antetokuonmpo     Milwaukee 2017 27.0
10 Giannis Antetokuonmpo     Milwaukee 2016 22.0
11 Giannis Antetokuonmpo     Milwaukee 2015 23.0
12 Giannis Antetokuonmpo     Milwaukee 2014 25.3
13          James Harden       Houston 2017 29.1
14          James Harden       Houston 2016 28.0
15          James Harden       Houston 2015 29.0
16          James Harden       Houston 2014 29.7
17             John Wall    Washington 2017 25.1
18             John Wall    Washington 2016 23.3
19             John Wall    Washington 2015 24.3
20             John Wall    Washington 2014 22.1
21    Karl-Anthony Towns     Minnesota 2017 25.1
22    Karl-Anthony Towns     Minnesota 2016 24.0
23          Kevin Durant  Goldes State 2017 25.1
24          Kevin Durant  Goldes State 2016 25.7
25          Kevin Durant  Goldes State 2015 26.7
26          Kevin Durant  Goldes State 2014 27.0
27          LeBron James     Cleveland 2017 26.4
28          LeBron James     Cleveland 2016 27.7
29          LeBron James     Cleveland 2015 28.7
30          LeBron James     Cleveland 2014 28.0
31         Osaiah Thomas     Cleveland 2017 22.9
32         Osaiah Thomas     Cleveland 2016 20.1
33         Osaiah Thomas     Cleveland 2015 21.1
34         Osaiah Thomas     Cleveland 2014 22.0
35     Russell Westbrook Oklahoma City 2017 31.6
36     Russell Westbrook Oklahoma City 2016 30.2
37     Russell Westbrook Oklahoma City 2015 31.2
38     Russell Westbrook Oklahoma City 2014 30.0
  • Varibles:
    • Player(Nominal): The name of the NBA Player
    • Team(Nominal): The name of the NBA Team
    • Year(Interval): The year of the NBA Season the player scored the points
    • PPG or Points Per Game(ratio): The number of points the player scored in a season divided by the total number of games played. Normally, we would secure those two variable (total points and games played) to further break down the data, but for this case, PPG should work.

Importing Tidy data and identifying variable types

First, import the sheets from Excel

#Import three tidy data tables
tbl_dailytemp <- read.xlsx("1_dailytemp.xlsx", sheetIndex = 1)
tbl_suicide <- read.xlsx("2_suicide.xls", sheetIndex = 1)
tbl_BreastCxSurv <- read.xlsx("3_BrCxSurvival.xlsx", sheetIndex = 1)

Next, print out each dataset:

Daily High Temperature for 10 days in Hershey, PA.

The data was collected from the archives of the Weather Channel (www.weather.com). The high temperature was manually entered into an Excel table and imported into R.

  • The Variables:
    • date (interval): The day the high temperature was recorded
    • hightest_temp (interval): The high in farenheit as recorded on weather.com for the Hershey, PA area.
#Print Daily Temp
tbl_dailytemp
         date hightest_temp_f
1  2017-09-18              82
2  2017-09-17              80
3  2017-09-16              76
4  2017-09-15              77
5  2017-09-14              72
6  2017-09-13              75
7  2017-09-12              76
8  2017-09-11              77
9  2017-09-10              75
10 2017-09-09              76

Total number of suicides in the US over 10 years.

The data was gathered from the National Center for Health Statistics (NCHS) (https://www.cdc.gov/nchs/index.htm). The table of violent deaths, which includes suicides, was imported into Excel from .csv format. The data was filtered to only include suicides at a summary level for a 10 year period. The summary information was imported into R.

  • The Variables
    • year (Interval): The year the event reportedly occurred.
    • suicide (ratio): The count of the number of reported suicides
#Print Suicide Rate
tbl_suicide
   year suicides
1  2006    33300
2  2007    34598
3  2008    36035
4  2009    36909
5  2010    38364
6  2011    39518
7  2012    40600
8  2013    41149
9  2014    42826
10 2015    44193

Five-year Breast Cancer survival rates (in 6 month increments) from Cancer Treatment Centers of America.

A graph from the Cancer Treatment Centers of America was keyed into Excel (http://www.cancercenter.com/breast-cancer/statistics/)

  • The Variables
    • Years(Interval): The amount of time from initial diagnosis of metastatic breast cancer, in six month intervals.
    • CTCA_BreastCancer_Survivial(ratio): The ratio of the number of survivors at the end of the period divided by the total number of patients diagnosed. For analytical purposes, I would prefer to have the actual variables ((number of survivoros at the end of the period), and (total number diagnosed in that same cohort)).
#Load pander to better control tables and rename column headers. Need to upgrade rStudio to latest version for this to work!
library(pander)
#Format Column headers and print breast cancer survival rates
colnames(tbl_BreastCxSurv) <- c("Years from Dx", "Survival Percent") 
tbl_BreastCxSurv
   Years from Dx Survival Percent
1            0.5             0.95
2            1.0             0.85
3            1.5             0.76
4            2.0             0.69
5            2.5             0.61
6            3.0             0.54
7            3.5             0.47
8            4.0             0.43
9            4.5             0.36
10           5.0             0.33