Data Transformation

Below three of the “wide” datasets identified in the Week 6 Discussion items have been used for this exercise.

Set 1 - Bank stocks from 2007 (Discussion Thread by Jeff Shamp)
Set 2 - UNICEF dataset on Under 5 Mortality (Discussion Thread by Samuel Bellows)
Set 3 - Hospital Consumer Assessment of Healthcare Providers and Systems response data by state (Discussion Thread by Thomas Hill)

We will practice tidying and transformations on these data sets.and We will performing few analysis points as discussed/requested in the discussion threads.

Bank Stocks

Reading the CSV file from GIT repository and loading into dataframe:

theUrl <- "https://raw.githubusercontent.com/kamathvk1982/Data607-MajorAssignment-Project2/master/banks.csv"
banks.full.df <- read.csv(file = theUrl, header = T , sep = ',', na.strings=c("NA","NaN", "") )

# Creating new data frame with reduced columns for current analysis:
banks.df <-  banks.full.df %>%
  select(c(date=Bank.Ticker, date.for.split=Bank.Ticker, bac.close=BAC.3, bac.volume=BAC.4, c.close=C.3, c.volume=C.4, jpm.close=JPM.3, jpm.volume=JPM.4, gs.close=GS.3, gs.volume=GS.4)) %>%
  filter(date!= 'Stock Info' ,  date!= 'Date'  ) %>%
  separate(date.for.split,  c("date.year", "date.month")) %>%
  unite("date.year.month", date.year:date.month, sep='-')

kable(head(banks.df))

date	date.year.month	bac.close	bac.volume	c.close	c.volume	jpm.close	jpm.volume	gs.close	gs.volume
2006-01-03	2006-01	47.08	16296700	492.9	1537660	40.19	12839400	128.87	6188700
2006-01-04	2006-01	46.58	17757900	483.8	1871020	39.62	13491800	127.09	4862000
2006-01-05	2006-01	46.64	14970900	486.2	1143160	39.74	8109400	127.04	3717600
2006-01-06	2006-01	46.57	12599800	486.2	1370250	40.02	7966900	128.84	4319600
2006-01-09	2006-01	46.6	15620000	483.9	1680740	40.67	16575200	130.39	4723500
2006-01-10	2006-01	46.21	15634800	485.4	1365960	40.73	16614900	132.03	5539800

# Next, we will tidy the data by reshaping the data layput in the table by using tidyr->gather function:  
banks.tidy.df <- gather(banks.df, key = "key", value = "value", bac.close , bac.volume , c.close , c.volume , jpm.close , jpm.volume , gs.close , gs.volume )

banks.tidy.df$value  <- as.numeric(as.character(banks.tidy.df$value))

kable(head(banks.tidy.df))

date	date.year.month	key	value
2006-01-03	2006-01	bac.close	47.08
2006-01-04	2006-01	bac.close	46.58
2006-01-05	2006-01	bac.close	46.64
2006-01-06	2006-01	bac.close	46.57
2006-01-09	2006-01	bac.close	46.60
2006-01-10	2006-01	bac.close	46.21

Analysis 1

Piloting the closing balance and volume traded of each bank using point chart:

plot.price <- banks.tidy.df %>%
  filter(grepl('close', key) ) %>%
   group_by(key,date.year.month) %>%
     summarize(mean.price= mean(as.double(value)))  %>%
       ggplot(aes(x=date.year.month, y=mean.price, colour=key)) + 
  theme(axis.text.x = element_text(angle = 90, size = 2)) +
  geom_point() 

plot.volume <- banks.tidy.df %>%
  filter(grepl('volume', key) ) %>%
   group_by(key,date.year.month) %>%
     summarize(mean.volume= mean(as.double(value)))  %>%
       ggplot(aes(x=date.year.month, y=log(mean.volume), colour=key)) + 
  theme(axis.text.x = element_text(angle = 90, size = 2)) +
  geom_point()    

plot.price

plot.volume

Based on the above point chart we can see how the stock prices for banks have been impacted during recession.

Analysis 2

Comparing for Citi and JP Morgan; Getting the mean of the prices for the year 2008; the peak of the recession:

data.citi.2008 <- banks.tidy.df %>%
  filter(grepl('c.close', key) , grepl('2008', date.year.month)) %>%
   separate(date.year.month,  c("date.year", "date.month"))  

data.jpm.2008 <- banks.tidy.df %>%
  filter(grepl('jpm.close', key) , grepl('2008', date.year.month)) %>%
   separate(date.year.month,  c("date.year", "date.month"))  

#52 week data for Citi for 2008
summary(as.double(data.citi.2008$value))

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   11.25   32.91   44.59  110.62  196.97  296.90

boxplot(as.double(data.citi.2008$value), main="CITI 2008", col = "green")

#52 week data for JPM for 2008
summary(as.double(data.jpm.2008$value))

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   22.72   37.00   40.49   39.83   43.46   49.85

boxplot(as.double(data.jpm.2008$value), main="JPM 2008", col = "purple")

Based on the above box plot shapes we can say that Citi had a much bigger or higher price dip then JPM.

UNICEF

Reading the CSV file from GIT repository and loading into dataframe:

theUrl <- "https://raw.githubusercontent.com/kamathvk1982/Data607-MajorAssignment-Project2/master/unicef-u5mr.csv"
unicef.full.df <- read.csv(file = theUrl, header = T , sep = ',', na.strings=c("NA","NaN", "") )
dim((unicef.full.df))

## [1] 196  67

Data Transformation and Tidy using dplyr and tidyr:

# Next, we will tidy the data by reshaping the data layput in the table by using tidyr->gather function:
unicef.tidy.df <- gather(unicef.full.df, key = "Year", value = "Value", -CountryName) 
unicef.tidy.df$Value  <- as.numeric(as.character(unicef.tidy.df$Value))
unicef.tidy.df$CountryName  <- str_trim(as.character(unicef.tidy.df$CountryName))

# Use the tidyr->drop_na function to drop the row on column Status having NA value:
unicef.tidy.df <- drop_na(unicef.tidy.df, Value)    

# Use the sub  function to drop the 'U5MR.' from new column Year:
unicef.tidy.df$Year  <- sub('U5MR.','',unicef.tidy.df$Year)

kable(head(unicef.tidy.df))

	CountryName	Year	Value
9	Australia	1950	31.6
31	Canada	1950	48.7
48	Benin	1950	348.2
49	Denmark	1950	34.1
51	Dominican Republic	1950	156.0
58	Fiji	1950	135.7

Analysis 1

Comparing a Developed Nation United States of America , a Developing nation India and a Under Developed Nation Chad:

unicef.set1.df <-  unicef.tidy.df %>%
  filter(grepl('United States of America|India|Chad' , CountryName  )) %>%
  arrange(Year, CountryName)

kable(head(unicef.set1.df))

CountryName	Year	Value
United States of America	1950	37.7
United States of America	1951	36.6
United States of America	1952	35.6
India	1953	279.5
United States of America	1953	34.7
India	1954	274.5

unicef.set1.df %>%
       ggplot(aes(x=Year, y=Value, colour=CountryName)) + 
  theme(axis.text.x = element_text(angle = 90, size = 2)) +
  geom_point()

Based on the above Chart, we can see that the under 5 mortality is coming down for all three countries; but comparatively the counts are still high for Under Developed Countries and for Developing Countries .

Analysis 2

List of countries where the under 5 mortality as of 2015 is still greater then 85:

# Filtering for 2015 and greater then 85:
unicef.set2.df <-  unicef.tidy.df %>%
  filter( Year=='2015' ,  Value > 85 ) %>%
  arrange( desc(Value))


# Ploting on world map:
data(wrld_simpl)
myCountries = wrld_simpl@data$NAME %in% names(table(unicef.set2.df$CountryName))
plot(wrld_simpl, col = c(gray(.80), "red")[myCountries+1])

Based on the above World Map plotting we can see that most of these countries are in Continent Africa.

Hospital Consumer Assessment

Reading the CSV file from GIT repository and loading into dataframe:

theUrl <- "https://raw.githubusercontent.com/kamathvk1982/Data607-MajorAssignment-Project2/master/HCAHPS.csv"
hcahps.full.df <- read.csv(file = theUrl, header = T , sep = ',', na.strings=c("NA","NaN", "Not Available") )

hcahps.full.df$HCAHPS.Answer.Percent  <- as.numeric(as.character(hcahps.full.df$HCAHPS.Answer.Percent))

kable(head(hcahps.full.df))

State	HCAHPS.Question	HCAHPS.Measure.ID	HCAHPS.Answer.Description	HCAHPS.Answer.Percent	Footnote	Start.Date	End.Date
AK	Patients who reported that their nurses “Always” communicated well	H_COMP_1_A_P	Nurses “always” communicated well	79	NA	04/01/2018	03/31/2019
AK	Patients who reported that their nurses “Sometimes” or “Never” communicated well	H_COMP_1_SN_P	Nurses “sometimes” or “never” communicated well	4	NA	04/01/2018	03/31/2019
AK	Patients who reported that their nurses “Usually” communicated well	H_COMP_1_U_P	Nurses “usually” communicated well	17	NA	04/01/2018	03/31/2019
AK	Patients who reported that their nurses “Always” treated them with courtesy and respect	H_NURSE_RESPECT_A_P	Nurses “always” treated them with courtesy and respect	86	NA	04/01/2018	03/31/2019
AK	Patients who reported that their nurses “Sometimes” or “Never” treated them with courtesy and respect	H_NURSE_RESPECT_SN_P	Nurses “sometimes” or “never” treated them with courtesy and respect	2	NA	04/01/2018	03/31/2019
AK	Patients who reported that their nurses “Usually” treated them with courtesy and respect	H_NURSE_RESPECT_U_P	Nurses “usually” treated them with courtesy and respect	12	NA	04/01/2018	03/31/2019

Analysis 1

Measure RESPECT (treated patients with courtesy and respect) for Nurses and Doctors for NJ and near by States:

# we will create the required dataset using select, filter and separate function for data transformation:
respect.df <-  hcahps.full.df %>%
  select(c(State, Measure.ID = HCAHPS.Measure.ID,Answer.Percent=HCAHPS.Answer.Percent )) %>%
  filter(grepl('CT|NY|PA|NJ' , State    ) ,  grepl('RESPECT' , Measure.ID   ) ) %>%
  separate(Measure.ID,  c("Type", "Response"), sep = '_RESPECT_')


# Next, we will tidy the data by reshaping the data layput in the table by using tidyr->spread function:
respect.tidy.df <- spread(respect.df, key = Response, value = Answer.Percent  )
colnames(respect.tidy.df) <- c("State", "Type", "Always", "Sometimes.or.Never", "Usually")

kable(respect.tidy.df)

State	Type	Always	Sometimes.or.Never	Usually
CT	H_DOCTOR	84	3	13
CT	H_NURSE	85	4	11
NJ	H_DOCTOR	83	4	13
NJ	H_NURSE	84	4	12
NY	H_DOCTOR	84	4	12
NY	H_NURSE	84	4	12
PA	H_DOCTOR	87	3	10
PA	H_NURSE	88	2	10

respect.tidy.df %>% group_by(State, Type) %>% summarise (Positive.Ind = sum(as.integer(Always)+as.integer(Usually)))

If we treat “Usually” and “Always” as a positive indicator then we can say that in State of CT the Doctors did better and in the State of PA the Nurses did better; for NJ and NY state the Doctors and Nurses were voted equally.

Analysis 2

Measure RATING (a rating of 9 or 10 on a scale from 0 (lowest) to 10 (highest)) for HOSPITALS for NJ and near by States:

# we will create the required dataset using select, filter and separate function for data transformation:
rating.df <-  hcahps.full.df %>%
  select(c(State, Measure.ID = HCAHPS.Measure.ID,Answer.Percent=HCAHPS.Answer.Percent )) %>%
  filter(grepl('CT|NY|PA|NJ' , State    ) ,  grepl('RATING' , Measure.ID    ) ) %>%
  separate(Measure.ID,  c("Type", "Response"), sep = '_RATING_')

# Next, we will tidy the data by reshaping the data layput in the table by using tidyr->spread function:
rating.tidy.df <- spread(rating.df, key = Response, value = Answer.Percent  )
colnames(rating.tidy.df) <- c("State", "Type", "low.6.or.Lower", "medium.7.or.8", "high.9.or.10")

kable(rating.tidy.df)

State	Type	low.6.or.Lower	medium.7.or.8	high.9.or.10
CT	H_HSP	10	20	70
NJ	H_HSP	11	23	66
NY	H_HSP	10	24	66
PA	H_HSP	8	20	72

rating.tidy.df %>% group_by(State, Type) %>% summarise (Positive.Ind = sum(as.integer(medium.7.or.8)+as.integer(high.9.or.10)))

If we treat “rating of 7 or 8 [medium]” and “rating of 9 or 10 [high]” as a positive indicator then we can say that in State of PA the Hospitals did better then other states.

Analysis 3

Measure Doctors and Nurses across applicable questions in all States:

# we will create the required dataset using select, filter and separate function for data transformation.
# BY looking at the data values we can say that in column Measure.ID any pattern of DOCTOR or COMP_2 can be treated as DOCTOR data
# and any pattern of NURSE or COMP_1 can be treated as NURSE data:  

compare.doctor.df <-  hcahps.full.df %>%
  select(c(State, Measure.ID = HCAHPS.Measure.ID,Measure.ID2 = HCAHPS.Measure.ID, Answer.Percent=HCAHPS.Answer.Percent )) %>%
  filter(   grepl('DOCTOR|COMP_2' , Measure.ID  ), grepl('_A_P|_U_P' , Measure.ID2  ) ) 

compare.nurse.df <-  hcahps.full.df %>%
  select(c(State, Measure.ID = HCAHPS.Measure.ID,Measure.ID2 = HCAHPS.Measure.ID, Answer.Percent=HCAHPS.Answer.Percent )) %>%
  filter(   grepl('NURSE|COMP_1' , Measure.ID   ), grepl('_A_P|_U_P' , Measure.ID2  ) )

Next, we will calculate the sum of the scores:

## [1] "Total of Positive Response Score for Doctors is 19842"

## [1] "Total of Positive Response Score for Nurses is 19934"

If we treat “Usually [_U_P]” and “Always [_A_P]” as a positive indicator then we can say that the Nurses had a better result then Doctors across all states

Data607-MajorAssignment-Project2-Data Transformation

Vinayak Kamath

3/7/2020

Data Transformation

Bank Stocks

Analysis 1

Analysis 2

UNICEF

Analysis 1

Analysis 2

Hospital Consumer Assessment

Analysis 1

Analysis 2

Analysis 3