project2-–Tidyr and deployr

Assignment:-
The goal of this assignment is to give you practice in preparing different datasets for downstream analysis work.:

Choose any three of the “wide” datasets identified in the Week 6 Discussion items. (You may use your own dataset; please don’t use my Sample Post dataset, since that was used in your Week 6 assignment!) For each of the three chosen datasets:
—–Create a .CSV file (or optionally, a MySQL database!) that includes all of the information included in the dataset. You’re encouraged to use a “wide” structure similar to how the information appears in the discussion item, so that you can practice tidying and transformations as described below.
—-Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data. [Most of your grade will be based on this step!]
—-Perform the analysis requested in the discussion item.
—-Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions.
Please include in your homework submission, for each of the three chosen datasets:
The URL to the .Rmd file in your GitHub repository, and  The URL for your rpubs.com web page.

Dataset1:- 1984 United States Congressional Voting Records

library(stringr)
library(tidyr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(knitr)
library(stringr)
library(ggplot2)
votedata <- read.csv("/Users/ashishsm1986/git/Cuny-Assignments/week6-assignment/house-votes-data.txt", header = F, stringsAsFactors = F)
#Add names to the columns
vd_names <- readLines("/Users/ashishsm1986/git/Cuny-Assignments/week6-assignment/house-votes-names.txt")[33:49]
vd_names <- str_extract(vd_names,"[[:alpha:]-]{2,40}")
vd_names[1] <- "party"
names(votedata) <- vd_names

kable(head(votedata)) #Making sure data is read successfully

party	handicapped-infants	water-project-cost-sharing	adoption-of-the-budget-resolution	physician-fee-freeze	el-salvador-aid	religious-groups-in-schools	anti-satellite-test-ban	aid-to-nicaraguan-contras	mx-missile	immigration	synfuels-corporation-cutback	education-spending	superfund-right-to-sue	crime	duty-free-exports	export-administration-act-south-africa
republican	n	y	n	y	y	y	n	n	n	y	?	y	y	y	n	y
republican	n	y	n	y	y	y	n	n	n	n	n	y	y	y	n	?
democrat	?	y	y	?	y	y	n	n	n	n	y	n	y	y	n	n
democrat	n	y	y	n	?	y	n	n	n	n	y	n	y	n	n	y
democrat	y	y	y	n	y	y	n	n	n	n	y	?	y	y	y	y
democrat	n	y	y	n	y	y	n	n	n	n	n	n	y	y	y	y

Tidy the data, Get most number of votes for different agendas

agreement <- votedata %>% 
  gather("issue", "vote", 2:17) %>%
  filter(vote %in% c("y", "n")) %>% 
  group_by(party, issue, vote) %>% 
  summarise(vote_count = n()) %>% 
  spread(vote, vote_count) %>% 
  rename(no_votes = n, yes_votes = y) %>% 
  mutate(total_votes = no_votes + yes_votes, 
         agreement_rate = pmax(no_votes, yes_votes) / total_votes, 
         'got_affirmative_votes?' = ifelse(yes_votes > no_votes, "true", "false")) %>% 
  select(party, issue, agreement_rate, 'got_affirmative_votes?') 


kable(head(agreement)) #Sampling the data

party	issue	agreement_rate	got_affirmative_votes?
democrat	adoption-of-the-budget-resolution	0.8884615	true
democrat	aid-to-nicaraguan-contras	0.8288973	true
democrat	anti-satellite-test-ban	0.7722008	true
democrat	crime	0.6498054	false
democrat	duty-free-exports	0.6374502	true
democrat	education-spending	0.8554217	false

My Analysis:- Overall, On what issue did both parties voted unified

agreement %>% 
  arrange(desc(agreement_rate)) %>% 
  head(1) %>% 
  kable()

party	issue	agreement_rate	got_affirmative_votes?
republican	physician-fee-freeze	0.9878788	true

My Analysis:- Overall, On what issue did both parties voted least unified

agreement %>% 
  arrange(agreement_rate) %>% 
  head(1) %>% 
  kable()

party	issue	agreement_rate	got_affirmative_votes?
democrat	water-project-cost-sharing	0.5020921	true

Dataset2:- Country Development Indicators.

The data was taken from http://databank.worldbank.org/data/reports.aspx?source=2&series=SI.POV.GINI&country=#
GINI coefficient indicates inequality in wealth distribution, with 0 being complete equality. The bigger the GINI index the worst the wealth distribution.

#import data from csv
gini_data <- read.csv("/Users/ashishsm1986/git/Cuny-Assignments/week6-assignment/GINI_Data.csv", header = TRUE, stringsAsFactors = F)
head(gini_data) #make sure the data is read

##                        Series.Name Series.Code   Country.Name Country.Code
## 1 GINI index (World Bank estimate) SI.POV.GINI    Afghanistan          AFG
## 2 GINI index (World Bank estimate) SI.POV.GINI        Albania          ALB
## 3 GINI index (World Bank estimate) SI.POV.GINI        Algeria          DZA
## 4 GINI index (World Bank estimate) SI.POV.GINI American Samoa          ASM
## 5 GINI index (World Bank estimate) SI.POV.GINI        Andorra          AND
## 6 GINI index (World Bank estimate) SI.POV.GINI         Angola          AGO
##   X1990..YR1990. X2000..YR2000. X2008..YR2008. X2009..YR2009.
## 1             ..             ..             ..             ..
## 2             ..             ..             30             ..
## 3             ..             ..             ..             ..
## 4             ..             ..             ..             ..
## 5             ..             ..             ..             ..
## 6             ..             ..           42.7             ..
##   X2010..YR2010. X2011..YR2011. X2012..YR2012. X2013..YR2013.
## 1             ..             ..             ..             ..
## 2             ..             ..             29             ..
## 3             ..           27.6             ..             ..
## 4             ..             ..             ..             ..
## 5             ..             ..             ..             ..
## 6             ..             ..             ..             ..
##   X2014..YR2014. X2015..YR2015. X2016..YR2016. X2017..YR2017.
## 1             ..             ..             ..             ..
## 2             ..             ..             ..             ..
## 3             ..             ..             ..             ..
## 4             ..             ..             ..             ..
## 5             ..             ..             ..             ..
## 6             ..             ..             ..             ..

#Data is clearly untidy
#We will cleanup the year column names so we can read and make sense out of it.
colnames(gini_data) <- c("Series.Name", "Series.Code", "Country.Name", "Country.Code", 1990, 2000, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017)
head(gini_data)

##                        Series.Name Series.Code   Country.Name Country.Code
## 1 GINI index (World Bank estimate) SI.POV.GINI    Afghanistan          AFG
## 2 GINI index (World Bank estimate) SI.POV.GINI        Albania          ALB
## 3 GINI index (World Bank estimate) SI.POV.GINI        Algeria          DZA
## 4 GINI index (World Bank estimate) SI.POV.GINI American Samoa          ASM
## 5 GINI index (World Bank estimate) SI.POV.GINI        Andorra          AND
## 6 GINI index (World Bank estimate) SI.POV.GINI         Angola          AGO
##   1990 2000 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017
## 1   ..   ..   ..   ..   ..   ..   ..   ..   ..   ..   ..   ..
## 2   ..   ..   30   ..   ..   ..   29   ..   ..   ..   ..   ..
## 3   ..   ..   ..   ..   .. 27.6   ..   ..   ..   ..   ..   ..
## 4   ..   ..   ..   ..   ..   ..   ..   ..   ..   ..   ..   ..
## 5   ..   ..   ..   ..   ..   ..   ..   ..   ..   ..   ..   ..
## 6   ..   .. 42.7   ..   ..   ..   ..   ..   ..   ..   ..   ..

#Replace the ".." data values with NA
gini_data[gini_data == '..'] <- NA
head(gini_data) #Much cleaner now!

##                        Series.Name Series.Code   Country.Name Country.Code
## 1 GINI index (World Bank estimate) SI.POV.GINI    Afghanistan          AFG
## 2 GINI index (World Bank estimate) SI.POV.GINI        Albania          ALB
## 3 GINI index (World Bank estimate) SI.POV.GINI        Algeria          DZA
## 4 GINI index (World Bank estimate) SI.POV.GINI American Samoa          ASM
## 5 GINI index (World Bank estimate) SI.POV.GINI        Andorra          AND
## 6 GINI index (World Bank estimate) SI.POV.GINI         Angola          AGO
##   1990 2000 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017
## 1 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 2 <NA> <NA>   30 <NA> <NA> <NA>   29 <NA> <NA> <NA> <NA> <NA>
## 3 <NA> <NA> <NA> <NA> <NA> 27.6 <NA> <NA> <NA> <NA> <NA> <NA>
## 4 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 5 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 6 <NA> <NA> 42.7 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>

# Removing Series.Name Series.Code
GINI <- gini_data %>% gather(Year, GINI_Index, c(5:16), na.rm = TRUE) %>% group_by(Country.Name) %>% select(-c(Series.Name, Series.Code))
head(GINI)

## # A tibble: 6 x 4
## # Groups:   Country.Name [6]
##   Country.Name     Country.Code Year  GINI_Index
##   <chr>            <chr>        <chr> <chr>     
## 1 Bolivia          BOL          1990  42        
## 2 Brazil           BRA          1990  60.5      
## 3 Chile            CHL          1990  57.3      
## 4 Costa Rica       CRI          1990  45.3      
## 5 Egypt, Arab Rep. EGY          1990  32        
## 6 Honduras         HND          1990  57.4

# Converting GINI Index to Numeric
GINI <- transform(GINI, GINI_Index = as.numeric(GINI_Index)) 
head(GINI)

##        Country.Name Country.Code Year GINI_Index
## 24          Bolivia          BOL 1990       42.0
## 27           Brazil          BRA 1990       60.5
## 41            Chile          CHL 1990       57.3
## 47       Costa Rica          CRI 1990       45.3
## 59 Egypt, Arab Rep.          EGY 1990       32.0
## 85         Honduras          HND 1990       57.4

#An analysis that could be performed is the trend in the GINI coefficient for selected countries (Giants of the world)

GINI_countries <- GINI %>% filter(Country.Name %in% c('United States', 'India', 'United Kingdom', 'Russian Federation','China'))

#The data doesnt seem to have data for india for more than 1 year. Sorry, Also there is no data for year 2016 and 2017.

#Lets plot the selected countries pattern over the Years.
ggplot(GINI_countries, aes(x = Year, y = GINI_Index)) + geom_jitter(width = 0.5, height = 0.5, aes(color = as.factor(Country.Name))) + geom_line(aes(group = Country.Name), lty = 2, color = "blue") + labs(title = "Subset of Countries: GINI Index Plots", x = "Year", y = "GINI Index")

#Other analysis that could be performed is which countries seem to have the highest GINI coeficient. 
#That would mean the worst equality on walth distribution.
#Lets calculate average GINI coeficient for each country
GINI_avg_per_country <- GINI %>% group_by(Country.Name) %>% summarise(AVG_GINI = mean(GINI_Index)) %>% arrange(desc(AVG_GINI))
kable(head(GINI_avg_per_country))

Country.Name	AVG_GINI
South Africa	61.40000
Namibia	61.00000
Botswana	60.50000
Zambia	56.35000
Central African Republic	56.20000
Colombia	54.65556

From the above analysis we can see the top 5 countries which have the worst GINI coeficient are African countries..

Dataset3:- Game Data.

The data was posed by Valerie Briot. Thre are total 12 Rows in the dataset. The First 10 rows are Free Apps and the last 2 rows are paid apps

Game_data <- read.csv("/Users/ashishsm1986/git/Cuny-Assignments/week6-assignment/Game-Data1.csv", header = TRUE, stringsAsFactors = F, sep = ',', skip = 2)
kable (head(Game_data))

App.Name	Type	Paid.Free	Release.Date	X2010	X2011	X2012	X2013	X2014	X2015	Release.Date.1	X2010.1	X2011.1	X2012.1	X2013.1	X2014.1	X2015.1
Candy Crush Saga	Game	Free	4/12/12	NA	NA	8	56	60	76	11/4/12	NA	NA	2	53	64	72
Fruit Ninja	Game	Free	4/21/10	4	8	58	102	126	148	7/10/10	1	9	64	108	132	165
Angry Birds	Game	Free	12/11/09	10	124	320	547	648	627	11/19/10	2	108	312	538	647	656
Subway Surfers	Game	Free	5/24/12	NA	NA	23	123	202	303	5/24/12	NA	NA	26	128	236	329
Despicable Me: Minion Rush	Game	Free	6/10/13	NA	NA	NA	16	58	128	6/10/13	NA	NA	NA	18	64	294
Clash of Clans	Game	Free	8/2/12	NA	NA	24	123	234	345	10/7/13	NA	NA	NA	4	143	256

# Google Play Apps
google_play <- subset(Game_data, select=c(App.Name, Type, Paid.Free, Release.Date.1, X2010.1, X2011.1, X2012.1, X2013.1, X2014.1, X2015.1))
kable(head(google_play))

App.Name	Type	Paid.Free	Release.Date.1	X2010.1	X2011.1	X2012.1	X2013.1	X2014.1	X2015.1
Candy Crush Saga	Game	Free	11/4/12	NA	NA	2	53	64	72
Fruit Ninja	Game	Free	7/10/10	1	9	64	108	132	165
Angry Birds	Game	Free	11/19/10	2	108	312	538	647	656
Subway Surfers	Game	Free	5/24/12	NA	NA	26	128	236	329
Despicable Me: Minion Rush	Game	Free	6/10/13	NA	NA	NA	18	64	294
Clash of Clans	Game	Free	10/7/13	NA	NA	NA	4	143	256

# Apple Store Apps
apple_store <- subset(Game_data, select=c(App.Name, Type, Paid.Free, Release.Date.1, X2010, X2011, X2012, X2013, X2014, X2015))
kable(head(apple_store))

App.Name	Type	Paid.Free	Release.Date.1	X2010	X2011	X2012	X2013	X2014	X2015
Candy Crush Saga	Game	Free	11/4/12	NA	NA	8	56	60	76
Fruit Ninja	Game	Free	7/10/10	4	8	58	102	126	148
Angry Birds	Game	Free	11/19/10	10	124	320	547	648	627
Subway Surfers	Game	Free	5/24/12	NA	NA	23	123	202	303
Despicable Me: Minion Rush	Game	Free	6/10/13	NA	NA	NA	16	58	128
Clash of Clans	Game	Free	10/7/13	NA	NA	24	123	234	345

#Replacing the NA’s to zero (0)
google_play <- google_play %>% mutate( X2010.1 = ifelse(is.na(X2010.1),0,X2010.1))
google_play <- google_play %>% mutate( X2011.1 = ifelse(is.na(X2011.1),0,X2011.1))
google_play <- google_play %>% mutate( X2012.1 = ifelse(is.na(X2012.1),0,X2012.1))
google_play <- google_play %>% mutate( X2013.1 = ifelse(is.na(X2013.1),0,X2013.1))
google_play <- google_play %>% mutate( X2014.1 = ifelse(is.na(X2014.1),0,X2014.1))
google_play <- google_play %>% mutate( X2015.1 = ifelse(is.na(X2015.1),0,X2015.1))
kable(head(google_play))

App.Name	Type	Paid.Free	Release.Date.1	X2010.1	X2011.1	X2012.1	X2013.1	X2014.1	X2015.1
Candy Crush Saga	Game	Free	11/4/12	0	0	2	53	64	72
Fruit Ninja	Game	Free	7/10/10	1	9	64	108	132	165
Angry Birds	Game	Free	11/19/10	2	108	312	538	647	656
Subway Surfers	Game	Free	5/24/12	0	0	26	128	236	329
Despicable Me: Minion Rush	Game	Free	6/10/13	0	0	0	18	64	294
Clash of Clans	Game	Free	10/7/13	0	0	0	4	143	256

apple_store <- apple_store %>% mutate( X2010 = ifelse(is.na(X2010),0,X2010))
apple_store <- apple_store %>% mutate( X2011 = ifelse(is.na(X2011),0,X2011))
apple_store <- apple_store %>% mutate( X2012 = ifelse(is.na(X2012),0,X2012))
apple_store <- apple_store %>% mutate( X2013 = ifelse(is.na(X2013),0,X2013))
apple_store <- apple_store %>% mutate( X2014 = ifelse(is.na(X2014),0,X2014))
apple_store <- apple_store %>% mutate( X2015 = ifelse(is.na(X2015),0,X2015))
kable(head(apple_store))

App.Name	Type	Paid.Free	Release.Date.1	X2010	X2011	X2012	X2013	X2014	X2015
Candy Crush Saga	Game	Free	11/4/12	0	0	8	56	60	76
Fruit Ninja	Game	Free	7/10/10	4	8	58	102	126	148
Angry Birds	Game	Free	11/19/10	10	124	320	547	648	627
Subway Surfers	Game	Free	5/24/12	0	0	23	123	202	303
Despicable Me: Minion Rush	Game	Free	6/10/13	0	0	0	16	58	128
Clash of Clans	Game	Free	10/7/13	0	0	24	123	234	345

Analysis

Average downloads since the release date for Google Play and Apple Store

kable(google_play %>% summarise_at(.vars = c("X2010.1", "X2011.1", "X2012.1", "X2013.1", "X2014.1", "X2015.1"), .funs = mean))

X2010.1	X2011.1	X2012.1	X2013.1	X2014.1	X2015.1
12.41667	33.83333	81.91667	178.6667	251.75	322.9167

kable(apple_store %>% summarise_at(.vars = c("X2010","X2011","X2012", "X2013","X2014", "X2015"), .funs = mean))

X2010	X2011	X2012	X2013	X2014	X2015
13.08333	34.5	79.91667	179.8333	243.9167	291.9167

Average Paid App Download since the release date for Google Play

kable((google_play %>% filter(Paid.Free == 'Paid') %>% summarise_at(.vars = c("X2010.1","X2011.1","X2012.1", "X2013.1","X2014.1", "X2015.1"), .funs = mean)))

X2010.1	X2011.1	X2012.1	X2013.1	X2014.1	X2015.1
0	4	77	142.5	184.5	348.5

Average Free App Download since the release date for Google Play

kable((google_play %>% filter(Paid.Free == 'Free') %>% summarise_at(.vars = c("X2010.1","X2011.1","X2012.1", "X2013.1","X2014.1", "X2015.1"), .funs = mean)))

X2010.1	X2011.1	X2012.1	X2013.1	X2014.1	X2015.1
14.9	39.8	82.9	185.9	265.2	317.8

Average Paid App Download since the release date for Apple Store

kable((apple_store %>% filter(Paid.Free == 'Paid') %>% summarise_at(.vars = c("X2010","X2011","X2012", "X2013","X2014", "X2015"), .funs = mean)))

X2010	X2011	X2012	X2013	X2014	X2015
0	2	60	120	160	292.5

Average Free App Download since the release date for Apple Store

kable((apple_store %>% filter(Paid.Free == 'Free') %>% summarise_at(.vars = c("X2010","X2011","X2012", "X2013","X2014", "X2015"), .funs = mean)))

X2010	X2011	X2012	X2013	X2014	X2015
15.7	41	83.9	191.8	260.7	291.8

project2-–Tidyr and deployr

Ashish Kumar

3/10/2018