Assignment:-
The goal of this assignment is to give you practice in preparing different datasets for downstream analysis work.:

  1. Choose any three of the “wide” datasets identified in the Week 6 Discussion items. (You may use your own dataset; please don’t use my Sample Post dataset, since that was used in your Week 6 assignment!) For each of the three chosen datasets:
    —–Create a .CSV file (or optionally, a MySQL database!) that includes all of the information included in the dataset. You’re encouraged to use a “wide” structure similar to how the information appears in the discussion item, so that you can practice tidying and transformations as described below.
    —-Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data. [Most of your grade will be based on this step!]
    —-Perform the analysis requested in the discussion item.
    —-Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions.
  2. Please include in your homework submission, for each of the three chosen datasets:
    The URL to the .Rmd file in your GitHub repository, and  The URL for your rpubs.com web page.

Dataset1:- 1984 United States Congressional Voting Records

library(stringr)
library(tidyr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(knitr)
library(stringr)
library(ggplot2)
votedata <- read.csv("/Users/ashishsm1986/git/Cuny-Assignments/week6-assignment/house-votes-data.txt", header = F, stringsAsFactors = F)
#Add names to the columns
vd_names <- readLines("/Users/ashishsm1986/git/Cuny-Assignments/week6-assignment/house-votes-names.txt")[33:49]
vd_names <- str_extract(vd_names,"[[:alpha:]-]{2,40}")
vd_names[1] <- "party"
names(votedata) <- vd_names

kable(head(votedata)) #Making sure data is read successfully
party handicapped-infants water-project-cost-sharing adoption-of-the-budget-resolution physician-fee-freeze el-salvador-aid religious-groups-in-schools anti-satellite-test-ban aid-to-nicaraguan-contras mx-missile immigration synfuels-corporation-cutback education-spending superfund-right-to-sue crime duty-free-exports export-administration-act-south-africa
republican n y n y y y n n n y ? y y y n y
republican n y n y y y n n n n n y y y n ?
democrat ? y y ? y y n n n n y n y y n n
democrat n y y n ? y n n n n y n y n n y
democrat y y y n y y n n n n y ? y y y y
democrat n y y n y y n n n n n n y y y y

Tidy the data, Get most number of votes for different agendas

agreement <- votedata %>% 
  gather("issue", "vote", 2:17) %>%
  filter(vote %in% c("y", "n")) %>% 
  group_by(party, issue, vote) %>% 
  summarise(vote_count = n()) %>% 
  spread(vote, vote_count) %>% 
  rename(no_votes = n, yes_votes = y) %>% 
  mutate(total_votes = no_votes + yes_votes, 
         agreement_rate = pmax(no_votes, yes_votes) / total_votes, 
         'got_affirmative_votes?' = ifelse(yes_votes > no_votes, "true", "false")) %>% 
  select(party, issue, agreement_rate, 'got_affirmative_votes?') 


kable(head(agreement)) #Sampling the data
party issue agreement_rate got_affirmative_votes?
democrat adoption-of-the-budget-resolution 0.8884615 true
democrat aid-to-nicaraguan-contras 0.8288973 true
democrat anti-satellite-test-ban 0.7722008 true
democrat crime 0.6498054 false
democrat duty-free-exports 0.6374502 true
democrat education-spending 0.8554217 false

My Analysis:- Overall, On what issue did both parties voted unified

agreement %>% 
  arrange(desc(agreement_rate)) %>% 
  head(1) %>% 
  kable()
party issue agreement_rate got_affirmative_votes?
republican physician-fee-freeze 0.9878788 true

My Analysis:- Overall, On what issue did both parties voted least unified

agreement %>% 
  arrange(agreement_rate) %>% 
  head(1) %>% 
  kable()
party issue agreement_rate got_affirmative_votes?
democrat water-project-cost-sharing 0.5020921 true

Dataset2:- Country Development Indicators.

The data was taken from http://databank.worldbank.org/data/reports.aspx?source=2&series=SI.POV.GINI&country=#
GINI coefficient indicates inequality in wealth distribution, with 0 being complete equality. The bigger the GINI index the worst the wealth distribution.

#import data from csv
gini_data <- read.csv("/Users/ashishsm1986/git/Cuny-Assignments/week6-assignment/GINI_Data.csv", header = TRUE, stringsAsFactors = F)
head(gini_data) #make sure the data is read
##                        Series.Name Series.Code   Country.Name Country.Code
## 1 GINI index (World Bank estimate) SI.POV.GINI    Afghanistan          AFG
## 2 GINI index (World Bank estimate) SI.POV.GINI        Albania          ALB
## 3 GINI index (World Bank estimate) SI.POV.GINI        Algeria          DZA
## 4 GINI index (World Bank estimate) SI.POV.GINI American Samoa          ASM
## 5 GINI index (World Bank estimate) SI.POV.GINI        Andorra          AND
## 6 GINI index (World Bank estimate) SI.POV.GINI         Angola          AGO
##   X1990..YR1990. X2000..YR2000. X2008..YR2008. X2009..YR2009.
## 1             ..             ..             ..             ..
## 2             ..             ..             30             ..
## 3             ..             ..             ..             ..
## 4             ..             ..             ..             ..
## 5             ..             ..             ..             ..
## 6             ..             ..           42.7             ..
##   X2010..YR2010. X2011..YR2011. X2012..YR2012. X2013..YR2013.
## 1             ..             ..             ..             ..
## 2             ..             ..             29             ..
## 3             ..           27.6             ..             ..
## 4             ..             ..             ..             ..
## 5             ..             ..             ..             ..
## 6             ..             ..             ..             ..
##   X2014..YR2014. X2015..YR2015. X2016..YR2016. X2017..YR2017.
## 1             ..             ..             ..             ..
## 2             ..             ..             ..             ..
## 3             ..             ..             ..             ..
## 4             ..             ..             ..             ..
## 5             ..             ..             ..             ..
## 6             ..             ..             ..             ..
#Data is clearly untidy
#We will cleanup the year column names so we can read and make sense out of it.
colnames(gini_data) <- c("Series.Name", "Series.Code", "Country.Name", "Country.Code", 1990, 2000, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017)
head(gini_data)
##                        Series.Name Series.Code   Country.Name Country.Code
## 1 GINI index (World Bank estimate) SI.POV.GINI    Afghanistan          AFG
## 2 GINI index (World Bank estimate) SI.POV.GINI        Albania          ALB
## 3 GINI index (World Bank estimate) SI.POV.GINI        Algeria          DZA
## 4 GINI index (World Bank estimate) SI.POV.GINI American Samoa          ASM
## 5 GINI index (World Bank estimate) SI.POV.GINI        Andorra          AND
## 6 GINI index (World Bank estimate) SI.POV.GINI         Angola          AGO
##   1990 2000 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017
## 1   ..   ..   ..   ..   ..   ..   ..   ..   ..   ..   ..   ..
## 2   ..   ..   30   ..   ..   ..   29   ..   ..   ..   ..   ..
## 3   ..   ..   ..   ..   .. 27.6   ..   ..   ..   ..   ..   ..
## 4   ..   ..   ..   ..   ..   ..   ..   ..   ..   ..   ..   ..
## 5   ..   ..   ..   ..   ..   ..   ..   ..   ..   ..   ..   ..
## 6   ..   .. 42.7   ..   ..   ..   ..   ..   ..   ..   ..   ..
#Replace the ".." data values with NA
gini_data[gini_data == '..'] <- NA
head(gini_data) #Much cleaner now!
##                        Series.Name Series.Code   Country.Name Country.Code
## 1 GINI index (World Bank estimate) SI.POV.GINI    Afghanistan          AFG
## 2 GINI index (World Bank estimate) SI.POV.GINI        Albania          ALB
## 3 GINI index (World Bank estimate) SI.POV.GINI        Algeria          DZA
## 4 GINI index (World Bank estimate) SI.POV.GINI American Samoa          ASM
## 5 GINI index (World Bank estimate) SI.POV.GINI        Andorra          AND
## 6 GINI index (World Bank estimate) SI.POV.GINI         Angola          AGO
##   1990 2000 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017
## 1 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 2 <NA> <NA>   30 <NA> <NA> <NA>   29 <NA> <NA> <NA> <NA> <NA>
## 3 <NA> <NA> <NA> <NA> <NA> 27.6 <NA> <NA> <NA> <NA> <NA> <NA>
## 4 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 5 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 6 <NA> <NA> 42.7 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
# Removing Series.Name Series.Code
GINI <- gini_data %>% gather(Year, GINI_Index, c(5:16), na.rm = TRUE) %>% group_by(Country.Name) %>% select(-c(Series.Name, Series.Code))
head(GINI)
## # A tibble: 6 x 4
## # Groups:   Country.Name [6]
##   Country.Name     Country.Code Year  GINI_Index
##   <chr>            <chr>        <chr> <chr>     
## 1 Bolivia          BOL          1990  42        
## 2 Brazil           BRA          1990  60.5      
## 3 Chile            CHL          1990  57.3      
## 4 Costa Rica       CRI          1990  45.3      
## 5 Egypt, Arab Rep. EGY          1990  32        
## 6 Honduras         HND          1990  57.4
# Converting GINI Index to Numeric
GINI <- transform(GINI, GINI_Index = as.numeric(GINI_Index)) 
head(GINI)
##        Country.Name Country.Code Year GINI_Index
## 24          Bolivia          BOL 1990       42.0
## 27           Brazil          BRA 1990       60.5
## 41            Chile          CHL 1990       57.3
## 47       Costa Rica          CRI 1990       45.3
## 59 Egypt, Arab Rep.          EGY 1990       32.0
## 85         Honduras          HND 1990       57.4
#An analysis that could be performed is the trend in the GINI coefficient for selected countries (Giants of the world)

GINI_countries <- GINI %>% filter(Country.Name %in% c('United States', 'India', 'United Kingdom', 'Russian Federation','China'))

#The data doesnt seem to have data for india for more than 1 year. Sorry, Also there is no data for year 2016 and 2017.

#Lets plot the selected countries pattern over the Years.
ggplot(GINI_countries, aes(x = Year, y = GINI_Index)) + geom_jitter(width = 0.5, height = 0.5, aes(color = as.factor(Country.Name))) + geom_line(aes(group = Country.Name), lty = 2, color = "blue") + labs(title = "Subset of Countries: GINI Index Plots", x = "Year", y = "GINI Index") 

#Other analysis that could be performed is which countries seem to have the highest GINI coeficient. 
#That would mean the worst equality on walth distribution.
#Lets calculate average GINI coeficient for each country
GINI_avg_per_country <- GINI %>% group_by(Country.Name) %>% summarise(AVG_GINI = mean(GINI_Index)) %>% arrange(desc(AVG_GINI))
kable(head(GINI_avg_per_country))
Country.Name AVG_GINI
South Africa 61.40000
Namibia 61.00000
Botswana 60.50000
Zambia 56.35000
Central African Republic 56.20000
Colombia 54.65556

From the above analysis we can see the top 5 countries which have the worst GINI coeficient are African countries..

Dataset3:- Game Data.

The data was posed by Valerie Briot. Thre are total 12 Rows in the dataset. The First 10 rows are Free Apps and the last 2 rows are paid apps

Game_data <- read.csv("/Users/ashishsm1986/git/Cuny-Assignments/week6-assignment/Game-Data1.csv", header = TRUE, stringsAsFactors = F, sep = ',', skip = 2)
kable (head(Game_data))
App.Name Type Paid.Free Release.Date X2010 X2011 X2012 X2013 X2014 X2015 Release.Date.1 X2010.1 X2011.1 X2012.1 X2013.1 X2014.1 X2015.1
Candy Crush Saga Game Free 4/12/12 NA NA 8 56 60 76 11/4/12 NA NA 2 53 64 72
Fruit Ninja Game Free 4/21/10 4 8 58 102 126 148 7/10/10 1 9 64 108 132 165
Angry Birds Game Free 12/11/09 10 124 320 547 648 627 11/19/10 2 108 312 538 647 656
Subway Surfers Game Free 5/24/12 NA NA 23 123 202 303 5/24/12 NA NA 26 128 236 329
Despicable Me: Minion Rush Game Free 6/10/13 NA NA NA 16 58 128 6/10/13 NA NA NA 18 64 294
Clash of Clans Game Free 8/2/12 NA NA 24 123 234 345 10/7/13 NA NA NA 4 143 256
# Google Play Apps
google_play <- subset(Game_data, select=c(App.Name, Type, Paid.Free, Release.Date.1, X2010.1, X2011.1, X2012.1, X2013.1, X2014.1, X2015.1))
kable(head(google_play))
App.Name Type Paid.Free Release.Date.1 X2010.1 X2011.1 X2012.1 X2013.1 X2014.1 X2015.1
Candy Crush Saga Game Free 11/4/12 NA NA 2 53 64 72
Fruit Ninja Game Free 7/10/10 1 9 64 108 132 165
Angry Birds Game Free 11/19/10 2 108 312 538 647 656
Subway Surfers Game Free 5/24/12 NA NA 26 128 236 329
Despicable Me: Minion Rush Game Free 6/10/13 NA NA NA 18 64 294
Clash of Clans Game Free 10/7/13 NA NA NA 4 143 256
# Apple Store Apps
apple_store <- subset(Game_data, select=c(App.Name, Type, Paid.Free, Release.Date.1, X2010, X2011, X2012, X2013, X2014, X2015))
kable(head(apple_store))
App.Name Type Paid.Free Release.Date.1 X2010 X2011 X2012 X2013 X2014 X2015
Candy Crush Saga Game Free 11/4/12 NA NA 8 56 60 76
Fruit Ninja Game Free 7/10/10 4 8 58 102 126 148
Angry Birds Game Free 11/19/10 10 124 320 547 648 627
Subway Surfers Game Free 5/24/12 NA NA 23 123 202 303
Despicable Me: Minion Rush Game Free 6/10/13 NA NA NA 16 58 128
Clash of Clans Game Free 10/7/13 NA NA 24 123 234 345
#Replacing the NA’s to zero (0)
google_play <- google_play %>% mutate( X2010.1 = ifelse(is.na(X2010.1),0,X2010.1))
google_play <- google_play %>% mutate( X2011.1 = ifelse(is.na(X2011.1),0,X2011.1))
google_play <- google_play %>% mutate( X2012.1 = ifelse(is.na(X2012.1),0,X2012.1))
google_play <- google_play %>% mutate( X2013.1 = ifelse(is.na(X2013.1),0,X2013.1))
google_play <- google_play %>% mutate( X2014.1 = ifelse(is.na(X2014.1),0,X2014.1))
google_play <- google_play %>% mutate( X2015.1 = ifelse(is.na(X2015.1),0,X2015.1))
kable(head(google_play))
App.Name Type Paid.Free Release.Date.1 X2010.1 X2011.1 X2012.1 X2013.1 X2014.1 X2015.1
Candy Crush Saga Game Free 11/4/12 0 0 2 53 64 72
Fruit Ninja Game Free 7/10/10 1 9 64 108 132 165
Angry Birds Game Free 11/19/10 2 108 312 538 647 656
Subway Surfers Game Free 5/24/12 0 0 26 128 236 329
Despicable Me: Minion Rush Game Free 6/10/13 0 0 0 18 64 294
Clash of Clans Game Free 10/7/13 0 0 0 4 143 256
apple_store <- apple_store %>% mutate( X2010 = ifelse(is.na(X2010),0,X2010))
apple_store <- apple_store %>% mutate( X2011 = ifelse(is.na(X2011),0,X2011))
apple_store <- apple_store %>% mutate( X2012 = ifelse(is.na(X2012),0,X2012))
apple_store <- apple_store %>% mutate( X2013 = ifelse(is.na(X2013),0,X2013))
apple_store <- apple_store %>% mutate( X2014 = ifelse(is.na(X2014),0,X2014))
apple_store <- apple_store %>% mutate( X2015 = ifelse(is.na(X2015),0,X2015))
kable(head(apple_store))
App.Name Type Paid.Free Release.Date.1 X2010 X2011 X2012 X2013 X2014 X2015
Candy Crush Saga Game Free 11/4/12 0 0 8 56 60 76
Fruit Ninja Game Free 7/10/10 4 8 58 102 126 148
Angry Birds Game Free 11/19/10 10 124 320 547 648 627
Subway Surfers Game Free 5/24/12 0 0 23 123 202 303
Despicable Me: Minion Rush Game Free 6/10/13 0 0 0 16 58 128
Clash of Clans Game Free 10/7/13 0 0 24 123 234 345

Analysis

Average downloads since the release date for Google Play and Apple Store

kable(google_play %>% summarise_at(.vars = c("X2010.1", "X2011.1", "X2012.1", "X2013.1", "X2014.1", "X2015.1"), .funs = mean))
X2010.1 X2011.1 X2012.1 X2013.1 X2014.1 X2015.1
12.41667 33.83333 81.91667 178.6667 251.75 322.9167
kable(apple_store %>% summarise_at(.vars = c("X2010","X2011","X2012", "X2013","X2014", "X2015"), .funs = mean))
X2010 X2011 X2012 X2013 X2014 X2015
13.08333 34.5 79.91667 179.8333 243.9167 291.9167

Average Paid App Download since the release date for Google Play

kable((google_play %>% filter(Paid.Free == 'Paid') %>% summarise_at(.vars = c("X2010.1","X2011.1","X2012.1", "X2013.1","X2014.1", "X2015.1"), .funs = mean)))
X2010.1 X2011.1 X2012.1 X2013.1 X2014.1 X2015.1
0 4 77 142.5 184.5 348.5

Average Free App Download since the release date for Google Play

kable((google_play %>% filter(Paid.Free == 'Free') %>% summarise_at(.vars = c("X2010.1","X2011.1","X2012.1", "X2013.1","X2014.1", "X2015.1"), .funs = mean)))
X2010.1 X2011.1 X2012.1 X2013.1 X2014.1 X2015.1
14.9 39.8 82.9 185.9 265.2 317.8

Average Paid App Download since the release date for Apple Store

kable((apple_store %>% filter(Paid.Free == 'Paid') %>% summarise_at(.vars = c("X2010","X2011","X2012", "X2013","X2014", "X2015"), .funs = mean)))
X2010 X2011 X2012 X2013 X2014 X2015
0 2 60 120 160 292.5

Average Free App Download since the release date for Apple Store

kable((apple_store %>% filter(Paid.Free == 'Free') %>% summarise_at(.vars = c("X2010","X2011","X2012", "X2013","X2014", "X2015"), .funs = mean)))
X2010 X2011 X2012 X2013 X2014 X2015
15.7 41 83.9 191.8 260.7 291.8