Assignment:-
The goal of this assignment is to give you practice in preparing different datasets for downstream analysis work.:
- Choose any three of the “wide” datasets identified in the Week 6 Discussion items. (You may use your own dataset; please don’t use my Sample Post dataset, since that was used in your Week 6 assignment!) For each of the three chosen datasets:
—–Create a .CSV file (or optionally, a MySQL database!) that includes all of the information included in the dataset. You’re encouraged to use a “wide” structure similar to how the information appears in the discussion item, so that you can practice tidying and transformations as described below.
—-Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data. [Most of your grade will be based on this step!]
—-Perform the analysis requested in the discussion item.
—-Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions. - Please include in your homework submission, for each of the three chosen datasets:
The URL to the .Rmd file in your GitHub repository, and The URL for your rpubs.com web page.
Dataset1:- 1984 United States Congressional Voting Records
library(stringr)
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(knitr)
library(stringr)
library(ggplot2)
votedata <- read.csv("/Users/ashishsm1986/git/Cuny-Assignments/week6-assignment/house-votes-data.txt", header = F, stringsAsFactors = F)
#Add names to the columns
vd_names <- readLines("/Users/ashishsm1986/git/Cuny-Assignments/week6-assignment/house-votes-names.txt")[33:49]
vd_names <- str_extract(vd_names,"[[:alpha:]-]{2,40}")
vd_names[1] <- "party"
names(votedata) <- vd_names
kable(head(votedata)) #Making sure data is read successfully
party | handicapped-infants | water-project-cost-sharing | adoption-of-the-budget-resolution | physician-fee-freeze | el-salvador-aid | religious-groups-in-schools | anti-satellite-test-ban | aid-to-nicaraguan-contras | mx-missile | immigration | synfuels-corporation-cutback | education-spending | superfund-right-to-sue | crime | duty-free-exports | export-administration-act-south-africa |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
republican | n | y | n | y | y | y | n | n | n | y | ? | y | y | y | n | y |
republican | n | y | n | y | y | y | n | n | n | n | n | y | y | y | n | ? |
democrat | ? | y | y | ? | y | y | n | n | n | n | y | n | y | y | n | n |
democrat | n | y | y | n | ? | y | n | n | n | n | y | n | y | n | n | y |
democrat | y | y | y | n | y | y | n | n | n | n | y | ? | y | y | y | y |
democrat | n | y | y | n | y | y | n | n | n | n | n | n | y | y | y | y |
Tidy the data, Get most number of votes for different agendas
agreement <- votedata %>%
gather("issue", "vote", 2:17) %>%
filter(vote %in% c("y", "n")) %>%
group_by(party, issue, vote) %>%
summarise(vote_count = n()) %>%
spread(vote, vote_count) %>%
rename(no_votes = n, yes_votes = y) %>%
mutate(total_votes = no_votes + yes_votes,
agreement_rate = pmax(no_votes, yes_votes) / total_votes,
'got_affirmative_votes?' = ifelse(yes_votes > no_votes, "true", "false")) %>%
select(party, issue, agreement_rate, 'got_affirmative_votes?')
kable(head(agreement)) #Sampling the data
party | issue | agreement_rate | got_affirmative_votes? |
---|---|---|---|
democrat | adoption-of-the-budget-resolution | 0.8884615 | true |
democrat | aid-to-nicaraguan-contras | 0.8288973 | true |
democrat | anti-satellite-test-ban | 0.7722008 | true |
democrat | crime | 0.6498054 | false |
democrat | duty-free-exports | 0.6374502 | true |
democrat | education-spending | 0.8554217 | false |
My Analysis:- Overall, On what issue did both parties voted unified
agreement %>%
arrange(desc(agreement_rate)) %>%
head(1) %>%
kable()
party | issue | agreement_rate | got_affirmative_votes? |
---|---|---|---|
republican | physician-fee-freeze | 0.9878788 | true |
My Analysis:- Overall, On what issue did both parties voted least unified
agreement %>%
arrange(agreement_rate) %>%
head(1) %>%
kable()
party | issue | agreement_rate | got_affirmative_votes? |
---|---|---|---|
democrat | water-project-cost-sharing | 0.5020921 | true |
Dataset2:- Country Development Indicators.
The data was taken from http://databank.worldbank.org/data/reports.aspx?source=2&series=SI.POV.GINI&country=#
GINI coefficient indicates inequality in wealth distribution, with 0 being complete equality. The bigger the GINI index the worst the wealth distribution.
#import data from csv
gini_data <- read.csv("/Users/ashishsm1986/git/Cuny-Assignments/week6-assignment/GINI_Data.csv", header = TRUE, stringsAsFactors = F)
head(gini_data) #make sure the data is read
## Series.Name Series.Code Country.Name Country.Code
## 1 GINI index (World Bank estimate) SI.POV.GINI Afghanistan AFG
## 2 GINI index (World Bank estimate) SI.POV.GINI Albania ALB
## 3 GINI index (World Bank estimate) SI.POV.GINI Algeria DZA
## 4 GINI index (World Bank estimate) SI.POV.GINI American Samoa ASM
## 5 GINI index (World Bank estimate) SI.POV.GINI Andorra AND
## 6 GINI index (World Bank estimate) SI.POV.GINI Angola AGO
## X1990..YR1990. X2000..YR2000. X2008..YR2008. X2009..YR2009.
## 1 .. .. .. ..
## 2 .. .. 30 ..
## 3 .. .. .. ..
## 4 .. .. .. ..
## 5 .. .. .. ..
## 6 .. .. 42.7 ..
## X2010..YR2010. X2011..YR2011. X2012..YR2012. X2013..YR2013.
## 1 .. .. .. ..
## 2 .. .. 29 ..
## 3 .. 27.6 .. ..
## 4 .. .. .. ..
## 5 .. .. .. ..
## 6 .. .. .. ..
## X2014..YR2014. X2015..YR2015. X2016..YR2016. X2017..YR2017.
## 1 .. .. .. ..
## 2 .. .. .. ..
## 3 .. .. .. ..
## 4 .. .. .. ..
## 5 .. .. .. ..
## 6 .. .. .. ..
#Data is clearly untidy
#We will cleanup the year column names so we can read and make sense out of it.
colnames(gini_data) <- c("Series.Name", "Series.Code", "Country.Name", "Country.Code", 1990, 2000, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017)
head(gini_data)
## Series.Name Series.Code Country.Name Country.Code
## 1 GINI index (World Bank estimate) SI.POV.GINI Afghanistan AFG
## 2 GINI index (World Bank estimate) SI.POV.GINI Albania ALB
## 3 GINI index (World Bank estimate) SI.POV.GINI Algeria DZA
## 4 GINI index (World Bank estimate) SI.POV.GINI American Samoa ASM
## 5 GINI index (World Bank estimate) SI.POV.GINI Andorra AND
## 6 GINI index (World Bank estimate) SI.POV.GINI Angola AGO
## 1990 2000 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017
## 1 .. .. .. .. .. .. .. .. .. .. .. ..
## 2 .. .. 30 .. .. .. 29 .. .. .. .. ..
## 3 .. .. .. .. .. 27.6 .. .. .. .. .. ..
## 4 .. .. .. .. .. .. .. .. .. .. .. ..
## 5 .. .. .. .. .. .. .. .. .. .. .. ..
## 6 .. .. 42.7 .. .. .. .. .. .. .. .. ..
#Replace the ".." data values with NA
gini_data[gini_data == '..'] <- NA
head(gini_data) #Much cleaner now!
## Series.Name Series.Code Country.Name Country.Code
## 1 GINI index (World Bank estimate) SI.POV.GINI Afghanistan AFG
## 2 GINI index (World Bank estimate) SI.POV.GINI Albania ALB
## 3 GINI index (World Bank estimate) SI.POV.GINI Algeria DZA
## 4 GINI index (World Bank estimate) SI.POV.GINI American Samoa ASM
## 5 GINI index (World Bank estimate) SI.POV.GINI Andorra AND
## 6 GINI index (World Bank estimate) SI.POV.GINI Angola AGO
## 1990 2000 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017
## 1 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 2 <NA> <NA> 30 <NA> <NA> <NA> 29 <NA> <NA> <NA> <NA> <NA>
## 3 <NA> <NA> <NA> <NA> <NA> 27.6 <NA> <NA> <NA> <NA> <NA> <NA>
## 4 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 5 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 6 <NA> <NA> 42.7 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
# Removing Series.Name Series.Code
GINI <- gini_data %>% gather(Year, GINI_Index, c(5:16), na.rm = TRUE) %>% group_by(Country.Name) %>% select(-c(Series.Name, Series.Code))
head(GINI)
## # A tibble: 6 x 4
## # Groups: Country.Name [6]
## Country.Name Country.Code Year GINI_Index
## <chr> <chr> <chr> <chr>
## 1 Bolivia BOL 1990 42
## 2 Brazil BRA 1990 60.5
## 3 Chile CHL 1990 57.3
## 4 Costa Rica CRI 1990 45.3
## 5 Egypt, Arab Rep. EGY 1990 32
## 6 Honduras HND 1990 57.4
# Converting GINI Index to Numeric
GINI <- transform(GINI, GINI_Index = as.numeric(GINI_Index))
head(GINI)
## Country.Name Country.Code Year GINI_Index
## 24 Bolivia BOL 1990 42.0
## 27 Brazil BRA 1990 60.5
## 41 Chile CHL 1990 57.3
## 47 Costa Rica CRI 1990 45.3
## 59 Egypt, Arab Rep. EGY 1990 32.0
## 85 Honduras HND 1990 57.4
#An analysis that could be performed is the trend in the GINI coefficient for selected countries (Giants of the world)
GINI_countries <- GINI %>% filter(Country.Name %in% c('United States', 'India', 'United Kingdom', 'Russian Federation','China'))
#The data doesnt seem to have data for india for more than 1 year. Sorry, Also there is no data for year 2016 and 2017.
#Lets plot the selected countries pattern over the Years.
ggplot(GINI_countries, aes(x = Year, y = GINI_Index)) + geom_jitter(width = 0.5, height = 0.5, aes(color = as.factor(Country.Name))) + geom_line(aes(group = Country.Name), lty = 2, color = "blue") + labs(title = "Subset of Countries: GINI Index Plots", x = "Year", y = "GINI Index")
#Other analysis that could be performed is which countries seem to have the highest GINI coeficient.
#That would mean the worst equality on walth distribution.
#Lets calculate average GINI coeficient for each country
GINI_avg_per_country <- GINI %>% group_by(Country.Name) %>% summarise(AVG_GINI = mean(GINI_Index)) %>% arrange(desc(AVG_GINI))
kable(head(GINI_avg_per_country))
Country.Name | AVG_GINI |
---|---|
South Africa | 61.40000 |
Namibia | 61.00000 |
Botswana | 60.50000 |
Zambia | 56.35000 |
Central African Republic | 56.20000 |
Colombia | 54.65556 |
From the above analysis we can see the top 5 countries which have the worst GINI coeficient are African countries..
Dataset3:- Game Data.
The data was posed by Valerie Briot. Thre are total 12 Rows in the dataset. The First 10 rows are Free Apps and the last 2 rows are paid apps
Game_data <- read.csv("/Users/ashishsm1986/git/Cuny-Assignments/week6-assignment/Game-Data1.csv", header = TRUE, stringsAsFactors = F, sep = ',', skip = 2)
kable (head(Game_data))
App.Name | Type | Paid.Free | Release.Date | X2010 | X2011 | X2012 | X2013 | X2014 | X2015 | Release.Date.1 | X2010.1 | X2011.1 | X2012.1 | X2013.1 | X2014.1 | X2015.1 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Candy Crush Saga | Game | Free | 4/12/12 | NA | NA | 8 | 56 | 60 | 76 | 11/4/12 | NA | NA | 2 | 53 | 64 | 72 |
Fruit Ninja | Game | Free | 4/21/10 | 4 | 8 | 58 | 102 | 126 | 148 | 7/10/10 | 1 | 9 | 64 | 108 | 132 | 165 |
Angry Birds | Game | Free | 12/11/09 | 10 | 124 | 320 | 547 | 648 | 627 | 11/19/10 | 2 | 108 | 312 | 538 | 647 | 656 |
Subway Surfers | Game | Free | 5/24/12 | NA | NA | 23 | 123 | 202 | 303 | 5/24/12 | NA | NA | 26 | 128 | 236 | 329 |
Despicable Me: Minion Rush | Game | Free | 6/10/13 | NA | NA | NA | 16 | 58 | 128 | 6/10/13 | NA | NA | NA | 18 | 64 | 294 |
Clash of Clans | Game | Free | 8/2/12 | NA | NA | 24 | 123 | 234 | 345 | 10/7/13 | NA | NA | NA | 4 | 143 | 256 |
# Google Play Apps
google_play <- subset(Game_data, select=c(App.Name, Type, Paid.Free, Release.Date.1, X2010.1, X2011.1, X2012.1, X2013.1, X2014.1, X2015.1))
kable(head(google_play))
App.Name | Type | Paid.Free | Release.Date.1 | X2010.1 | X2011.1 | X2012.1 | X2013.1 | X2014.1 | X2015.1 |
---|---|---|---|---|---|---|---|---|---|
Candy Crush Saga | Game | Free | 11/4/12 | NA | NA | 2 | 53 | 64 | 72 |
Fruit Ninja | Game | Free | 7/10/10 | 1 | 9 | 64 | 108 | 132 | 165 |
Angry Birds | Game | Free | 11/19/10 | 2 | 108 | 312 | 538 | 647 | 656 |
Subway Surfers | Game | Free | 5/24/12 | NA | NA | 26 | 128 | 236 | 329 |
Despicable Me: Minion Rush | Game | Free | 6/10/13 | NA | NA | NA | 18 | 64 | 294 |
Clash of Clans | Game | Free | 10/7/13 | NA | NA | NA | 4 | 143 | 256 |
# Apple Store Apps
apple_store <- subset(Game_data, select=c(App.Name, Type, Paid.Free, Release.Date.1, X2010, X2011, X2012, X2013, X2014, X2015))
kable(head(apple_store))
App.Name | Type | Paid.Free | Release.Date.1 | X2010 | X2011 | X2012 | X2013 | X2014 | X2015 |
---|---|---|---|---|---|---|---|---|---|
Candy Crush Saga | Game | Free | 11/4/12 | NA | NA | 8 | 56 | 60 | 76 |
Fruit Ninja | Game | Free | 7/10/10 | 4 | 8 | 58 | 102 | 126 | 148 |
Angry Birds | Game | Free | 11/19/10 | 10 | 124 | 320 | 547 | 648 | 627 |
Subway Surfers | Game | Free | 5/24/12 | NA | NA | 23 | 123 | 202 | 303 |
Despicable Me: Minion Rush | Game | Free | 6/10/13 | NA | NA | NA | 16 | 58 | 128 |
Clash of Clans | Game | Free | 10/7/13 | NA | NA | 24 | 123 | 234 | 345 |
#Replacing the NA’s to zero (0)
google_play <- google_play %>% mutate( X2010.1 = ifelse(is.na(X2010.1),0,X2010.1))
google_play <- google_play %>% mutate( X2011.1 = ifelse(is.na(X2011.1),0,X2011.1))
google_play <- google_play %>% mutate( X2012.1 = ifelse(is.na(X2012.1),0,X2012.1))
google_play <- google_play %>% mutate( X2013.1 = ifelse(is.na(X2013.1),0,X2013.1))
google_play <- google_play %>% mutate( X2014.1 = ifelse(is.na(X2014.1),0,X2014.1))
google_play <- google_play %>% mutate( X2015.1 = ifelse(is.na(X2015.1),0,X2015.1))
kable(head(google_play))
App.Name | Type | Paid.Free | Release.Date.1 | X2010.1 | X2011.1 | X2012.1 | X2013.1 | X2014.1 | X2015.1 |
---|---|---|---|---|---|---|---|---|---|
Candy Crush Saga | Game | Free | 11/4/12 | 0 | 0 | 2 | 53 | 64 | 72 |
Fruit Ninja | Game | Free | 7/10/10 | 1 | 9 | 64 | 108 | 132 | 165 |
Angry Birds | Game | Free | 11/19/10 | 2 | 108 | 312 | 538 | 647 | 656 |
Subway Surfers | Game | Free | 5/24/12 | 0 | 0 | 26 | 128 | 236 | 329 |
Despicable Me: Minion Rush | Game | Free | 6/10/13 | 0 | 0 | 0 | 18 | 64 | 294 |
Clash of Clans | Game | Free | 10/7/13 | 0 | 0 | 0 | 4 | 143 | 256 |
apple_store <- apple_store %>% mutate( X2010 = ifelse(is.na(X2010),0,X2010))
apple_store <- apple_store %>% mutate( X2011 = ifelse(is.na(X2011),0,X2011))
apple_store <- apple_store %>% mutate( X2012 = ifelse(is.na(X2012),0,X2012))
apple_store <- apple_store %>% mutate( X2013 = ifelse(is.na(X2013),0,X2013))
apple_store <- apple_store %>% mutate( X2014 = ifelse(is.na(X2014),0,X2014))
apple_store <- apple_store %>% mutate( X2015 = ifelse(is.na(X2015),0,X2015))
kable(head(apple_store))
App.Name | Type | Paid.Free | Release.Date.1 | X2010 | X2011 | X2012 | X2013 | X2014 | X2015 |
---|---|---|---|---|---|---|---|---|---|
Candy Crush Saga | Game | Free | 11/4/12 | 0 | 0 | 8 | 56 | 60 | 76 |
Fruit Ninja | Game | Free | 7/10/10 | 4 | 8 | 58 | 102 | 126 | 148 |
Angry Birds | Game | Free | 11/19/10 | 10 | 124 | 320 | 547 | 648 | 627 |
Subway Surfers | Game | Free | 5/24/12 | 0 | 0 | 23 | 123 | 202 | 303 |
Despicable Me: Minion Rush | Game | Free | 6/10/13 | 0 | 0 | 0 | 16 | 58 | 128 |
Clash of Clans | Game | Free | 10/7/13 | 0 | 0 | 24 | 123 | 234 | 345 |
Analysis
Average downloads since the release date for Google Play and Apple Store
kable(google_play %>% summarise_at(.vars = c("X2010.1", "X2011.1", "X2012.1", "X2013.1", "X2014.1", "X2015.1"), .funs = mean))
X2010.1 | X2011.1 | X2012.1 | X2013.1 | X2014.1 | X2015.1 |
---|---|---|---|---|---|
12.41667 | 33.83333 | 81.91667 | 178.6667 | 251.75 | 322.9167 |
kable(apple_store %>% summarise_at(.vars = c("X2010","X2011","X2012", "X2013","X2014", "X2015"), .funs = mean))
X2010 | X2011 | X2012 | X2013 | X2014 | X2015 |
---|---|---|---|---|---|
13.08333 | 34.5 | 79.91667 | 179.8333 | 243.9167 | 291.9167 |
Average Paid App Download since the release date for Google Play
kable((google_play %>% filter(Paid.Free == 'Paid') %>% summarise_at(.vars = c("X2010.1","X2011.1","X2012.1", "X2013.1","X2014.1", "X2015.1"), .funs = mean)))
X2010.1 | X2011.1 | X2012.1 | X2013.1 | X2014.1 | X2015.1 |
---|---|---|---|---|---|
0 | 4 | 77 | 142.5 | 184.5 | 348.5 |
Average Free App Download since the release date for Google Play
kable((google_play %>% filter(Paid.Free == 'Free') %>% summarise_at(.vars = c("X2010.1","X2011.1","X2012.1", "X2013.1","X2014.1", "X2015.1"), .funs = mean)))
X2010.1 | X2011.1 | X2012.1 | X2013.1 | X2014.1 | X2015.1 |
---|---|---|---|---|---|
14.9 | 39.8 | 82.9 | 185.9 | 265.2 | 317.8 |
Average Paid App Download since the release date for Apple Store
kable((apple_store %>% filter(Paid.Free == 'Paid') %>% summarise_at(.vars = c("X2010","X2011","X2012", "X2013","X2014", "X2015"), .funs = mean)))
X2010 | X2011 | X2012 | X2013 | X2014 | X2015 |
---|---|---|---|---|---|
0 | 2 | 60 | 120 | 160 | 292.5 |
Average Free App Download since the release date for Apple Store
kable((apple_store %>% filter(Paid.Free == 'Free') %>% summarise_at(.vars = c("X2010","X2011","X2012", "X2013","X2014", "X2015"), .funs = mean)))
X2010 | X2011 | X2012 | X2013 | X2014 | X2015 |
---|---|---|---|---|---|
15.7 | 41 | 83.9 | 191.8 | 260.7 | 291.8 |