607 Week 3 R Assignment

Your task is to choose one dataset, then study the data and its associated description of the data (i.e. “data dictionary”). You should take the data, and create an R data frame with a subset of the columns (and if you like rows) in the dataset. Your deliverable is the R code to perform these transformation tasks.

I chose the flag data from UCI data sets. It contains the country names, and their various flag attributes. I have taken a subset of Eurpe and Africa/Asia landmass regions, and columns that contain number of bars and strips within a flag.

library(dplyr) # Selecting and filtering
library(magrittr) # Pipes
library(knitr) # kable

# Read the flags.data file online
flags_fulldump_df <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/flags/flag.data", header= FALSE, sep=",")

# Add column names (obtained from data dictionary)
colnames(flags_fulldump_df) <- c("name", "landmass", "zone", 
                                 "area", "population", "language", 
                                 "religion", "bars", "stripes", 
                                 "colours", "red", "green", 
                                 "blue", "gold", "white", 
                                 "black", "orange", "mainhue", 
                                 "circles", "crosses", "saltires",
                                 "quarters", "sunstars", "crescent", 
                                 "triangle", "icon", "animate", 
                                 "text", "topleft", "botright")

# Find the structure of flags data frame
str(flags_fulldump_df)

## 'data.frame':    194 obs. of  30 variables:
##  $ name      : Factor w/ 194 levels "Afghanistan",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ landmass  : int  5 3 4 6 3 4 1 1 2 2 ...
##  $ zone      : int  1 1 1 3 1 2 4 4 3 3 ...
##  $ area      : int  648 29 2388 0 0 1247 0 0 2777 2777 ...
##  $ population: int  16 3 20 0 0 7 0 0 28 28 ...
##  $ language  : int  10 6 8 1 6 10 1 1 2 2 ...
##  $ religion  : int  2 6 2 1 0 5 1 1 0 0 ...
##  $ bars      : int  0 0 2 0 3 0 0 0 0 0 ...
##  $ stripes   : int  3 0 0 0 0 2 1 1 3 3 ...
##  $ colours   : int  5 3 3 5 3 3 3 5 2 3 ...
##  $ red       : int  1 1 1 1 1 1 0 1 0 0 ...
##  $ green     : int  1 0 1 0 0 0 0 0 0 0 ...
##  $ blue      : int  0 0 0 1 1 0 1 1 1 1 ...
##  $ gold      : int  1 1 0 1 1 1 0 1 0 1 ...
##  $ white     : int  1 0 1 1 0 0 1 1 1 1 ...
##  $ black     : int  1 1 0 0 0 1 0 1 0 0 ...
##  $ orange    : int  0 0 0 1 0 0 1 0 0 0 ...
##  $ mainhue   : Factor w/ 8 levels "black","blue",..: 5 7 5 2 4 7 8 7 2 2 ...
##  $ circles   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ crosses   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ saltires  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ quarters  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ sunstars  : int  1 1 1 0 0 1 0 1 0 1 ...
##  $ crescent  : int  0 0 1 0 0 0 0 0 0 0 ...
##  $ triangle  : int  0 0 0 1 0 0 0 1 0 0 ...
##  $ icon      : int  1 0 0 1 0 1 0 0 0 0 ...
##  $ animate   : int  0 1 0 1 0 0 1 0 0 0 ...
##  $ text      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ topleft   : Factor w/ 7 levels "black","blue",..: 1 6 4 2 2 6 7 1 2 2 ...
##  $ botright  : Factor w/ 8 levels "black","blue",..: 5 7 8 7 7 1 2 7 2 2 ...

# Display few lines of df
kable(head (flags_fulldump_df))

name	landmass	zone	area	population	language	religion	bars	stripes	colours	red	green	blue	gold	white	black	orange	mainhue	sunstars	crescent	triangle	icon	animate	topleft	botright
Afghanistan	5	1	648	16	10	2	0	3	5	1	1	0	1	1	1	0	green	1	0	0	1	0	black	green
Albania	3	1	29	3	6	6	0	0	3	1	0	0	1	0	1	0	red	1	0	0	0	1	red	red
Algeria	4	1	2388	20	8	2	2	0	3	1	1	0	0	1	0	0	green	1	1	0	0	0	green	white
American-Samoa	6	3	0	0	1	1	0	0	5	1	0	1	1	1	0	1	blue	0	0	1	1	1	blue	red
Andorra	3	1	0	0	6	0	3	0	3	1	0	1	1	0	0	0	gold	0	0	0	0	0	blue	red
Angola	4	2	1247	7	10	5	0	2	3	1	0	0	1	0	1	0	red	1	0	0	1	0	red	black

# Create subset containing columns: landmass(Europe & Africa/Asia), country name, bars and stripes
flags_subset_df <- flags_fulldump_df %>%
                   select(landmass, name, bars, stripes) %>%
                   filter(landmass %in% c(3,4))

kable(head(flags_subset_df))

landmass	name	bars	stripes
3	Albania	0	0
4	Algeria	2	0
3	Andorra	3	0
4	Angola	0	2
3	Austria	0	3
3	Belgium	3	0

# Set all zero values to NA so that they will not counted when aggregating count of bars and stripes
flags_subset_df[flags_subset_df == 0] <- NA
kable(head(flags_subset_df))

landmass	name	bars	stripes
3	Albania	NA	NA
4	Algeria	2	NA
3	Andorra	3	NA
4	Angola	NA	2
3	Austria	NA	3
3	Belgium	3	NA

# Group the values by Europe and Africa/Asia, then count the bars and stripes
flags_sum_df <- flags_subset_df %>%
                   group_by(landmass) %>%
                   summarise(totalCountries = sum(!is.na(name)),
                            totalBars = sum(!is.na(bars)),
                            totalStripes=sum(!is.na(stripes)))

kable(head(flags_sum_df))

landmass	totalCountries	totalBars	totalStripes
3	35	8	15
4	52	15	26

# Add percentages
flags_percent_df <- flags_sum_df %>%
                mutate(percentBars = round(totalBars/totalCountries,digits=2),
                          percentStripes = round(totalStripes/totalCountries,digits=2),
                       percentNone = round((totalCountries-(totalBars+totalStripes))/totalCountries, digits=2))

# Add rownames
rownames(flags_percent_df) <- c("Europe", "Africa/Asia")
#kable(flags_percent_df[,c(5,6,7)])
kable(flags_percent_df %>% select(percentBars, percentStripes, percentNone))

	percentBars	percentStripes	percentNone
Europe	0.23	0.43	0.34
Africa/Asia	0.29	0.50	0.21

** It appears stripes are common in majority of Europe and Africa/Asia country flags. **

607 Week 3 R Assignment

Puneet Auluck

September 12, 2015