Data 606 Project Proposal

# Load the Flag dataset from UCI's Machine Learning Data Repository
flag_df <- read.csv(url("https://archive.ics.uci.edu/ml/machine-learning-databases/flags/flag.data"),header=F)
#Rename data frame V columns

#give a descriptive name to selected columns
library(plyr)
# data set meta data below
# source - https://archive.ics.uci.edu/ml/datasets/Flags
#1. name:   Name of the country concerned 
#2. landmass:   1=N.America, 2=S.America, 3=Europe, 4=Africa, 5=Asia, 6=Oceania 
#3. zone:   Geographic quadrant, based on Greenwich and the Equator; 1=NE, 2=SE, 
#  3=SW, 4=NW 
#4. area:   in thousands of square km 
#5. population: in round millions 
#6. language: 1=English, 2=Spanish, 3=French, 4=German, 5=Slavic, 6=Other Indo-European, 7=Chinese, 8=Arabic, 9=Japanese/Turkish/Finnish/Magyar, 10=Others 
#7. religion: 0=Catholic, 1=Other Christian, 2=Muslim, 3=Buddhist, 4=Hindu, 5=Ethnic, 6=Marxist, 7=Others 
#8. bars: Number of vertical bars in the flag 
#9. stripes: Number of horizontal stripes in the flag 
#10. colours: Number of different colours in the flag 
#11. red: 0 if red absent, 1 if red present in the flag 
#12. green: same for green 
#13. blue: same for blue 
#14. gold: same for gold (also yellow) 
#15. white: same for white 
#16. black: same for black 
#17. orange: same for orange (also brown) 
#18. mainhue: predominant colour in the flag (tie-breaks decided by taking the topmost hue, if that fails then the most central hue, and if that fails the leftmost hue) 
#19. circles: Number of circles in the flag 
#20. crosses: Number of (upright) crosses 
#21. saltires: Number of diagonal crosses 
#22. quarters: Number of quartered sections 
#23. sunstars: Number of sun or star symbols 
#24. crescent: 1 if a crescent moon symbol present, else 0 
#25. triangle: 1 if any triangles present, 0 otherwise 
#26. icon: 1 if an inanimate image present (e.g., a boat), otherwise 0 
#27. animate: 1 if an animate image (e.g., an eagle, a tree, a human hand) present, 0 otherwise 
#28. text: 1 if any letters or writing on the flag (e.g., a motto or slogan), 0 otherwise 
#29. topleft: colour in the top-left corner (moving right to decide tie-breaks) 
#30. botright: colour in the bottom-right corner (moving left to decide tie-breaks)

flag_df <- rename(flag_df, c("V1"="country", "V2"="continent", "V3"="zone", "V4"="area", "V5"="population", "V6"="language", "V7"="religion",  "V8"="bars", "V9"="stripes", "V10"="colors", "V11"="red", "V12"="green", "V13"="blue", "V14"="gold", "V15"="white", "V16"="black", "V17"="orange", "V18"="mainhue", "V19"="circles", "V20"="crosses", "V21"="saltires", "V22"="quarters", "V23"="sunstars", "V24"="crescent", "V25"="triangle", "V26"="icon", "V27"="animate", "V28"="text",  "V29"="topleft", "V30"="botright"))

View(flag_df)
nrow(flag_df)

## [1] 194
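
As an alternative to renaming after the load, the column names could be supplied when the file is read, which avoids the dependency on plyr's rename (later masked by dplyr). The sketch below is only an illustration: the names follow the metadata listing above, and the single input row is made up so that the example runs without a network connection.

```r
# Column names following the dataset metadata (30 attributes)
col_names <- c("country", "continent", "zone", "area", "population",
               "language", "religion", "bars", "stripes", "colors",
               "red", "green", "blue", "gold", "white", "black", "orange",
               "mainhue", "circles", "crosses", "saltires", "quarters",
               "sunstars", "crescent", "triangle", "icon", "animate",
               "text", "topleft", "botright")

# Made-up row for illustration only; it is NOT a record from the flag dataset
toy_line <- "Exampleland,3,1,100,5,1,1,0,3,3,1,0,1,0,1,0,0,red,0,0,0,0,0,0,0,0,0,0,red,blue"

# col.names assigns the descriptive names at read time
toy_df <- read.csv(text = toy_line, header = FALSE, col.names = col_names)
names(toy_df)
```

With the real data, the same `col.names` argument would be passed to the `read.csv(url(...), header = FALSE)` call shown earlier.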

Research question

Is there a correlation between a country’s official or predominant religion and the color and shape of its flag?

Cases

What are the cases, and how many are there?

There are 194 cases representing flag data attributes from 194 countries.

Data collection

Describe the method of data collection.

The data were collected by direct observation: the attributes of each flag were recorded from a published flag reference guide (see Data Source).

Type of study

What type of study is this (observational/experiment)?

This is an observational study.

Data Source

The dataset for this study was extracted from the University of California Irvine’s Machine Learning Repository. The dataset was created by Richard S. Forsyth. Mr. Forsyth collected data for the flag dataset primarily from “Collins Gem Guide to Flags”: Collins Publishers (1986).

Link - https://archive.ics.uci.edu/ml/datasets/Flags

Response

What is the response variable, and what type is it (numerical/categorical)?

The response variable in this study is the country’s official or predominant religion. Religion in this study is a categorical variable.

Explanatory

What is the explanatory variable, and what type is it (numerical/categorical)?

The explanatory variables in this study are the shape, color and size of a country’s flag. Shape and color are categorical variables. Size is a numerical variable.

Relevant summary statistics

Provide summary statistics relevant to your research question. For example, if you’re comparing means across groups provide means, SDs, sample sizes of each group. This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.

#make sure the Wickham libraries are loaded
library(tidyr)

## Warning: package 'tidyr' was built under R version 3.2.3

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:plyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 3.2.3

#group the countries by religion and get total count for each
rel_cnt= flag_df %>% group_by(religion) %>% summarise(n = n())

#add religion name column to rel_cnt data frame
rel_cnt$rel_name[rel_cnt$religion==0]<-"Catholic"
rel_cnt$rel_name[rel_cnt$religion==1]<-"Other Christian"
rel_cnt$rel_name[rel_cnt$religion==2]<-"Muslim"
rel_cnt$rel_name[rel_cnt$religion==3]<-"Buddhist"
rel_cnt$rel_name[rel_cnt$religion==4]<-"Hindu"
rel_cnt$rel_name[rel_cnt$religion==5]<-"Ethnic"
rel_cnt$rel_name[rel_cnt$religion==6]<-"Marxist"
rel_cnt$rel_name[rel_cnt$religion==7]<-"Others"

rel_cnt

## Source: local data frame [8 x 3]
## 
##   religion     n        rel_name
##      (int) (int)           (chr)
## 1        0    40        Catholic
## 2        1    60 Other Christian
## 3        2    36          Muslim
## 4        3     8        Buddhist
## 5        4     4           Hindu
## 6        5    27          Ethnic
## 7        6    15         Marxist
## 8        7     4          Others
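
The eight assignments used to label the religion codes could be collapsed into a single lookup. The sketch below assumes the code-to-name mapping from the metadata listing; `religion_codes` is a hypothetical stand-in for `rel_cnt$religion`, not real data.

```r
# Religion names in code order (0 = Catholic, ..., 7 = Others), per the metadata
rel_names <- c("Catholic", "Other Christian", "Muslim", "Buddhist",
               "Hindu", "Ethnic", "Marxist", "Others")

# Hypothetical stand-in for rel_cnt$religion
religion_codes <- c(0, 1, 2, 7)

# Codes start at 0 but R vectors are 1-indexed, so shift by one when indexing
rel_name <- rel_names[religion_codes + 1]
rel_name
```

Applied to the real table, this would read `rel_cnt$rel_name <- rel_names[rel_cnt$religion + 1]`.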

#mean and standard deviation of the number of colors in a country's flag, grouped by religion
flgcol_cnt_rel = flag_df %>% group_by(religion) %>% summarise(mean(colors), sd(colors))

flgcol_cnt_rel

## Source: local data frame [8 x 3]
## 
##   religion mean(colors) sd(colors)
##      (int)        (dbl)      (dbl)
## 1        0     3.175000  1.1521997
## 2        1     3.916667  1.6186589
## 3        2     3.000000  0.9856108
## 4        3     3.375000  1.3024702
## 5        4     4.000000  0.8164966
## 6        5     3.666667  0.7844645
## 7        6     3.200000  1.3201731
## 8        7     3.000000  1.1547005

#mean and standard deviation of the number of circles in a country's flag, grouped by religion
flgcirc_cnt_rel = flag_df %>% group_by(religion) %>% summarise(mean(circles), sd(circles))

flgcirc_cnt_rel

## Source: local data frame [8 x 3]
## 
##   religion mean(circles) sd(circles)
##      (int)         (dbl)       (dbl)
## 1        0    0.10000000   0.3038218
## 2        1    0.15000000   0.3600847
## 3        2    0.08333333   0.2803060
## 4        3    0.75000000   1.3887301
## 5        4    0.25000000   0.5000000
## 6        5    0.14814815   0.3620140
## 7        6    0.26666667   0.5936168
## 8        7    0.50000000   0.5773503
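
The group sizes, means and standard deviations reported in the separate tables above could also be produced in one call with named columns, which makes the result easier to read and reuse. This is a minimal sketch on a small made-up data frame (`toy_flags` is hypothetical and not taken from the flag dataset); the column names `n`, `mean_colors` and `sd_colors` are illustrative choices.

```r
library(dplyr)

# Made-up data for illustration only
toy_flags <- data.frame(
  religion = c(0, 0, 1, 1),
  colors   = c(2, 4, 3, 3)
)

# One row per religion: group size, mean and SD of the number of colors
col_summary <- toy_flags %>%
  group_by(religion) %>%
  summarise(n = n(),
            mean_colors = mean(colors),
            sd_colors = sd(colors))

col_summary
```

Replacing `toy_flags` with `flag_df` would give the same statistics as the tables above, with the sample size of each group alongside.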

Other shapes (such as crosses, suns and stars) and specific colors (red, blue, green, etc.) will also be analyzed in this study, using correlation techniques to determine whether these variables “explain” a country’s official religion.
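
Because religion and most of the flag features are categorical, one possible way to carry out this analysis is a chi-squared test of independence on a contingency table. This is a swapped-in suggestion rather than the method the proposal commits to. In the sketch below, `religion_feature_test` is a hypothetical helper and `toy` is made-up data, used only so that the example runs on its own.

```r
# Hypothetical helper: test whether a flag feature is associated with religion
religion_feature_test <- function(df, feature) {
  # Cross-tabulate religion against the chosen feature
  tab <- table(df$religion, df[[feature]])
  # Some religion groups are very small (e.g. n = 4), so the usual chi-squared
  # approximation may be unreliable; use a Monte Carlo p-value instead
  chisq.test(tab, simulate.p.value = TRUE, B = 2000)
}

# Made-up data for illustration only (not taken from the flag dataset):
# the crescent appears only in the third group
set.seed(1)
toy <- data.frame(
  religion = rep(0:2, each = 20),
  crescent = c(rep(0, 40), rep(c(0, 1), 10))
)

res <- religion_feature_test(toy, "crescent")
res$p.value
```

With the real data, the call would be, for example, `religion_feature_test(flag_df, "crescent")`; a small p-value would suggest an association, not that the flag feature explains the religion.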