Load Libraries:
library(RMySQL)
library(rworldmap)
library(RCurl)
library(stringr)
library(tidyr)
library(dplyr)
library(ggplot2)
library(psych)
Economists have been measuring the distribution of wealth within countries for many years. The GINI index summarizes this distribution: a higher GINI coefficient signifies greater inequality in wealth distribution, with 1 (or 100, on the percentage scale the World Bank reports) being complete inequality and 0 being complete equality.
The World Bank maintains this data:
http://databank.worldbank.org/data/reports.aspx?source=2&series=SI.POV.GINI&country=#
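As a quick illustration of what the coefficient captures, here is a minimal sketch (on a made-up income vector, not part of the World Bank's methodology) that computes a GINI coefficient from the mean absolute difference between every pair of incomes:
# Hedged sketch: GINI coefficient of a toy income vector, using
# G = sum over all pairs |x_i - x_j| / (2 * n^2 * mean(x))
gini_coef <- function(x) {
  n <- length(x)
  sum(abs(outer(x, x, "-"))) / (2 * n^2 * mean(x))
}
incomes <- c(10, 20, 30, 40, 100)  # hypothetical incomes
gini_coef(incomes)                 # 0.4; multiply by 100 for the 0-100 scale used below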
Import dataset from a .csv file.
GINI.rawfile <- read.csv("https://raw.githubusercontent.com/jcp9010/MSDA/master/GINI%20Index.csv", header = TRUE)
head(GINI.rawfile)
## Series.Name Series.Code Country.Name Country.Code
## 1 GINI index (World Bank estimate) SI.POV.GINI Afghanistan AFG
## 2 GINI index (World Bank estimate) SI.POV.GINI Albania ALB
## 3 GINI index (World Bank estimate) SI.POV.GINI Algeria DZA
## 4 GINI index (World Bank estimate) SI.POV.GINI American Samoa ASM
## 5 GINI index (World Bank estimate) SI.POV.GINI Andorra ADO
## 6 GINI index (World Bank estimate) SI.POV.GINI Angola AGO
## X1990..YR1990. X2000..YR2000. X2007..YR2007. X2008..YR2008.
## 1 .. .. .. ..
## 2 .. .. .. 29.98
## 3 .. .. .. ..
## 4 .. .. .. ..
## 5 .. .. .. ..
## 6 .. 51.96 .. 42.72
## X2009..YR2009. X2010..YR2010. X2011..YR2011. X2012..YR2012.
## 1 .. .. .. ..
## 2 .. .. .. 28.96
## 3 .. .. .. ..
## 4 .. .. .. ..
## 5 .. .. .. ..
## 6 .. .. .. ..
## X2013..YR2013. X2014..YR2014. X2015..YR2015. X2016..YR2016.
## 1 .. .. .. ..
## 2 .. .. .. ..
## 3 .. .. .. ..
## 4 .. .. .. ..
## 5 .. .. .. ..
## 6 .. .. .. ..
The data set is clearly 'untidy' given that each year has its own column, with each cell holding that country's GINI index for that year. Some columns are redundant for our purposes (i.e. Series.Name and Series.Code). The next few steps 'tidy' up the data and delete the redundant columns.
Given that this is a 'tidyr' and 'dplyr' project, we will use these two packages to reshape and query our data frame. The original data frame has one column per year: 1990, 2000, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016. We will use the gather() function from tidyr to 'tidy' up the data.
# First, we will rename the columns so that they are easier to read.
colnames(GINI.rawfile) <- c("Series.Name", "Series.Code", "Country.Name", "Country.Code", 1990, 2000, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016)
# Then we will replace ".." with NA
GINI.rawfile[GINI.rawfile == '..'] <- NA
# Next we will gather the data and then eliminate the columns "Series.Name" and "Series.Code"
GINI <- GINI.rawfile %>% gather(Year, GINI_Index, c(5:16), na.rm = TRUE) %>% group_by(Country.Name) %>% select(-c(Series.Name, Series.Code)) %>% arrange(Country.Name)
# We must remember that all of the GINI_Index numbers are saved as characters. We must convert them into numbers so that we can perform calculations and statistical analysis.
GINI <- transform(GINI, GINI_Index = as.numeric(GINI_Index))
head(GINI)
## Country.Name Country.Code Year GINI_Index
## 1 Albania ALB 2008 29.98
## 2 Albania ALB 2012 28.96
## 3 Angola AGO 2000 51.96
## 4 Angola AGO 2008 42.72
## 5 Argentina ARG 2000 51.06
## 6 Argentina ARG 2007 47.37
GINI is now in a 'tidy' format that can be used for analysis.
According to the thread, an analysis that could be performed is the trend in the GINI coefficient for each country (or continent), or the average GINI coefficient.
Let's demonstrate the trend in GINI coefficient scores for a few countries.
# In this example, we will use a subset of the data. In this case, let's use multiple countries: United States, Slovak Republic, Spain, United Kingdom, Uruguay, Russian Federation, Montenegro, Mexico, China
GINI.subset <- GINI %>% filter(Country.Name %in% c('United States', 'Slovak Republic', 'Spain', 'United Kingdom', 'Uruguay', 'Russian Federation', 'Montenegro', 'Mexico', 'China'))
ggplot(GINI.subset, aes(x = Year, y = GINI_Index)) + geom_jitter(width = 0.3, height = 0.3, aes(color = as.factor(Country.Name))) + geom_line(aes(group = Country.Name), lty = 2, color = "blue") + labs(title = "Subset of Countries: GINI Index Plots", x = "Year", y = "GINI Index")
As you can see, the countries in this subset did not vary significantly over the years. Let's see what happens if we plot the entire set!
ggplot(GINI, aes(x = Year, y = GINI_Index)) + geom_jitter(width = 0.3, height = 0.3, aes(color = as.factor(Country.Name))) + geom_line(aes(group = Country.Name), lty = 2, color = "purple") + labs(title = "All Countries: GINI Index Plots", x = "Year", y = "GINI Index") + theme(legend.position = "none")
The above is a huge mess and difficult to use for analysis, although it does suggest that most countries' GINI indices stay more or less the same over the course of several years. We are better off creating subsets and building our analysis from there (as in the example above).
Let’s find the average GINI index for each country. We will use the dplyr package for this process.
GINI.avg.per.country <- GINI %>% group_by(Country.Name) %>% summarise(AVG_GINI = mean(GINI_Index))
GINI.avg.per.country
## # A tibble: 140 × 2
## Country.Name AVG_GINI
## <fctr> <dbl>
## 1 Albania 29.47000
## 2 Angola 47.34000
## 3 Argentina 45.05333
## 4 Armenia 30.75125
## 5 Australia 35.28500
## 6 Austria 30.67667
## 7 Azerbaijan 31.79000
## 8 Bangladesh 32.77000
## 9 Belarus 28.03444
## 10 Belgium 28.63667
## # ... with 130 more rows
We note that this is in alphabetical order, which is fine, but not very useful if you are trying to find the maximum or minimum GINI index.
# Here we will use the arrange function in dplyr to find the countries with the minimum and maximum average GINI index.
par(mfrow = c(1,2))
GINI.avg.per.country.order <- GINI %>% group_by(Country.Name) %>% summarise(AVG_GINI = mean(GINI_Index)) %>% arrange(AVG_GINI)
head(GINI.avg.per.country.order)
## # A tibble: 6 × 2
## Country.Name AVG_GINI
## <fctr> <dbl>
## 1 Slovenia 24.72000
## 2 Ukraine 25.21500
## 3 Slovak Republic 26.14333
## 4 Czech Republic 26.27000
## 5 Norway 26.48333
## 6 Sweden 27.00500
GINI.avg.per.country.order.max <- GINI.avg.per.country.order %>% arrange(desc(AVG_GINI))
head(GINI.avg.per.country.order.max)
## # A tibble: 6 × 2
## Country.Name AVG_GINI
## <fctr> <dbl>
## 1 South Africa 61.38667
## 2 Namibia 60.97000
## 3 Haiti 60.79000
## 4 Botswana 60.46000
## 5 Central African Republic 56.24000
## 6 Zambia 55.62000
Interestingly, five of the top 6 countries with the worst GINI indices (the higher the number, the worse the inequality) are in Africa, with Haiti the lone exception, while most of the top 6 countries with the best GINI indices are in Eastern Europe, where the Soviet Union and Communism once held their grip. A country's GINI index is likely shaped largely by its history.
paste0("The AVERAGE world GINI index is: ", round(mean(GINI.avg.per.country$AVG_GINI), 2))
## [1] "The AVERAGE world GINI index is: 39.34"
GINI.USA <- GINI %>% filter(Country.Name == 'United States')
paste0("The AVERAGE USA GINI index is: ", round(mean(GINI.USA$GINI_Index), 2))
## [1] "The AVERAGE USA GINI index is: 40.93"
The world average and the US average are pretty close. However, at what point (high or low) does a GINI index start to become unusual, or an outlier?
We could plot all of the countries' AVG_GINI scores in a boxplot and see whether any fall out as outliers (a sketch appears after the plots below). Let's start with a summary using the describe() function from the psych package and the summary() function.
GINI.stat <- describe(GINI.avg.per.country$AVG_GINI)
GINI.qq <- summary(GINI.avg.per.country$AVG_GINI)
GINI.stat
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 140 39.34 8.57 38.23 38.75 8.9 24.72 61.39 36.67 0.55 -0.41
## se
## X1 0.72
GINI.qq
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 24.72 32.82 38.24 39.34 44.87 61.39
This function provides an impressive amount of descriptive statistics; of note is a skew of 0.55. Let's take a look at the histogram and Q-Q plot to determine whether the distribution looks approximately normal or skewed.
# In this example, we will use base R hist() instead of geom_histogram() in ggplot2
hist(GINI.avg.per.country$AVG_GINI, prob = TRUE, main = "GINI Average per Country", xlab = "GINI Index")
x <- seq(20, 70, length = 10000)
y <- dnorm(x, mean = GINI.stat$mean, sd = GINI.stat$sd)
lines(x, y, type = 'l', lwd = 2, col = 'blue')
qqnorm(GINI.avg.per.country$AVG_GINI)
qqline(GINI.avg.per.country$AVG_GINI)
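Here is the boxplot mentioned earlier. This is a minimal sketch that was not part of the original figures; boxplot.stats() lists any points that fall beyond 1.5 times the IQR from the quartiles.
# Sketch: boxplot of the per-country average GINI index, plus any flagged outliers
boxplot(GINI.avg.per.country$AVG_GINI, main = "Average GINI Index per Country", ylab = "GINI Index")
boxplot.stats(GINI.avg.per.country$AVG_GINI)$out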
There appears to be a slight right skew to this dataset. Given that the sample size is 140, we may potentially (keyword: potentially) overlook the right skew and proceed with the statistical analysis. In this case, I would like to see whether South Africa's GINI index is particularly unusual for the world, so we will check how many standard deviations South Africa is away from the mean.
# Let's obtain the GINI mean index value for South Africa
South.Africa.GINI <- GINI.avg.per.country[GINI.avg.per.country$Country.Name == 'South Africa', 'AVG_GINI']
paste("South Africa Average GINI index: ", round(South.Africa.GINI, 2))
## [1] "South Africa Average GINI index: 61.39"
# Calculate the z-score and percentile.
world.mean <- mean(GINI.avg.per.country$AVG_GINI)
South.Africa.Z <- (South.Africa.GINI - world.mean)/GINI.stat$sd
South.Africa.prob <- pnorm(South.Africa.Z$AVG_GINI, mean = 0, sd = 1)
paste("South Africa is", round(South.Africa.Z$AVG_GINI,2), "standard deviations away from the mean and is", round(South.Africa.prob, 2) * 100, "percentile.")
## [1] "South Africa is 2.57 standard deviations away from the mean and is 99 percentile."
It looks like South Africa has some work to do to balance out the wealth inequality in the country. According to Wikipedia, South Africa has a significant population on welfare and significant wealth inequality. The country's history was mired in internal conflict and apartheid, and I suspect these issues have likely contributed to South Africa's inequality of wealth and its resulting GINI index score.
To help visualize the countries and their GINI indices, below is a map of all the countries and their average GINI index. The "redder" the color, the higher the index. Again, it's interesting to see where each country falls.
GINI.map <- GINI %>% group_by(Country.Code) %>% summarise(AVG_GINI = mean(GINI_Index))
data(countryExData)
map <- joinCountryData2Map(GINI.map, joinCode = "ISO3", nameJoinColumn = "Country.Code")
## 135 codes from your data successfully matched countries in the map
## 5 codes from your data failed to match with a country code in the map
## 108 codes from the map weren't represented in your data
par(mai=c(0,0,0.2,0),xaxs="i",yaxs="i")
mapCountryData(map, nameColumnToPlot="AVG_GINI")
Let's move on to something different: the New York Cash 4 Life lottery. Who wouldn't want to win? The New York State Government keeps track of all the winning numbers drawn, and this database contains drawings from June 2014 to March 2017.
So how do you play Cash 4 Life? Each game costs $2. Players choose (or have the terminal select the numbers, which is known as “quick pick” in Maryland, New Jersey, New York, Pennsylvania, and Tennessee; and “easy pick” in Virginia) 5 of 60 numbers in the main field, and 1 of 4 (hence the game’s name) green “Cash Ball” numbers in a second field. Matching all six numbers wins, or shares (“split-prize liability”), the equivalent of $1,000-per-day-for-life, or $7,000,000 cash, at the winner’s option. Second prize, however, can have multiple winners of $1,000-per-week-for-life and/or $1,000,000 cash. (Wikipedia)
The state publishes the odds of winning each prize tier; those figures are used in the expected-value calculation further below.
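As a quick sanity check on those published figures (a sketch, not part of the original post), the top-tier odds follow directly from the game's structure of 5-of-60 plus 1-of-4:
# Sketch: odds implied by choosing 5 of 60 numbers and 1 of 4 Cash Ball numbers
choose(60, 5) * 4      # 21,846,048 possible tickets -> top prize odds
choose(60, 5) * 4 / 3  # 7,282,016 -> match all 5 but miss the Cash Ball (second prize)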
Wouldn't it be great if we could see which ball numbers came up more frequently? If we knew that, how great would it be if we could tilt the system in our favor and increase the odds of winning Cash 4 Life? Though, in theory, this game is based on independence and fairness, it wouldn't be the first time in history that something shady was going on and a system was rigged. (Look at my home state of New Jersey; its politics are all messed up!)
Let's get to work and import the .csv. https://data.ny.gov/Government-Finance/Lottery-Cash-4-Life-Winning-Numbers-Beginning-2014/kwxv-fwze This dataset was downloaded and subsequently placed into my github account.
# Reading the .csv file from my github page
lottery <- read.csv("https://raw.githubusercontent.com/jcp9010/MSDA/master/LotteryCash4Life2014.csv", header = TRUE)
# This shows the first 6 rows of this data.frame
head(lottery)
## Draw.Date Winning.Numbers Cash.Ball
## 1 6/16/14 09 36 44 53 59 3
## 2 6/19/14 08 13 43 56 60 2
## 3 6/23/14 05 16 21 33 47 4
## 4 6/26/14 15 22 51 52 58 3
## 5 6/30/14 01 04 10 28 33 2
## 6 7/3/14 08 10 25 28 31 2
While this format is perfectly acceptable for posting on a website, it needs some tidying before we can perform any analysis. I am going to separate each number using the stringr library (and regular expressions). Though each ball is assumed to be independent with an equal chance of being picked, I want to see which numbers came up most frequently (and least frequently).
# We will separate all the winning numbers and create separate columns i.e. in row 1, Ball 1: 09, Ball 2: 36, Ball 3: 44, Ball 4: 53, Ball 5: 59
lottery.separated <- unlist(str_extract_all(lottery$Winning.Numbers, "\\d{2}"))
lottery.separated <- as.numeric(lottery.separated)
lottery2 <- matrix(lottery.separated, ncol = 5, byrow = TRUE)
lottery2 <- as.data.frame(lottery2)
lottery.untidy <- data.frame(lottery$Draw.Date, lottery2, lottery$Cash.Ball)
colnames(lottery.untidy) <- c("Draw.Date", 1, 2, 3, 4, 5, "Cash.Ball")
# As you can see, this separated all the numbers into its own fields
head(lottery.untidy)
## Draw.Date 1 2 3 4 5 Cash.Ball
## 1 6/16/14 9 36 44 53 59 3
## 2 6/19/14 8 13 43 56 60 2
## 3 6/23/14 5 16 21 33 47 4
## 4 6/26/14 15 22 51 52 58 3
## 5 6/30/14 1 4 10 28 33 2
## 6 7/3/14 8 10 25 28 31 2
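As an aside, the same split can be done in one step with tidyr's separate(); this is a hedged alternative sketch, not the route taken above.
# Alternative sketch: split the space-separated winning numbers directly with tidyr
lottery.alt <- lottery %>% separate(Winning.Numbers, into = as.character(1:5), sep = " ", convert = TRUE)
head(lottery.alt)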
So now we have created an untidy data frame with all of the Cash 4 Life winning numbers. We will now utilize dplyr and tidyr to 'tidy' up the data. For now I am only interested in the first 5 drawn numbers, not the Cash.Ball, so we will drop Cash.Ball from this tidy dataset.
lottery.tidy <- lottery.untidy %>% gather(BallOrder, Number, 2:6) %>% select(-Cash.Ball)
head(lottery.tidy)
## Draw.Date BallOrder Number
## 1 6/16/14 1 9
## 2 6/19/14 1 8
## 3 6/23/14 1 5
## 4 6/26/14 1 15
## 5 6/30/14 1 1
## 6 7/3/14 1 8
Of note, the thread did not specify an exact question, so I will be posing my own questions for this data set.
Now, let’s start analyzing!
# First, we will perform a quick visualization of the data.
# The fastest way here is a histogram, which will let us spot any ball numbers that were drawn unusually often or rarely.
hist(lottery.tidy$Number, breaks = 60, main = "Number Drawn", xlab = "Ball Number", ylim = c(0,40), col = 'lightgreen')
Ball.Num <- as.numeric(lottery.tidy$Number)
Ball.Num <- as.data.frame(table(Ball.Num))
head(Ball.Num)
## Ball.Num Freq
## 1 1 22
## 2 2 16
## 3 3 28
## 4 4 27
## 5 5 30
## 6 6 26
Now we have the frequency for each ball number. Let's find out some more information about these numbers.
Freq.Drawn <- Ball.Num %>% arrange(desc(Freq))
Freq.Drawn
## Ball.Num Freq
## 1 43 37
## 2 37 35
## 3 8 34
## 4 11 31
## 5 56 31
## 6 5 30
## 7 38 30
## 8 40 30
## 9 19 29
## 10 55 29
## 11 3 28
## 12 20 28
## 13 24 28
## 14 25 28
## 15 4 27
## 16 17 27
## 17 28 27
## 18 6 26
## 19 9 26
## 20 32 25
## 21 49 25
## 22 7 24
## 23 12 24
## 24 16 24
## 25 27 24
## 26 47 24
## 27 14 23
## 28 22 23
## 29 26 23
## 30 29 23
## 31 30 23
## 32 34 23
## 33 36 23
## 34 44 23
## 35 48 23
## 36 54 23
## 37 1 22
## 38 13 22
## 39 35 22
## 40 57 22
## 41 58 22
## 42 60 22
## 43 10 21
## 44 21 21
## 45 23 21
## 46 41 21
## 47 45 21
## 48 51 21
## 49 53 21
## 50 59 21
## 51 31 19
## 52 46 19
## 53 18 18
## 54 52 18
## 55 33 17
## 56 39 17
## 57 2 16
## 58 50 16
## 59 15 15
## 60 42 9
paste("The Most Drawn Ball is: ", Freq.Drawn$Ball.Num[1])
## [1] "The Most Drawn Ball is: 43"
paste("The Least Drawn Ball is: ", Freq.Drawn$Ball.Num[length(Freq.Drawn$Ball.Num)])
## [1] "The Least Drawn Ball is: 42"
Ball.stat <- Freq.Drawn %>% summarise(Mean= mean(Freq))
Ball.stat <- as.numeric(Ball.stat)
paste("A number was drawn on average: ", Ball.stat)
## [1] "A number was drawn on average: 23.75"
So, at least from the histogram, the majority of the numbers appear to be drawn fairly evenly; with 285 drawings of 5 numbers each, a fair game would give each of the 60 numbers an expected 285 * 5 / 60 = 23.75 appearances, which matches the mean above. But as you can see, ball 43 came up very often and ball 42 did not over the course of roughly three years. Why does 43 occur so frequently and 42 so rarely? Is this ordinary variance or a statistical anomaly?
We'll state a null and alternative hypothesis. Using a two-tailed test (alpha = 0.05), we will determine whether there is a statistical anomaly.
The null hypothesis is that the counts for numbers 43 and 42 reflect ordinary variance and are not statistical outliers.
The alternative hypothesis is that the NY State Lottery system is rigged and weights numbers 43 and 42 differently from the rest.
# Calculating the standard deviation, which we will use to calculate the Z-score
Ball.sd <- sd(Freq.Drawn$Freq)
# The Z-score is obtained and converted into a two-tailed p-value, Z.43.p
Z.43 <- (37 - Ball.stat)/Ball.sd
Z.43 <- pnorm(Z.43, mean = 0, sd = 1)
Z.43.p <- (1 - Z.43) * 2
# Again, the same process takes place here, except that this time, it is for ball 42
Z.42 <- (9 - Ball.stat)/Ball.sd
Z.42 <- pnorm(Z.42, mean = 0, sd = 1)
Z.42.p <- (Z.42) * 2
paste("The P-value for Ball 43: ", Z.43.p)
## [1] "The P-value for Ball 43: 0.00866536278733454"
paste("The P-value for Ball 42: ", Z.42.p)
## [1] "The P-value for Ball 42: 0.00347624721410393"
Both of these values are less than alpha = 0.05, making them statistically significant. Now, does this actually mean that the NYS Lottery System is rigged? The numbers make a strong case for rejecting the null hypothesis; perhaps the mafia did indeed rig the Cash 4 Life lottery system.
Even though the p-values suggest that we reject the null hypothesis, we have to remember that this lottery has only had about 285 drawings since June 2014, which is hardly a large sample (given that the number of possible 5-ball combinations out of 60 is huge!), and that we singled out the most and least frequent of 60 numbers after the fact, which by itself inflates the chance of seeing a small p-value. I may be tempted to choose 43 for my future Cash 4 Life tickets, but I need to keep in mind that this may well be a type I error (despite the very low p-values). We will likely need to revisit this topic in a couple of years, once more data has been collected, and rerun the calculations to see if this holds up.
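As a further check that avoids singling out the extremes after the fact (a sketch that was not part of the original analysis), a chi-square goodness-of-fit test can ask whether the whole table of 60 counts is consistent with a fair, uniform draw:
# Sketch: test all 60 ball frequencies against a uniform expectation at once
chisq.test(Ball.Num$Freq, p = rep(1/60, 60))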
Perhaps we should step back and ask the more important question: should we even play this game at all? The gambling industry is built so that the odds are in the casino's (or, in this case, the state's) favor. If we played the game over the long run, on average we would all turn out to be losers.
Let’s calculate this out using the information provided above.
# Let's calculate the expected value
# Let's assume that "for life" means 30 years
Ex.Value <- (1/13)*(2) + (1/28)*(4) + (1/83)*(10) + (1/490)*(25) + (1/1471)*(100) + (1/26480)*(500) + (1/79440)*(2500) + (1/7282016)*(1000*30*52) + (1/21846048)*(1000*30*52*7) - 2
paste("The amount on average you expect to win/lose each time you play: $", round(Ex.Value, 2))
## [1] "The amount on average you expect to win/lose each time you play: $ -0.7"
So despite the exciting prospect of potentially having $1,000 a day for the rest of your life, the odds are not in your favor: on average you lose about $0.70 (roughly a 35% house edge on the $2 ticket) every time you play.
So if we can't find happiness in the lottery, maybe we should look somewhere else. Maybe there is something more to life than gambling. Fortunately, there is a great dataset on the Kaggle website that provides this type of information: the "World Happiness Report".
https://www.kaggle.com/unsdsn/world-happiness
The World Happiness Report is a landmark survey of the state of global happiness. The first report was published in 2012, the second in 2013, and the third in 2015. The World Happiness Report 2016 Update, which ranks 156 countries by their happiness levels, was released in Rome in advance of UN World Happiness Day, March 20th. Leading experts across fields (economics, psychology, survey analysis, national statistics, health, public policy and more) describe how measurements of well-being can be used effectively to assess the progress of nations. The reports review the state of happiness in the world today and show how the new science of happiness explains personal and national variations in happiness. They reflect a new worldwide demand for more attention to happiness as a criterion for government policy.
The happiness scores and rankings use data from the Gallup World Poll. The scores are based on answers to the main life evaluation question asked in the poll. This question, known as the Cantril ladder, asks respondents to think of a ladder with the best possible life for them being a 10 and the worst possible life being a 0 and to rate their own current lives on that scale. The scores are from nationally representative samples for the years 2013-2015 and use the Gallup weights to make the estimates representative. The columns following the happiness score estimate the extent to which each of six factors – economic production, social support, life expectancy, freedom, absence of corruption, and generosity – contribute to making life evaluations higher in each country than they are in Dystopia, a hypothetical country that has values equal to the world’s lowest national averages for each of the six factors. They have no impact on the total score reported for each country, but they do explain why some countries rank higher than others.
The thread on Blackboard asks us to analyze this dataset and to see whether there is any potential correlation between happiness and the economy and/or the freedom of a country.
From the thread: “I think it would be interesting to see if there’s a correlation between happiness score and economy, and happiness score and freedom. Those were two quick things that came to mind.”
For these two datasets, I loaded the .csv files into MySQL and then used the RMySQL package to pull the two tables (Happy2015 and Happy2016) into R.
Below is a link to my .SQL file. https://raw.githubusercontent.com/jcp9010/MSDA/master/happy.sql
# Establishing a connection with MySQL
con <- dbConnect(RMySQL::MySQL(), dbname = "happiness", user = "root", password = "MSDA", host = "localhost")
dbListTables(con)
## [1] "Happy2015" "Happy2016"
# Creating SQL queries
Happy2015.query <- "SELECT * FROM Happy2015;"
Happy2016.query <- "SELECT * FROM Happy2016;"
# Assigning SQL tables to data frames in R
Happy2015 <- dbGetQuery(con, Happy2015.query)
Happy2016 <- dbGetQuery(con, Happy2016.query)
# Disconnect from MySQL
dbDisconnect(con)
## [1] TRUE
# Presenting the first 6 rows from each data frame
head(Happy2015)
## COUNTRY REGION HAPPINESS_RANK
## 1 Afghanistan Southern Asia 153
## 2 Albania Central and Eastern Europe 95
## 3 Algeria Middle East and Northern Africa 68
## 4 Angola Sub-Saharan Africa 137
## 5 Argentina Latin America and Caribbean 30
## 6 Armenia Central and Eastern Europe 127
## HAPPINESS_SCORE STANDARD_ERROR ECONOMY_GDP FAMILY HEALTH FREEDOM
## 1 3.575 0.03084 0.31982 0.30285 0.30335 0.23414
## 2 4.959 0.05013 0.87867 0.80434 0.81325 0.35733
## 3 5.605 0.05099 0.93929 1.07772 0.61766 0.28579
## 4 4.033 0.04758 0.75778 0.86040 0.16683 0.10384
## 5 6.574 0.04612 1.05351 1.24823 0.78723 0.44974
## 6 4.350 0.04763 0.76821 0.77711 0.72990 0.19847
## TRUST GENEROSITY DYSTOPIA
## 1 0.09719 0.36510 1.95210
## 2 0.06413 0.14272 1.89894
## 3 0.17383 0.07822 2.43209
## 4 0.07122 0.12344 1.94939
## 5 0.08484 0.11451 2.83600
## 6 0.03900 0.07855 1.75873
head(Happy2016)
## COUNTRY REGION HAPPINESS_RANK
## 1 Afghanistan Southern Asia 154
## 2 Albania Central and Eastern Europe 109
## 3 Algeria Middle East and Northern Africa 38
## 4 Angola Sub-Saharan Africa 141
## 5 Argentina Latin America and Caribbean 26
## 6 Armenia Central and Eastern Europe 121
## HAPPINESS_SCORE LOWER_CONFIDENCE UPPER_CONFIDENCE ECONOMY_GDP FAMILY
## 1 3.360 3.288 3.432 0.38227 0.11037
## 2 4.655 4.546 4.764 0.95530 0.50163
## 3 6.355 6.227 6.483 1.05266 0.83309
## 4 3.866 3.753 3.979 0.84731 0.66366
## 5 6.650 6.560 6.740 1.15137 1.06612
## 6 4.360 4.266 4.454 0.86086 0.62477
## HEALTH FREEDOM TRUST GENEROSITY DYSTOPIA
## 1 0.17344 0.16430 0.07112 0.31268 2.14558
## 2 0.73007 0.31866 0.05301 0.16840 1.92816
## 3 0.61804 0.21006 0.16157 0.07044 3.40904
## 4 0.04991 0.00589 0.08434 0.12071 2.09459
## 5 0.69711 0.42284 0.07296 0.10989 3.12985
## 6 0.64083 0.14037 0.03616 0.07793 1.97864
I realize that loading from a local MySQL database does not necessarily lend itself to reproducibility, so please use the code below if needed. I have intentionally placed this code in comments; all you would need to do is delete the #, and the code should work.
#Happy2015 <- read.csv("https://raw.githubusercontent.com/jcp9010/MSDA/master/2015.csv", header = TRUE)
#colnames(Happy2015) <- c("COUNTRY", "REGION", "HAPPINESS_RANK", "HAPPINESS_SCORE", "STANDARD_ERROR", "ECONOMY_GDP", "FAMILY", "HEALTH", "FREEDOM", "TRUST", "GENEROSITY", "DYSTOPIA")
#Happy2016 <- read.csv("https://raw.githubusercontent.com/jcp9010/MSDA/master/2016.csv", header = TRUE)
#colnames(Happy2016) <- c("COUNTRY", "REGION", "HAPPINESS_RANK", "HAPPINESS_SCORE", "LOWER_CONFIDENCE", "UPPER_CONFIDENCE", "ECONOMY_GDP", "FAMILY", "HEALTH", "FREEDOM", "TRUST", "GENEROSITY", "DYSTOPIA")
If you happen to take a look at the data, you will notice that it is actually already in a tidy format, which is great if I were doing data wrangling for a company or university: it would save a tremendous amount of time and make performing functions much easier. That being said, this project is about creating a tidy data frame using the tidyr and dplyr packages, so in this exercise I will first make the data "untidy".
I have two separate data frames, Happy2015 and Happy2016. I want to combine the two data sets and then tidy the data right back up. Ultimately, what I would like to do is attempt to correlate the economy and freedom with a country's happiness and see if a correlation exists.
Happy2015 <- Happy2015 %>% select(COUNTRY, HAPPINESS_RANK, HAPPINESS_SCORE, ECONOMY_GDP, FREEDOM)
Happy2016 <- Happy2016 %>% select(COUNTRY, HAPPINESS_RANK, HAPPINESS_SCORE, ECONOMY_GDP, FREEDOM)
colnames(Happy2015) <- c("COUNTRY", "HAPPINESS_RANK2015", "HAPPINESS_SCORE2015", "ECONOMY_GDP2015", "FREEDOM2015")
colnames(Happy2016) <- c("COUNTRY", "HAPPINESS_RANK2016", "HAPPINESS_SCORE2016", "ECONOMY_GDP2016", "FREEDOM2016")
Happy <- merge(Happy2015, Happy2016, by = "COUNTRY")
head(Happy)
## COUNTRY HAPPINESS_RANK2015 HAPPINESS_SCORE2015 ECONOMY_GDP2015
## 1 Afghanistan 153 3.575 0.31982
## 2 Albania 95 4.959 0.87867
## 3 Algeria 68 5.605 0.93929
## 4 Angola 137 4.033 0.75778
## 5 Argentina 30 6.574 1.05351
## 6 Armenia 127 4.350 0.76821
## FREEDOM2015 HAPPINESS_RANK2016 HAPPINESS_SCORE2016 ECONOMY_GDP2016
## 1 0.23414 154 3.360 0.38227
## 2 0.35733 109 4.655 0.95530
## 3 0.28579 38 6.355 1.05266
## 4 0.10384 141 3.866 0.84731
## 5 0.44974 26 6.650 1.15137
## 6 0.19847 121 4.360 0.86086
## FREEDOM2016
## 1 0.16430
## 2 0.31866
## 3 0.21006
## 4 0.00589
## 5 0.42284
## 6 0.14037
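Since the thread's question is explicitly about correlation, here is a quick hedged check on the merged data frame (a sketch, separate from the tidying exercise that follows):
# Sketch: Pearson correlations between the happiness score and each factor
cor(Happy$HAPPINESS_SCORE2015, Happy$ECONOMY_GDP2015)
cor(Happy$HAPPINESS_SCORE2015, Happy$FREEDOM2015)
cor(Happy$HAPPINESS_SCORE2016, Happy$ECONOMY_GDP2016)
cor(Happy$HAPPINESS_SCORE2016, Happy$FREEDOM2016)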
I want to take a subset of this data frame and look further into the economy GDP. The postulation is that happier countries have better economies (represented by higher GDP scores). If we look at the 2015 and 2016 data points, we can get a better idea of whether this is actually the case.
# Now let's look at the economy GDP for 2015 and 2016
Happy.Economy <- Happy %>% group_by(COUNTRY) %>% select(COUNTRY, HAPPINESS_RANK2015, HAPPINESS_RANK2016, ECONOMY_GDP2015, ECONOMY_GDP2016)
head(Happy.Economy)
## Source: local data frame [6 x 5]
## Groups: COUNTRY [6]
##
## COUNTRY HAPPINESS_RANK2015 HAPPINESS_RANK2016 ECONOMY_GDP2015
## <chr> <int> <int> <dbl>
## 1 Afghanistan 153 154 0.31982
## 2 Albania 95 109 0.87867
## 3 Algeria 68 38 0.93929
## 4 Angola 137 141 0.75778
## 5 Argentina 30 26 1.05351
## 6 Armenia 127 121 0.76821
## # ... with 1 more variables: ECONOMY_GDP2016 <dbl>
Happy.Economy.2015 <- Happy.Economy %>% gather(Happiness_Rank_Year, Rank, c(2:3)) %>% filter(Happiness_Rank_Year == "HAPPINESS_RANK2015") %>% select(COUNTRY, Happiness_Rank_Year, Rank, ECONOMY_GDP2015) %>% arrange(Rank)
Happy.Economy.2016 <- Happy.Economy %>% gather(Happiness_Rank_Year, Rank, c(2:3)) %>% filter(Happiness_Rank_Year == "HAPPINESS_RANK2016") %>% select(COUNTRY, Happiness_Rank_Year, Rank, ECONOMY_GDP2016) %>% arrange(Rank)
head(Happy.Economy.2015)
## Source: local data frame [6 x 4]
## Groups: COUNTRY [6]
##
## COUNTRY Happiness_Rank_Year Rank ECONOMY_GDP2015
## <chr> <chr> <int> <dbl>
## 1 Switzerland HAPPINESS_RANK2015 1 1.39651
## 2 Iceland HAPPINESS_RANK2015 2 1.30232
## 3 Denmark HAPPINESS_RANK2015 3 1.32548
## 4 Norway HAPPINESS_RANK2015 4 1.45900
## 5 Canada HAPPINESS_RANK2015 5 1.32629
## 6 Finland HAPPINESS_RANK2015 6 1.29025
head(Happy.Economy.2016)
## Source: local data frame [6 x 4]
## Groups: COUNTRY [6]
##
## COUNTRY Happiness_Rank_Year Rank ECONOMY_GDP2016
## <chr> <chr> <int> <dbl>
## 1 Denmark HAPPINESS_RANK2016 1 1.44178
## 2 Switzerland HAPPINESS_RANK2016 2 1.52733
## 3 Iceland HAPPINESS_RANK2016 3 1.42666
## 4 Norway HAPPINESS_RANK2016 4 1.57744
## 5 Finland HAPPINESS_RANK2016 5 1.40598
## 6 Canada HAPPINESS_RANK2016 6 1.44015
Now we will do the opposite and check whether a low economic GDP goes along with unhappiness.
# Let's look at the least happy countries and their Economic GDP
Least.Happy.2015 <- Happy.Economy.2015 %>% arrange(desc(Rank))
Least.Happy.2016 <- Happy.Economy.2016 %>% arrange(desc(Rank))
head(Least.Happy.2015)
## Source: local data frame [6 x 4]
## Groups: COUNTRY [6]
##
## COUNTRY Happiness_Rank_Year Rank ECONOMY_GDP2015
## <chr> <chr> <int> <dbl>
## 1 Togo HAPPINESS_RANK2015 158 0.20868
## 2 Burundi HAPPINESS_RANK2015 157 0.01530
## 3 Syria HAPPINESS_RANK2015 156 0.66320
## 4 Benin HAPPINESS_RANK2015 155 0.28665
## 5 Rwanda HAPPINESS_RANK2015 154 0.22208
## 6 Afghanistan HAPPINESS_RANK2015 153 0.31982
head(Least.Happy.2016)
## Source: local data frame [6 x 4]
## Groups: COUNTRY [6]
##
## COUNTRY Happiness_Rank_Year Rank ECONOMY_GDP2016
## <chr> <chr> <int> <dbl>
## 1 Burundi HAPPINESS_RANK2016 157 0.06831
## 2 Syria HAPPINESS_RANK2016 156 0.74719
## 3 Togo HAPPINESS_RANK2016 155 0.28123
## 4 Afghanistan HAPPINESS_RANK2016 154 0.38227
## 5 Benin HAPPINESS_RANK2016 153 0.39499
## 6 Rwanda HAPPINESS_RANK2016 152 0.32846
Again, just as we did with the economic GDP, we will perform a similar data tidying with the Freedom score.
Happy.Freedom <- Happy %>% group_by(COUNTRY) %>% select(COUNTRY, HAPPINESS_RANK2015, HAPPINESS_RANK2016, FREEDOM2015, FREEDOM2016)
head(Happy.Freedom)
## Source: local data frame [6 x 5]
## Groups: COUNTRY [6]
##
## COUNTRY HAPPINESS_RANK2015 HAPPINESS_RANK2016 FREEDOM2015
## <chr> <int> <int> <dbl>
## 1 Afghanistan 153 154 0.23414
## 2 Albania 95 109 0.35733
## 3 Algeria 68 38 0.28579
## 4 Angola 137 141 0.10384
## 5 Argentina 30 26 0.44974
## 6 Armenia 127 121 0.19847
## # ... with 1 more variables: FREEDOM2016 <dbl>
Happy.Freedom.2015 <- Happy.Freedom %>% gather(Happiness_Rank_Year, Rank, c(2:3)) %>% filter(Happiness_Rank_Year == "HAPPINESS_RANK2015") %>% select(COUNTRY, Happiness_Rank_Year, Rank, FREEDOM2015) %>% arrange(Rank)
Happy.Freedom.2016 <- Happy.Freedom %>% gather(Happiness_Rank_Year, Rank, c(2:3)) %>% filter(Happiness_Rank_Year == "HAPPINESS_RANK2016") %>% select(COUNTRY, Happiness_Rank_Year, Rank, FREEDOM2016) %>% arrange(Rank)
head(Happy.Freedom.2015)
## Source: local data frame [6 x 4]
## Groups: COUNTRY [6]
##
## COUNTRY Happiness_Rank_Year Rank FREEDOM2015
## <chr> <chr> <int> <dbl>
## 1 Switzerland HAPPINESS_RANK2015 1 0.66557
## 2 Iceland HAPPINESS_RANK2015 2 0.62877
## 3 Denmark HAPPINESS_RANK2015 3 0.64938
## 4 Norway HAPPINESS_RANK2015 4 0.66973
## 5 Canada HAPPINESS_RANK2015 5 0.63297
## 6 Finland HAPPINESS_RANK2015 6 0.64169
head(Happy.Freedom.2016)
## Source: local data frame [6 x 4]
## Groups: COUNTRY [6]
##
## COUNTRY Happiness_Rank_Year Rank FREEDOM2016
## <chr> <chr> <int> <dbl>
## 1 Denmark HAPPINESS_RANK2016 1 0.57941
## 2 Switzerland HAPPINESS_RANK2016 2 0.58557
## 3 Iceland HAPPINESS_RANK2016 3 0.56624
## 4 Norway HAPPINESS_RANK2016 4 0.59609
## 5 Finland HAPPINESS_RANK2016 5 0.57104
## 6 Canada HAPPINESS_RANK2016 6 0.57370
Again, for the least happy countries.
Least.Happy.Free.2015 <- Happy.Freedom.2015 %>% arrange(desc(Rank))
Least.Happy.Free.2016 <- Happy.Freedom.2016 %>% arrange(desc(Rank))
head(Least.Happy.Free.2015)
## Source: local data frame [6 x 4]
## Groups: COUNTRY [6]
##
## COUNTRY Happiness_Rank_Year Rank FREEDOM2015
## <chr> <chr> <int> <dbl>
## 1 Togo HAPPINESS_RANK2015 158 0.36453
## 2 Burundi HAPPINESS_RANK2015 157 0.11850
## 3 Syria HAPPINESS_RANK2015 156 0.15684
## 4 Benin HAPPINESS_RANK2015 155 0.48450
## 5 Rwanda HAPPINESS_RANK2015 154 0.59201
## 6 Afghanistan HAPPINESS_RANK2015 153 0.23414
head(Least.Happy.Free.2016)
## Source: local data frame [6 x 4]
## Groups: COUNTRY [6]
##
## COUNTRY Happiness_Rank_Year Rank FREEDOM2016
## <chr> <chr> <int> <dbl>
## 1 Burundi HAPPINESS_RANK2016 157 0.04320
## 2 Syria HAPPINESS_RANK2016 156 0.06912
## 3 Togo HAPPINESS_RANK2016 155 0.34678
## 4 Afghanistan HAPPINESS_RANK2016 154 0.16430
## 5 Benin HAPPINESS_RANK2016 153 0.39747
## 6 Rwanda HAPPINESS_RANK2016 152 0.54320
If you notice, a lot of the unhappiest places suffer from low degrees of freedom and low GDP, and it's not surprising that countries like Syria, Afghanistan, Burundi, and Rwanda are on the list. These countries suffer from long-standing civil wars and ethnic cleansing. In a wartime environment, it's difficult to maintain any sense of an economy; when a country's citizens are continually in danger of death, productivity is difficult to sustain.
Let's take Burundi as an example. The country has 10.4 million people in an area of 27,816 sq km and has seen civil war between two factions, the Tutsi and the Hutu, since the 1960s. The average life expectancy of a Burundian citizen is about 50 years, and there has been a degree of media censorship as well, which helps explain its very low freedom score (0.04 in the 2016 data above).
Not surprisingly, the happiest countries demonstrate strong economies and high levels of freedom, and the top 6 are all democratic countries. Let's take Canada as an example: it has a universal health care system, overall low tuition costs for higher education, and no censorship, so people are free to speak their minds. It's not difficult to imagine how a person might thrive in such an environment.
Let's analyze the data and see whether these claims can be supported by the numbers.
Let’s start plotting some of this data so we can get a better visualization of what is going on. This is Happiness vs. GDP for the year 2015.
# Again, we will be using ggplot to help visualize our data.
# I will be plotting all of the countries by their happiness rank against the GDP
g <- ggplot(Happy.Economy, aes(x = COUNTRY, y = ECONOMY_GDP2015)) + geom_point(aes(color = factor(HAPPINESS_RANK2015))) + theme(axis.text.x = element_blank(), axis.ticks.x=element_blank()) + labs(title = "Happiness vs. GDP 2015")
g
Let’s look at it for the year 2016.
h <- ggplot(Happy.Economy, aes(x = COUNTRY, y = ECONOMY_GDP2016)) + geom_point(aes(color = factor(HAPPINESS_RANK2016))) + theme(axis.text.x = element_blank(), axis.ticks.x=element_blank()) + labs(title = "Happiness vs. GDP 2016")
h
Now, let’s see how Freedom holds up against happiness.
i <- ggplot(Happy.Freedom, aes(x = COUNTRY, y = FREEDOM2015)) + geom_point(aes(color = factor(HAPPINESS_RANK2015))) + theme(axis.text.x = element_blank(), axis.ticks.x=element_blank()) + labs(title = "Happiness vs. Freedom 2015")
i
And now for 2016.
j <- ggplot(Happy.Freedom, aes(x = COUNTRY, y = FREEDOM2016)) + geom_point(aes(color = factor(HAPPINESS_RANK2016))) + theme(axis.text.x = element_blank(), axis.ticks.x=element_blank()) + labs(title = "Happiness vs. Freedom 2016")
j
The visuals give you a general sense that a high degree of freedom and a high GDP go along with a country's overall happiness. However, these four plots are very busy and can be difficult to interpret. Instead, we will obtain point estimates (i.e. means) for the 30 happiest countries and the 30 unhappiest countries.
# Creating vectors of GDP values for the 30 happiest and 30 unhappiest countries in 2015 and 2016; point estimates (i.e. means) are calculated below, and the Freedom Index gets the same treatment further down.
Happy.30.GDP.2015 <- Happy.Economy.2015$ECONOMY_GDP2015[1:30]
UnHappy.30.GDP.2015 <- Least.Happy.2015$ECONOMY_GDP2015[1:30]
Happy.30.GDP.2016 <- Happy.Economy.2016$ECONOMY_GDP2016[1:30]
UnHappy.30.GDP.2016 <- Least.Happy.2016$ECONOMY_GDP2016[1:30]
Now, let's calculate some statistics for the GDP in 2015 and 2016.
paste("Top 30 Happiest Countriest GDP 2015")
## [1] "Top 30 Happiest Countriest GDP 2015"
describe(Happy.30.GDP.2015)
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 30 1.28 0.17 1.32 1.28 0.12 0.96 1.69 0.73 -0.02 -0.3
## se
## X1 0.03
paste("Top 30 Unhappiest Countriest GDP 2015")
## [1] "Top 30 Unhappiest Countriest GDP 2015"
describe(UnHappy.30.GDP.2015)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 30 0.44 0.3 0.33 0.42 0.2 0.02 1.06 1.04 0.6 -0.89 0.06
paste("Top 30 Happiest Countriest GDP 2016")
## [1] "Top 30 Happiest Countriest GDP 2016"
describe(Happy.30.GDP.2016)
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 30 1.37 0.17 1.43 1.38 0.14 1.03 1.7 0.67 -0.34 -0.81
## se
## X1 0.03
paste("Top 30 Unhappiest Countriest GDP 2016")
## [1] "Top 30 Unhappiest Countriest GDP 2016"
describe(UnHappy.30.GDP.2016)
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 30 0.47 0.29 0.39 0.44 0.25 0.07 1.16 1.09 0.85 -0.07
## se
## X1 0.05
We will perform the same calculations for the Freedom Index:
Happy.30.Freedom.2015 <- Happy.Freedom.2015$FREEDOM2015[1:30]
UnHappy.30.Freedom.2015 <- Least.Happy.Free.2015$FREEDOM2015[1:30]
Happy.30.Freedom.2016 <- Happy.Freedom.2016$FREEDOM2016[1:30]
UnHappy.30.Freedom.2016 <- Least.Happy.Free.2016$FREEDOM2016[1:30]
paste("Top 30 Happiest Countriest Freedom 2015")
## [1] "Top 30 Happiest Countriest Freedom 2015"
describe(Happy.30.Freedom.2015)
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 30 0.58 0.08 0.62 0.59 0.05 0.41 0.67 0.26 -0.77 -0.91
## se
## X1 0.01
paste("Top 30 Unhappiest Countriest Freedom 2015")
## [1] "Top 30 Unhappiest Countriest Freedom 2015"
describe(UnHappy.30.Freedom.2015)
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 30 0.37 0.14 0.38 0.37 0.13 0.1 0.66 0.56 -0.14 -0.7
## se
## X1 0.03
paste("Top 30 Happiest Countriest Freedom 2016")
## [1] "Top 30 Happiest Countriest Freedom 2016"
describe(Happy.30.Freedom.2016)
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 30 0.51 0.07 0.54 0.52 0.05 0.36 0.6 0.23 -0.83 -0.73
## se
## X1 0.01
paste("Top 30 Unhappiest Countriest Freedom 2016")
## [1] "Top 30 Unhappiest Countriest Freedom 2016"
describe(UnHappy.30.Freedom.2016)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 30 0.27 0.15 0.28 0.27 0.18 0 0.59 0.59 0.03 -0.82 0.03
Looking at the data, it appears quite apparent that the means of both the GDP and the Freedom Index are substantially higher for the happiest countries than for their unhappiest counterparts. Looking at the standard deviations and doing some quick calculations (sketched below) shows that these differences appear statistically significant, which points to a positive correlation between happiness and both GDP and freedom.
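Here is a sketch of the quick calculations alluded to above (not part of the original output): Welch two-sample t-tests comparing the 30 happiest and 30 unhappiest countries on the 2015 values.
# Sketch: formal comparison of the two groups' means for GDP and Freedom (2015)
t.test(Happy.30.GDP.2015, UnHappy.30.GDP.2015)
t.test(Happy.30.Freedom.2015, UnHappy.30.Freedom.2015)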
Just to get a better idea of how happiness varies across the different regions of the world, below is a map of the 2016 happiness rank.
Happy.map <- Happy2016 %>% select(COUNTRY, HAPPINESS_RANK2016)
data(countryExData)
map2 <- joinCountryData2Map(Happy.map, joinCode = "NAME", nameJoinColumn = "COUNTRY")
## 154 codes from your data successfully matched countries in the map
## 3 codes from your data failed to match with a country code in the map
## 89 codes from the map weren't represented in your data
par(mai=c(0,0,0.2,0),xaxs="i",yaxs="i")
mapCountryData(map2, nameColumnToPlot="HAPPINESS_RANK2016")