DATA 606 Data Project Proposal

Data Preparation

# load data

#install.packages("rio")
#install.packages("RCurl")
#install.packages("bitops")

library(rio)
library(RCurl)
library(bitops)

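# Download the 2015 report as text, then parse it with explicit column names.
# read.csv() converts the spaces and parentheses in these names to dots,
# so "Happiness Score" becomes Happiness.Score.
x <- getURL("https://raw.githubusercontent.com/excelsiordata/DATA606/master/2015.csv")
WHR2015 <- read.csv(text = x, header = TRUE, sep = ",", stringsAsFactors = FALSE,
                    col.names = c("Country", "Region", "Happiness Rank", "Happiness Score",
                                  "Standard Error", "Economy (GDP per Capita)", "Family",
                                  "Health (Life Expectancy)", "Freedom",
                                  "Trust (Government Corruption)", "Generosity",
                                  "Dystopia Residual"))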

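# Download and parse the 2016 report the same way.
# Assumption to verify: 2016.csv has the same 12 columns, in the same order, as 2015.csv.
# If its layout differs, these col.names will mislabel the data.
x2 <- getURL("https://raw.githubusercontent.com/excelsiordata/DATA606/master/2016.csv")
WHR2016 <- read.csv(text = x2, header = TRUE, sep = ",", stringsAsFactors = FALSE,
                    col.names = c("Country", "Region", "Happiness Rank", "Happiness Score",
                                  "Standard Error", "Economy (GDP per Capita)", "Family",
                                  "Health (Life Expectancy)", "Freedom",
                                  "Trust (Government Corruption)", "Generosity",
                                  "Dystopia Residual"))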
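
Because both files are read with a hand-written col.names vector, it may be worth confirming that each file's own header matches those names before overwriting it. As far as I understand read.csv(), a file with one more column than the names supplied has its first column used as row names, which shifts every remaining column one place to the left. The sketch below is one possible check, not part of the original analysis; header_matches() is a hypothetical helper, and the CSV text and names are made-up stand-ins for the downloaded files.

```r
# Hypothetical helper: TRUE only if the file's header row has exactly the
# names (same count, same order) that we intend to pass as col.names.
header_matches <- function(csv_text, intended) {
  # check.names = FALSE keeps the header exactly as written in the file.
  in_file <- names(read.csv(text = csv_text, nrows = 1, check.names = FALSE))
  identical(in_file, intended)
}

# Toy CSV text standing in for a downloaded report (values are made up).
csv_text <- "Country,Region,Happiness Score\nA,North,7.1\nB,South,5.2\n"
intended <- c("Country", "Region", "Happiness Score")

ok_same_layout <- header_matches(csv_text, intended)        # names agree
ok_missing_col <- header_matches(csv_text, intended[-3])    # one name short
```

In the real workflow the same call could be made with x and x2 in place of csv_text, stopping if either check returns FALSE.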

Research question

You should phrase your research question in a way that matches up with the scope of inference your dataset allows for.

How have the average happiness scores changed by region between 2015 and 2016?

Cases

What are the cases, and how many are there?

Each case is a country. The 2015 report contains 158 countries, and the 2016 report contains 156.
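
Since the two reports do not cover the same number of countries, a year-over-year comparison may need to account for countries that appear in only one file. A minimal sketch of that check follows, using made-up toy data frames (toy2015, toy2016) in place of WHR2015 and WHR2016.

```r
# Toy stand-ins for the two reports (country names are made up).
toy2015 <- data.frame(Country = c("A", "B", "C", "D"), stringsAsFactors = FALSE)
toy2016 <- data.frame(Country = c("A", "B", "E"), stringsAsFactors = FALSE)

# Number of cases (rows) in each year.
n_cases <- c("2015" = nrow(toy2015), "2016" = nrow(toy2016))

# Countries present in one report but not the other, and those in both.
only_2015 <- setdiff(toy2015$Country, toy2016$Country)
only_2016 <- setdiff(toy2016$Country, toy2015$Country)
in_both   <- intersect(toy2015$Country, toy2016$Country)
```

Restricting the comparison to in_both would keep regional changes from being driven by countries entering or leaving the report, though that is a design choice rather than a requirement.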

Data collection

Describe the method of data collection.

The World Happiness Report is built from Gallup World Poll data, which Gallup collects through surveys conducted worldwide, either face to face or by phone.

Type of study

What type of study is this (observational/experiment)?

Because the data were collected through a survey and no variables were manipulated, this is an observational study.

Data Source

If you collected the data, state self-collected. If not, provide a citation/link.

The World Happiness Report data can be found here: https://www.kaggle.com/unsdsn/world-happiness

Response

What is the response variable, and what type is it (numerical/categorical)?

Happiness score is the response variable, and it is numerical.

Explanatory

What is the explanatory variable, and what type is it (numerical/categorical)?

Region is the explanatory variable, and it is categorical. Scores are compared within each region across the two report years (2015 and 2016).

Relevant summary statistics

Provide summary statistics relevant to your research question. For example, if you’re comparing means across groups provide means, SDs, sample sizes of each group. This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.

summary(WHR2015$Happiness.Score)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.839   4.526   5.232   5.376   6.244   7.587

mean(WHR2015$Happiness.Score)

## [1] 5.375734

var(WHR2015$Happiness.Score)

## [1] 1.311048

median(WHR2015$Happiness.Score)

## [1] 5.2325

sd(WHR2015$Happiness.Score)

## [1] 1.14501

# A histogram shows how often each range of scores occurs
hist(WHR2015$Happiness.Score, main = "2015 Happiness Score by Frequency", xlab = "Happiness Score", ylab = "Frequency")

summary(WHR2016$Happiness.Score)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.732   4.327   5.237   5.282   6.154   7.460

mean(WHR2016$Happiness.Score)

## [1] 5.282395

var(WHR2016$Happiness.Score)

## [1] 1.318002

median(WHR2016$Happiness.Score)

## [1] 5.237

sd(WHR2016$Happiness.Score)

## [1] 1.148043

# A histogram shows how often each range of scores occurs
hist(WHR2016$Happiness.Score, main = "2016 Happiness Score by Frequency", xlab = "Happiness Score", ylab = "Frequency")

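Overall, the mean happiness score across all countries fell from 5.376 in 2015 to 5.282 in 2016, while the median barely moved (5.2325 to 5.237). These figures pool every country, so they do not yet show how individual regions changed. I’ll take a deeper dive into the region-specific data during the completion of the project itself.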
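
Since the research question compares means across regions, the region-level analysis will likely need the mean, SD, and sample size of each region in each year. The following is a minimal sketch of how that table might be built, not the final analysis: region_summary() is a hypothetical helper, and toy2015 and toy2016 are made-up stand-ins for WHR2015 and WHR2016.

```r
# Toy stand-ins for the two reports (regions and scores are made up).
toy2015 <- data.frame(
  Region = c("North", "North", "South", "South"),
  Happiness.Score = c(7.0, 6.0, 4.0, 5.0)
)
toy2016 <- data.frame(
  Region = c("North", "North", "South", "South"),
  Happiness.Score = c(7.2, 6.2, 3.8, 4.6)
)

# Hypothetical helper: mean, SD, and sample size of the score per region.
# tapply() returns groups in sorted order, matching sort(unique(...)).
region_summary <- function(df) {
  data.frame(
    Region = sort(unique(df$Region)),
    mean = as.numeric(tapply(df$Happiness.Score, df$Region, mean)),
    sd = as.numeric(tapply(df$Happiness.Score, df$Region, sd)),
    n = as.numeric(tapply(df$Happiness.Score, df$Region, length))
  )
}

# Put the two years side by side and compute the change in regional means.
by_region <- merge(region_summary(toy2015), region_summary(toy2016),
                   by = "Region", suffixes = c(".2015", ".2016"))
by_region$change <- by_region$mean.2016 - by_region$mean.2015
```

With the real data, the same two calls would take WHR2015 and WHR2016, and the change column would give a first answer to the research question for each region.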