RPubs and Blog links information

RPubs link: http://rpubs.com/beancounter/433639
Blog link for the googleVis chart: http://jasonperez.learningnomad.com/2018/10/motion-and-geo-chart-of-happiness-and.html (Note: the motion chart will only work with the Internet Explorer browser)

Introduction

The purpose of this study is simple: to determine whether happiness correlates with cancer incidence on a global scale. We live in a world where health is often compromised as we get busy with many things on both a professional and a personal level.
Emotions affect every person’s actions in many ways. In fact, they have a direct effect on our behaviour, our lifestyle and, most importantly, our health.
Interestingly, studies suggest that positive emotions can lengthen our lives and even make us healthier.

Introduction Cont.

The question of interest is whether one’s happiness correlates with getting cancer. Most people would agree that being happy can make you a healthier individual.
Several studies suggest this could be true, but does it hold on a global scale?
In the latest World Happiness Report, based on the Gallup World Poll\(^1\), Finland took the top spot, with several other European countries as well as Australia and New Zealand also in the top ten happiest countries.
Surprisingly, the happiest countries also have the highest cancer incidences. Let us see whether statistics can provide some insight into whether happiness correlates with getting cancer.

Problem Statement

Does happiness correlate with the risk of getting cancer? Is a happy person a cancer-free person? These are the questions that motivated this study.
The goal of the study is to determine whether there is a relationship between one of our positive emotions, “happiness”, and one of society’s major health issues, “cancer”.
Descriptive statistics and several visualisations will provide insight into whether the two are correlated.
Statistical methods and analysis will then show how well these two variables fit a linear model.

Data

The Happiness Rating per Country data is taken from the “Our World in Data” website\(^2\), which originally extracted it from the World Happiness Report, based on the Gallup World Poll\(^1\). The report covers the 5 years from 2012 to 2016.
The Cancer Incidences per Country data is sourced from the “Our World in Data” website\(^3\), which originally sourced it from the Global Health Data Exchange. The cancer data per country includes all cancer cases globally.
Both datasets were downloaded from the website, saved in CSV format and imported into R.
Data preprocessing played a big role in getting, understanding, tidying and manipulating the data. It was also used to scan for missing values, transform the variables and prepare the data for modelling and analysis.
121 countries that have both cancer incidence and happiness rating data were sampled.

Data Cont.

Happiness Data

The dataset includes the name of the Country and the Year. Only the years from 2012 to 2016 are considered.
Happiness Rating is one of the two main variables and reflects the country’s overall happiness rating. The rating takes into account several metrics such as Economy (GDP), Life expectancy, Generosity, Social Support and Freedom.

Cancer Incidence Data

The dataset includes the Country, the Year and the number of Cancer Cases.
Cancer Cases covers all people diagnosed with any type of cancer, for both genders and all ages.

Data Preprocessing

Some variables were renamed, countries with missing values were dropped, and the two datasets were joined into a single data frame used for reporting. The two most important variables, Cancer Cases and Happiness Rating, were transformed using the \(\log\) function to obtain normalised/scaled numeric figures.
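The preprocessing steps described above can be sketched as follows. This is a minimal illustration in base R, not the original script: the two toy data frames, their raw column names (`Entity`, `Score`, `Count`) and all values are made up, and only the final column names follow the ones used elsewhere in this report.

```r
# Toy stand-ins for the two CSV imports; all names and values are made up.
happiness_raw <- data.frame(
  Entity = c("A", "A", "B", "B", "C"),
  Year   = c(2015, 2016, 2015, 2016, 2016),
  Score  = c(7.1, 7.2, 4.9, NA, 5.5)
)
cancer_raw <- data.frame(
  Entity = c("A", "A", "B", "B", "D"),
  Year   = c(2015, 2016, 2015, 2016, 2016),
  Count  = c(120000, 125000, 30000, 31000, 8000)
)

# Rename columns to the names used in the rest of the analysis.
names(happiness_raw) <- c("Country", "Year", "Happiness_Rating")
names(cancer_raw)    <- c("Country", "Year", "Cases")

# Inner join: keeps only Country/Year pairs present in both tables.
joined <- merge(happiness_raw, cancer_raw, by = c("Country", "Year"))

# Drop every country that has a missing value in any year.
incomplete <- unique(joined$Country[!complete.cases(joined)])
toy <- joined[!joined$Country %in% incomplete, ]

# Log-transform the two analysis variables.
toy$log_cases     <- log(toy$Cases)
toy$log_happiness <- log(toy$Happiness_Rating)
print(toy)
```

In this toy example only country "A" survives: "C" and "D" appear in one table only, and "B" is dropped entirely because one of its years has a missing rating.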

Descriptive Statistics

The average number of cancer cases per country-year observation is 313,177 (about 11 on the log scale). Globally, an estimated 42 million people suffered from some form of cancer, which is quite alarming.\(^3\) The Happiness Rating per country averaged 5.43 (about 1.7 on the log scale). Only countries with both types of data available are included. Countries with missing data for any year were dropped completely.
Detected outliers have been retained. There is not enough evidence that they are errors, and the data from the Gallup World Poll and the Global Health Data Exchange are assumed to be reliable enough to use in the analysis.

format(summary(cancer_happiness$Cases), big.mark = ",", trim = TRUE)

##        Min.     1st Qu.      Median        Mean     3rd Qu.        Max. 
##     "1,990"    "17,524"    "41,510"   "313,177"   "184,899" "8,359,517"

round(summary(cancer_happiness$Happiness_Rating),2)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.90    4.56    5.34    5.43    6.29    7.66

Visualisation

The study relies on several assumptions: linearity of the model, normality, homoscedasticity and independence.
Normality means that the data are normally distributed. Both variables were transformed to bring them closer to a normal distribution.
Although it is hard to obtain a perfectly symmetrical normal distribution even after trying several transformation methods, the histograms below suggest that the transformed data are approximately normally distributed.

library(ggplot2)
library(gridExtra)

# Histograms of the log-transformed variables
hist_cancer <- ggplot(cancer_happiness, aes(x = log(Cases))) +
  geom_histogram(bins = 30, fill = "red", color = "blue") +
  xlab("log(Number of Cancer Cases)") +
  ggtitle("Histogram of Cancer Cases")
hist_happiness <- ggplot(cancer_happiness, aes(x = log(Happiness_Rating))) +
  geom_histogram(bins = 30, fill = "green", color = "blue") +
  xlab("log(Happiness Rating)") +
  ggtitle("Histogram of Happiness Rating")
grid.arrange(hist_cancer, hist_happiness, ncol = 2)
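A visual check of normality can be complemented by a formal test. The sketch below uses the Shapiro-Wilk test from base R on simulated, right-skewed counts; `fake_cases` is a made-up stand-in and not the study data, so the p-values say nothing about the real dataset.

```r
set.seed(1)
# Simulated right-skewed counts standing in for the cancer case numbers.
fake_cases <- rlnorm(200, meanlog = 10.5, sdlog = 1.5)

# Shapiro-Wilk test: a small p-value is evidence against normality.
p_raw <- shapiro.test(fake_cases)$p.value
p_log <- shapiro.test(log(fake_cases))$p.value

cat("p-value, raw counts:", signif(p_raw, 3), "\n")
cat("p-value, log counts:", signif(p_log, 3), "\n")
```

On data like this the raw counts are strongly rejected as normal, while the log-transformed values are much closer to a normal distribution. A Q-Q plot via `qqnorm()` and `qqline()` would give the same picture graphically.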

Visualisation Cont.

The plot below shows the relationship between happiness and cancer incidence for the sampled countries, by year.

# Line plot per year; axis labels match the mapped variables
ggplot(cancer_happiness, aes(x = log(Cases), y = log(Happiness_Rating), colour = factor(Year))) +
  geom_line() +
  labs(x = "log(Cancer Cases)", y = "log(Happiness Rating)", colour = "Year") +
  ggtitle("Plot of Happiness and Cancer Cases")

Visualisation Cont.

A scatter plot of happiness and cancer cases for the sampled countries, by year, is shown below.

It shows a weak positive linear relationship between the two variables.

# Scatter plot of the log-transformed variables with a smoothed trend per year
ggplot(cancer_happiness, aes(x = log(Happiness_Rating), y = log(Cases), colour = factor(Year))) +
  geom_jitter(alpha = 1) +
  geom_smooth(lwd = 0.1, alpha = 0.1) +
  ggtitle("Scatter Plot of Happiness and Cancer Cases") +
  labs(x = "log(Happiness Rating)", y = "log(Cancer Cases)", colour = "Year")

Visualisation Cont.

The interactive visualisation is available at this link: http://jasonperez.learningnomad.com/2018/10/motion-and-geo-chart-of-happiness-and.html (the charts will only work with the Internet Explorer browser)
Built with the googleVis package, it presents a geographical map of the happiness rating per country, a table of the number of cancer cases, and a motion chart of the movement of happiness and cancer incidence per year per country. The size of each bubble shows the number of people who have been diagnosed with cancer.

library(dplyr)
library(googleVis)

# Geo chart of the happiness rating per country
data_geomap <- gvisGeoChart(cancer_happiness, "Country", "Happiness_Rating", options = list(width = 200, height = 150))
# Table of the 2016 figures
cancer_happiness2 <- cancer_happiness %>% filter(Year == 2016)
cancer_happiness2$Year <- as.factor(cancer_happiness2$Year)
# Log-transformed variables for the motion chart axes
cancer_happiness_normal <- cancer_happiness %>% mutate(Happiness_norm = log(Happiness_Rating), Cancer_norm = log(Cases))
data_table <- gvisTable(cancer_happiness2, options = list(width = 200, height = 270))
# Motion chart; bubble size is the raw number of cases
data_motion <- gvisMotionChart(cancer_happiness_normal, idvar = "Country", timevar = "Year", xvar = "Happiness_norm", yvar = "Cancer_norm", sizevar = "Cases")
# Stack map and table vertically, then place them next to the motion chart
map_table <- gvisMerge(data_geomap, data_table, horizontal = FALSE)
map_table_motion <- gvisMerge(data_motion, map_table, horizontal = TRUE, tableOptions = "bgcolor=\"#CCCCCC\" cellspacing = 10")
plot(map_table_motion)

Hypothesis Testing

The research question is simple. We want to test whether there is a correlation between a country’s happiness and its cancer incidence, and whether the data fit a linear regression model. The hypotheses are shown below:

\(H_0:\) The Country’s Happiness and Cancer Cases data do not fit the linear regression model (the regression slope is zero, \(\beta_1 = 0\))

\(H_A:\) The Country’s Happiness and Cancer Cases data fit the linear regression model (the regression slope is not zero, \(\beta_1 \neq 0\))

Assumptions made:

Linearity of the model - a linear relationship is assumed to be present
Normality - addressed by using the transformed/scaled variables
Independence - assumed to hold given how the providers conducted the survey
Homoscedasticity - assumed for the model
Significance level is set at \(\alpha = 0.05\)
Causality is outside the scope of the study, so no claim is made that a change in the independent variable causes changes in the dependent variable.
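The normality and homoscedasticity assumptions listed above could be checked on the model residuals rather than assumed. The following is a minimal sketch on simulated data, not the study data: the variable names and coefficients are made up, and the Breusch-Pagan statistic is computed by hand in its studentised form (sample size times the R-squared of an auxiliary regression) so that no extra package is needed.

```r
set.seed(42)
n <- 300
# Simulated stand-ins for the log-transformed variables (not the study data).
log_happiness <- rnorm(n, mean = 1.7, sd = 0.2)
log_cases     <- 9 + 1.2 * log_happiness + rnorm(n, sd = 1.5)

fit <- lm(log_cases ~ log_happiness)
res <- residuals(fit)

# Normality of the residuals: Shapiro-Wilk test.
shapiro_p <- shapiro.test(res)$p.value

# Homoscedasticity: studentised Breusch-Pagan test computed by hand.
# Regress the squared residuals on the predictor; under constant variance,
# n * R^2 of this auxiliary regression is approximately chi-squared with 1 df.
aux  <- lm(res^2 ~ log_happiness)
bp   <- n * summary(aux)$r.squared
bp_p <- pchisq(bp, df = 1, lower.tail = FALSE)

cat("Shapiro-Wilk p-value:", round(shapiro_p, 3), "\n")
cat("Breusch-Pagan statistic:", round(bp, 3), "p-value:", round(bp_p, 3), "\n")
```

Small p-values in either test would indicate that the corresponding assumption is doubtful for the fitted model.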

Interpretation

The model summary from the \(lm\) function, fitted on the untransformed variables, is shown below:

results <- (lm(Cases~Happiness_Rating,data=cancer_happiness))
results %>% summary()

## 
## Call:
## lm(formula = Cases ~ Happiness_Rating, data = cancer_happiness)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -566268 -314423 -167451  -39867 8062280 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -478820     191408  -2.502   0.0126 *  
## Happiness_Rating   145875      34534   4.224 2.77e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 947100 on 603 degrees of freedom
## Multiple R-squared:  0.02874,    Adjusted R-squared:  0.02713 
## F-statistic: 17.84 on 1 and 603 DF,  p-value: 2.77e-05

Interpretation Cont.

The Pearson correlation coefficient results are shown below:

cor.test(cancer_happiness$Cases, cancer_happiness$Happiness_Rating, method = "pearson")

## 
##  Pearson's product-moment correlation
## 
## data:  cancer_happiness$Cases and cancer_happiness$Happiness_Rating
## t = 4.2241, df = 603, p-value = 2.77e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.09104636 0.24591814
## sample estimates:
##       cor 
## 0.1695287

Interpretation Cont.

p value = 2.77e-05
r = 0.17
There is a weak positive linear relationship between happiness and cancer incidence
Since the p value of 2.77e-05 is less than the significance level of 0.05, we can conclude that there is a statistically significant relationship between happiness and cancer incidence
The decision is to reject the null hypothesis \(H_0\): the slope is significantly different from zero, although with \(R^2 \approx 0.029\) the model explains only about 2.9% of the variation in cancer cases
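The regression and correlation outputs are two views of the same test, which a few lines of arithmetic make explicit. The sketch below starts only from the reported correlation and degrees of freedom and recomputes the other quantities with the standard formulas for a simple linear regression; the confidence interval uses the Fisher z-transformation.

```r
r  <- 0.1695287   # Pearson correlation reported by cor.test
df <- 603         # degrees of freedom (n - 2)
n  <- df + 2      # number of country-year observations

r_squared <- r^2                        # R-squared of the simple regression
t_stat    <- r * sqrt(df / (1 - r^2))   # t statistic for H0: correlation = 0
p_value   <- 2 * pt(-abs(t_stat), df)   # two-sided p-value

# 95% confidence interval via the Fisher z-transformation
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

cat("R-squared:", round(r_squared, 5), "\n")
cat("t statistic:", round(t_stat, 4), "\n")
cat("p-value:", signif(p_value, 3), "\n")
cat("95% CI:", round(ci, 4), "\n")
```

Up to rounding, these should reproduce the Multiple R-squared, t value, p-value and confidence interval shown in the outputs above, which is why the regression slope test and the Pearson test lead to the same decision.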

Discussion

There is a correlation between happiness and cancer incidence, but there are confounding factors that affect both variables.
Richer countries have high happiness ratings, and the same countries have high rates of cancer cases. One possible explanation is that people in richer countries spend freely while overlooking and compromising their health.
These rich countries also have rapidly ageing populations. Older people are at higher risk of getting cancer, which partly explains the high rate of cancer cases in highly developed countries.
These countries tend to have free access to a health care system and other benefits, so people may enjoy life without thinking about the repercussions for their health.
The study does not establish a causal relationship between happiness and cancer incidence (this is outside the scope of the research), although it does provide evidence of an association.
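The effect of a confounder can be illustrated with a small simulation. In the sketch below everything is made up: `wealth` is a hypothetical index that drives both happiness and cancer cases, while happiness has no effect on cancer cases by construction. It is meant only to show the mechanism, not to estimate anything about the real data.

```r
set.seed(123)
n <- 500
# Hypothetical confounder (think of a wealth index); purely simulated.
wealth    <- rnorm(n)
happiness <- 5.4 + 0.8 * wealth + rnorm(n, sd = 0.6)
# By construction, cancer cases depend on wealth only, not on happiness.
log_cases <- 11 + 1.0 * wealth + rnorm(n, sd = 1)

naive    <- lm(log_cases ~ happiness)            # ignores the confounder
adjusted <- lm(log_cases ~ happiness + wealth)   # controls for it

b_naive    <- unname(coef(naive)["happiness"])
b_adjusted <- unname(coef(adjusted)["happiness"])

cat("Happiness slope, naive model:   ", round(b_naive, 3), "\n")
cat("Happiness slope, adjusted model:", round(b_adjusted, 3), "\n")
```

The naive model shows a clear positive slope for happiness even though none exists in the simulation, and the slope shrinks towards zero once the confounder is included.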

Proposed Action for Future Investigation

Since only two variables are used in the study, we could add one or two variables that have a direct or indirect impact on happiness and cancer cases, e.g., GDP, weather, age, access to healthcare, etc. This would provide additional insight and strengthen our conclusion.
Other statistical analyses, including non-parametric tests, could be used to further validate the conclusion. If more variables are added, the Chi-square test of association could also be used to highlight the association between the variables.
We could cluster the countries by region, e.g., Asian region and European region, and narrow the type of cancer by removing skin cancer or focusing on only one type of cancer. This would make the analysis more focused and specific.
To further strengthen the analysis, we could add more years on top of the existing 5 years so as to see the trend and analyse accordingly.
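As one example of the non-parametric tests mentioned above, Spearman's rank correlation could be run alongside the Pearson test. The sketch below uses simulated, skewed data (the variables and coefficients are made up, not the study data) and also shows that Spearman's rho does not depend on whether the case counts are log-transformed.

```r
set.seed(7)
n <- 200
# Simulated, skewed stand-ins for the two variables (not the study data).
happiness <- runif(n, min = 3, max = 7.5)
cases     <- exp(9 + 0.4 * happiness + rnorm(n, sd = 1.5))

pearson  <- cor.test(cases, happiness, method = "pearson")
spearman <- cor.test(cases, happiness, method = "spearman", exact = FALSE)

# Spearman's rho is based on ranks, so a monotone transformation such as
# log() leaves it unchanged.
rho_raw <- cor(cases, happiness, method = "spearman")
rho_log <- cor(log(cases), happiness, method = "spearman")

cat("Pearson r:   ", round(unname(pearson$estimate), 3), "\n")
cat("Spearman rho:", round(unname(spearman$estimate), 3), "\n")
cat("rho on raw and log scale equal:", isTRUE(all.equal(rho_raw, rho_log)), "\n")
```

Because it works on ranks, this test is less sensitive to the extreme case counts that were kept in the data as outliers.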

Final Conclusion

Although there is a weak positive linear correlation between happiness and cancer cases, there are confounding factors that affect both variables.
Based on the statistical findings from the available data, the happier a country is, the slightly more likely it is to have a higher incidence of cancer. This holds for the majority of rich and highly developed countries, such as Australia and many European countries.
In this world full of uncertainty, stress and noise, what we can do is keep our lives healthy and happy so as to reduce the risk of getting cancer. Worrying too much on this dynamic and constantly changing planet is the last thing we should do. Live life to the fullest, eat healthily and chill out.
The message of this study is very basic: life is too short not to enjoy, but you have to look after yourself. After all, we live only once, so be happy inside and out!

Does being happy make you cancer-free?

Exploring Cancer Cases and Happiness Globally

References