Final for IS 607

Edwige Talla Badjio
December 17, 2015

Introduction

Obesity and Associated Health Risk Factors

Obesity, aesthetics

risks

Class, end of semester, tools and techniques learned

Objective

  • understand available facts and relevant data

    • Body Mass Index (BMI): Measure of body fat
    • Cholesterol, Blood pressure: Health test results
  • explore the correlations between those variables

  • perform a sentiment analysis

  • analysis: three different data sources

Data Sources

  1. World Health Organization, WHO - Health Risk factors
    • 6 Excel files (BMI, Blood pressure, Cholesterol), Males, Females
    • Long format 199 rows: Country, years (1980 to 2008)
    • Standard means
  2. Centers for Disease Control and Prevention, CDC - NHANES dataset
    • survey data - available in R library NHANES (NHANESraw)
    • 78 attributes, 20293 observations
  3. Social Media data, Twitter

WHO data (1)

Initial format

[Figure: initial]

WHO data (2)

  • Format: wide to long
  • Merge all files for males and females
  • add gender variable
  • add BMI classification ('Underweight', 'Healthy', 'Overweight', 'Obese')
  • add Blood pressure classification ('Normal', 'Pre-HBP', 'Stage1', 'Stage2')
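
The reshaping and BMI labelling steps above can be sketched as follows. This is a minimal illustration in plain Python rather than the R used in the project; the sample rows, the column names, and the helper names `wide_to_long` and `classify_bmi` are hypothetical. The cutoffs are the standard WHO adult BMI categories (18.5, 25, 30), which the slides do not state explicitly, so treat them as an assumption about how the labels were assigned.

```python
# Hypothetical wide-format rows: one row per country, one column per year.
# The values are made up for illustration, not taken from the WHO files.
wide_rows = [
    {"Country": "A", "1980": 24.1, "2008": 26.3},
    {"Country": "B", "1980": 17.9, "2008": 30.2},
]


def classify_bmi(bmi):
    """Label a BMI value using the standard WHO adult cutoffs (assumed)."""
    if bmi < 18.5:
        return "Underweight"
    if bmi < 25:
        return "Healthy"
    if bmi < 30:
        return "Overweight"
    return "Obese"


def wide_to_long(rows, gender):
    """Turn each (country, year) cell into its own record and tag it with gender."""
    long_rows = []
    for row in rows:
        for key, value in row.items():
            if key == "Country":
                continue  # every other column is assumed to be a year
            long_rows.append({
                "Country": row["Country"],
                "Year": int(key),
                "Gender": gender,
                "BMI": value,
                "BMI_class": classify_bmi(value),
            })
    return long_rows


# The male and female files would each be reshaped, then concatenated.
long_males = wide_to_long(wide_rows, "Male")
```

Adding the gender column during the reshape keeps the later merge a simple concatenation of the male and female tables.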

[Figure: final]

WHO data (3)

[Figure: final]

WHO data (4)

[Figure: final]

Data Sources

  1. World Health Organization, WHO - Health Risk factors
    • 6 Excel files (BMI, Blood pressure, Cholesterol), Males, Females
    • Long format 199 rows: Country, years (1980 to 2008)
    • Standard means
  2. Centers for Disease Control and Prevention, CDC - NHANES dataset
    • survey data - available in R library NHANES (NHANESraw)
    • 78 attributes, 20293 observations
  3. Social Media data, Twitter

CDC data (1)

  • R package NHANES
  • Subsetting the initial dataset, kept 14 variables from 78

  • Create a factor variable for BMI classification

  • Create a factor variable for Blood Pressure classification based on two variables (Diastolic and Systolic blood pressure)

  • Missing values (work with complete cases)

  • Survey Data (library psych)
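
A blood pressure factor built from two variables, plus the complete-case filter, might look like the sketch below. It is written in plain Python for illustration, not in the R used for the project. The cutoffs follow the JNC 7 categories (120/80, 140/90, 160/100), which is an assumption since the slides only give the labels; the field names `systolic` and `diastolic` and both helper names are hypothetical.

```python
def classify_bp(systolic, diastolic):
    """Assign a blood pressure class from both readings (JNC 7 cutoffs, assumed).

    The more severe of the two readings decides the class, so the checks
    run from the highest category down.
    """
    if systolic >= 160 or diastolic >= 100:
        return "Stage2"
    if systolic >= 140 or diastolic >= 90:
        return "Stage1"
    if systolic >= 120 or diastolic >= 80:
        return "Pre-HBP"
    return "Normal"


def complete_cases(records, fields):
    """Keep only the records where none of the listed fields is missing."""
    return [r for r in records if all(r.get(f) is not None for f in fields)]


# Hypothetical records; None marks a missing measurement.
records = [
    {"systolic": 118, "diastolic": 76},
    {"systolic": 135, "diastolic": None},
    {"systolic": 150, "diastolic": 85},
]
usable = complete_cases(records, ["systolic", "diastolic"])
labels = [classify_bp(r["systolic"], r["diastolic"]) for r in usable]
```

Filtering before classifying avoids comparing a missing reading against a threshold.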

CDC data (2)

Predicting the probability of being diagnosed with hypertension based on age, BMI, diabetes, and weight

[Figure: analysis1]
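
For readers unfamiliar with how a logistic regression turns predictors into a probability, here is a minimal sketch in plain Python. The coefficients are hypothetical placeholders chosen only to make the example run; they are not the fitted values from this analysis, and the function name `hypertension_probability` is an assumption.

```python
import math

# Hypothetical coefficients for illustration only; they are NOT the
# fitted values from the presentation's model.
COEFFICIENTS = {
    "intercept": -6.0,
    "age": 0.05,       # per year
    "bmi": 0.08,       # per BMI unit
    "diabetes": 0.7,   # 1 if diagnosed with diabetes, else 0
    "weight": 0.005,   # per kilogram
}


def hypertension_probability(age, bmi, diabetes, weight, coef=COEFFICIENTS):
    """Apply the logistic function to the linear combination of predictors."""
    linear = (
        coef["intercept"]
        + coef["age"] * age
        + coef["bmi"] * bmi
        + coef["diabetes"] * diabetes
        + coef["weight"] * weight
    )
    return 1.0 / (1.0 + math.exp(-linear))


# Same person at two ages: with a positive age coefficient the
# predicted probability rises with age.
p_young = hypertension_probability(age=30, bmi=27, diabetes=0, weight=80)
p_old = hypertension_probability(age=60, bmi=27, diabetes=0, weight=80)
```

The output always lies strictly between 0 and 1, which is what makes the model suitable for a yes/no diagnosis outcome.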

CDC data (3)

Is BMI a good predictor of hypertension and hyperlipidemia?

[Figure: analysis2]

Data Sources

  1. World Health Organization, WHO - Health Risk factors
    • 6 Excel files (BMI, Blood pressure, Cholesterol), Males, Females
    • Long format 199 rows: Country, years (1980 to 2008)
    • Standard means
  2. Centers for Disease Control and Prevention, CDC - NHANES dataset
    • survey data - available in R library NHANES (NHANESraw)
    • 78 attributes, 20293 observations
  3. Social Media data, Twitter

Twitter data

  • script for data scraping from Twitter API
    • Searches related to: obesity, overweight, body mass index, body fat, anti-obesity drug, appetite, weight control, abdominal obesity.
  • data preprocessing, text file
  • upload to GitHub

  • RWeka (NGramTokenizer, term document matrix for 2-grams)
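
The idea behind a term document matrix for 2-grams can be sketched without RWeka. The plain Python below is an illustration of the technique, not the project's code: the tokenizer is a simple lowercase word split (RWeka's NGramTokenizer may split text differently), and the sample tweets and function names are hypothetical.

```python
import re
from collections import Counter


def bigrams(text):
    """Split text into lowercase word tokens and pair each with its successor."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


def term_document_matrix(documents):
    """Map each bigram to its count in every document (one column per document)."""
    per_doc = [Counter(bigrams(doc)) for doc in documents]
    terms = sorted(set().union(*per_doc))
    # A Counter returns 0 for a bigram that a document does not contain.
    return {term: [counts[term] for counts in per_doc] for term in terms}


# Hypothetical tweets, made up for illustration.
tweets = [
    "Body mass index is a measure of body fat",
    "Weight control and body fat",
]
matrix = term_document_matrix(tweets)
```

Here `matrix["body fat"]` holds one count per tweet, so frequent two-word phrases across the collection can be found by summing each row.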

Conclusion (1)

  • Data: BMI is increasing worldwide
  • multivariate logistic regression analysis
    • BMI is significantly associated with blood pressure and cholesterol level
  • Unfinished - sentiment analysis

Discovery: Survey data

Conclusion (2)

  • Challenges

    • First steps of the project: defining the hypothesis, getting the data, getting some APIs to work (Facebook), data access (restricted for fitness trackers)
    • Coding: downloading the Excel files required download.file with mode="wb"; the RWeka function NGramTokenizer needed options(mc.cores=1)
  • New packages: psych (survey), NHANES (CDC data), RWeka (data mining), GGally, party, partykit (visualization)

  • Perspectives: finish the sentiment analysis; work with data from other sources such as the news; text mining of diet success stories; library(GGally) for parallel coordinate plots