Final

Introduction
Data Sources
Loading Packages
Some Functions
Obesity in the world (data from WHO)
Analysing Data from CDC
Data from Social Media (Twitter)
- Exploring the Document Term Matrix
Conclusion

Introduction

In this project we are interested in Obesity and health risk factors associated to it, which are huge. Knowing that obesity is not all about aesthetics and given all the tools and techniques we have learned in class, we will like to learn more about that topic. Our objective is to understand avaiable facts and data relevant to the topic. BMI (body mass index), Cholesterol, and hypertension are some of the variables used in risk factors datasets. We will spent some time exploring the correlations between those variables and see for example if BMI is a good predictor of hypertension and hyperlipidia. Our analysis will be based on tree differents data sources.

Data Sources

World Health Organisation, WHO - Health Risk factors
Center for Disease control and Prevention, CDC - NHANES dataset
Social Media data

Loading Packages

library(NHANES)
library(RWeka)

## Warning: package 'RWeka' was built under R version 3.2.3

library(party)

## Warning: package 'party' was built under R version 3.2.3

## Loading required package: grid
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange

## Warning: package 'strucchange' was built under R version 3.2.3

## Loading required package: zoo
## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## Loading required package: sandwich

library(partykit)

## Warning: package 'partykit' was built under R version 3.2.3

## 
## Attaching package: 'partykit'
## 
## The following objects are masked from 'package:party':
## 
##     cforest, ctree, ctree_control, edge_simple, mob, mob_control,
##     node_barplot, node_bivplot, node_boxplot, node_inner,
##     node_surv, node_terminal

library(tm)

## Loading required package: NLP

library(SnowballC)  
library(stringr)

## 
## Attaching package: 'stringr'
## 
## The following object is masked from 'package:strucchange':
## 
##     boundary

library(RTextTools)

## Loading required package: SparseM
## 
## Attaching package: 'SparseM'
## 
## The following object is masked from 'package:base':
## 
##     backsolve
## 
## 
## Attaching package: 'RTextTools'
## 
## The following objects are masked from 'package:SnowballC':
## 
##     getStemLanguages, wordStem

library(tidyr)
library(dplyr)

## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)

## 
## Attaching package: 'ggplot2'
## 
## The following object is masked from 'package:NLP':
## 
##     annotate

library(GGally)

## 
## Attaching package: 'GGally'
## 
## The following object is masked from 'package:dplyr':
## 
##     nasa

Some Functions

#The following function will take care of the rows with NAs
delete.na <- function(dframe, n=0) {
  part <- apply(dframe, 2, is.na)
  partindex <- apply(part, 1, function(x) sum(x) <= n)
  dframe[partindex, ]
}
#The function below will be used for social media text mining
preprocessing <- function(tweets, nbgrams)
{

#Creating the coupus from the directory files
the_corpus <- Corpus(VectorSource(tweets))

#Removing  punctuation
the_corpus <- tm_map(the_corpus, removePunctuation) 

#Removing numbers
the_corpus <- tm_map(the_corpus, removeNumbers)
the_corpus <- tm_map(the_corpus, stemDocument) 

#Removing  words that usually have no analytic value
the_corpus <- tm_map(the_corpus, removeWords, stopwords("english")) 

#Removing  white spaces
the_corpus <- tm_map(the_corpus, stripWhitespace) 
the_corpus <- tm_map(the_corpus, content_transformer(tolower))
the_corpus <- tm_map(the_corpus, PlainTextDocument)

options(mc.cores=1) 
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = nbgrams, max = nbgrams)) # create n-grams using RWEKA
tdm <- TermDocumentMatrix(the_corpus, control = list(tokenize = BigramTokenizer,  weighting = weightTf)) # create tdm from n-grams
tdm
}

Obesity in the world (data from WHO)

We spent some time trying to figure out how to do the download process automatically but it did not work. We tried two different packages: gdata and httr without success.

Downloading pre-treated CSV files from GitHub.

The data initially was manually downloaded from Gapminder http://www.gapminder.org/data/

setwd("c:/data/is607")

fileUrl1 <- "https://raw.githubusercontent.com/fangseup88/finals/master/data_finals/Indicator_BMIfemale.csv"
fileUrl2 <- "https://raw.githubusercontent.com/fangseup88/finals/master/data_finals/Indicator_BMImale.csv"
fileUrl3 <- "https://raw.githubusercontent.com/fangseup88/finals/master/data_finals/Indicator_SBPfemale.csv"
fileUrl4 <- "https://raw.githubusercontent.com/fangseup88/finals/master/data_finals/Indicator_SBPmale.csv"
fileUrl5 <- "https://raw.githubusercontent.com/fangseup88/finals/master/data_finals/Indicator_TCfemale.csv"
fileUrl6 <- "https://raw.githubusercontent.com/fangseup88/finals/master/data_finals/Indicator_TCmale.csv"

download.file(fileUrl1, destfile="Indicator_BMIfemale.csv", mode="wb")
download.file(fileUrl2, destfile="Indicator_BMImale.csv", mode="wb")
download.file(fileUrl3, destfile="Indicator_SBPfemale.csv", mode="wb")
download.file(fileUrl4, destfile="Indicator_SBPmale.csv", mode="wb")
download.file(fileUrl5, destfile="Indicator_TCfemale.csv", mode="wb")
download.file(fileUrl6, destfile="Indicator_TCmale.csv", mode="wb")

BMIdataMale = read.csv("Indicator_BMImale.csv", header=TRUE, check.names = FALSE, stringsAsFactors=F)
BMIdataFemale = read.csv("Indicator_BMIfemale.csv", header=TRUE, check.names = FALSE, stringsAsFactors=F)
BPdataMale=read.csv("Indicator_SBPmale.csv", header=TRUE, check.names = FALSE, stringsAsFactors=F)
BPdataFemale=read.csv("Indicator_SBPfemale.csv", header=TRUE, check.names = FALSE, stringsAsFactors=F)
CHOLdataMale=read.csv("Indicator_TCmale.csv", header=TRUE, check.names = FALSE, stringsAsFactors=F)
CHOLdataFemale=read.csv("Indicator_TCfemale.csv", header=TRUE, check.names = FALSE, stringsAsFactors=F)

#Transforming the data from wide format to long format
bmiM <- BMIdataMale  %>% gather("Year","BMI", 2:30) %>% group_by(Year)
bmiF <- BMIdataFemale  %>% gather("Year","BMI", 2:30) %>% group_by(Year)
bpM <- BPdataMale  %>% gather("Year","BP", 2:30)%>% group_by(Year)
bpF <- BPdataFemale  %>% gather("Year","BP", 2:30)%>% group_by(Year)
cholM <- CHOLdataMale  %>% gather("Year","CHOL", 2:30)%>% group_by(Year)
cholF <- CHOLdataFemale  %>% gather("Year","CHOL", 2:30)%>% group_by(Year)

#Merging the data and adding Gender information
females_init <- merge(bmiF, bpF, by=c("Year", "Country"))
females <- merge(females_init, cholF, by=c("Year", "Country"))
head(females)

##   Year             Country      BMI       BP     CHOL
## 1 1980         Afghanistan 20.44348 122.0799 4.644476
## 2 1980             Albania 25.17427 132.2048 5.039529
## 3 1980             Algeria 23.67764 130.8334 4.976215
## 4 1980             Andorra 25.67324 136.6040 6.132187
## 5 1980              Angola 20.06763 130.0896 4.789354
## 6 1980 Antigua and Barbuda 24.22235 125.6712 5.013913

females$Gender <- 'F'
str(females)

## 'data.frame':    5771 obs. of  6 variables:
##  $ Year   : Factor w/ 29 levels "1980","1981",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Country: chr  "Afghanistan" "Albania" "Algeria" "Andorra" ...
##  $ BMI    : num  20.4 25.2 23.7 25.7 20.1 ...
##  $ BP     : num  122 132 131 137 130 ...
##  $ CHOL   : num  4.64 5.04 4.98 6.13 4.79 ...
##  $ Gender : chr  "F" "F" "F" "F" ...

females$Gender <- as.factor(females$Gender)

males_init <- merge(bmiM, bpM, by=c("Year", "Country"))
males <- merge(males_init, cholM, by=c("Year", "Country"))
head(males)

##   Year             Country      BMI       BP     CHOL
## 1 1980         Afghanistan 21.48678 125.1991 4.582847
## 2 1980             Albania 25.22533 132.9270 5.006371
## 3 1980             Algeria 22.25703 132.4093 4.925933
## 4 1980             Andorra 25.66652 140.8585 6.178972
## 5 1980              Angola 20.94876 135.1761 4.524107
## 6 1980 Antigua and Barbuda 23.31424 132.5722 4.868222

males$Gender <- 'M'
str(males)

## 'data.frame':    5771 obs. of  6 variables:
##  $ Year   : Factor w/ 29 levels "1980","1981",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Country: chr  "Afghanistan" "Albania" "Algeria" "Andorra" ...
##  $ BMI    : num  21.5 25.2 22.3 25.7 20.9 ...
##  $ BP     : num  125 133 132 141 135 ...
##  $ CHOL   : num  4.58 5.01 4.93 6.18 4.52 ...
##  $ Gender : chr  "M" "M" "M" "M" ...

males$Gender <- as.factor(males$Gender)

allData<-rbind(males, females)

str(allData)

## 'data.frame':    11542 obs. of  6 variables:
##  $ Year   : Factor w/ 29 levels "1980","1981",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Country: chr  "Afghanistan" "Albania" "Algeria" "Andorra" ...
##  $ BMI    : num  21.5 25.2 22.3 25.7 20.9 ...
##  $ BP     : num  125 133 132 141 135 ...
##  $ CHOL   : num  4.58 5.01 4.93 6.18 4.52 ...
##  $ Gender : Factor w/ 2 levels "M","F": 1 1 1 1 1 1 1 1 1 1 ...

Missing Data

# Remove any empty rows
allData<-allData[!is.na(allData$BMI),] 

allData<-allData[!is.na(allData$Country),]

library(psych)

## 
## Attaching package: 'psych'
## 
## The following object is masked from 'package:ggplot2':
## 
##     %+%

describe(allData)

## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning
## Inf

## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning
## -Inf

##          vars     n   mean   sd median trimmed   mad    min    max range
## Year*       1 11542  15.00 8.37  15.00   15.00 10.38   1.00  29.00 28.00
## Country*    2 11542    NaN   NA     NA     NaN    NA    Inf   -Inf  -Inf
## BMI         3 11542  24.34 2.49  24.72   24.31  2.45  18.47  35.02 16.55
## BP          4 11542 129.11 4.51 129.39  129.18  4.72 110.34 143.12 32.79
## CHOL        5 11542   4.85 0.48   4.83    4.84  0.51   3.73   6.24  2.52
## Gender*     6 11542   1.50 0.50   1.50    1.50  0.74   1.00   2.00  1.00
##           skew kurtosis   se
## Year*     0.00    -1.20 0.08
## Country*    NA       NA   NA
## BMI       0.15     0.01 0.02
## BP       -0.19    -0.13 0.04
## CHOL      0.25    -0.41 0.00
## Gender*   0.00    -2.00 0.00

str(allData)

## 'data.frame':    11542 obs. of  6 variables:
##  $ Year   : Factor w/ 29 levels "1980","1981",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Country: chr  "Afghanistan" "Albania" "Algeria" "Andorra" ...
##  $ BMI    : num  21.5 25.2 22.3 25.7 20.9 ...
##  $ BP     : num  125 133 132 141 135 ...
##  $ CHOL   : num  4.58 5.01 4.93 6.18 4.52 ...
##  $ Gender : Factor w/ 2 levels "M","F": 1 1 1 1 1 1 1 1 1 1 ...

Adding BMI and Blood Pressure classes to the data

allData$bmiClass <- cut(allData$BMI, breaks = c(0,18.5,25,30,50), labels = c('Underweight', 'Healthy', 'Overweight', 'Obese'))

allData$bpClass <- cut(allData$BP, breaks = c(0,110,120,140,160), labels = c('Normal', 'Pre-HBP', 'Stage1', 'Stage2'))

Graphical representations for BMI and Blood Pressure

ggplot(aes(x = Year, y = BMI, color = bmiClass, pch = Gender), 
       data = allData) +
  geom_jitter(alpha = 1) +
  scale_x_discrete(breaks = seq(1980,2008,2)) +
  scale_colour_brewer(palette=1) +  xlab("Years") +  ylab("Mean Body Mass Index") + ggtitle('Body Mass Index Evolution in the World Per Year\n')

It is showing that the BMI is going up.

ggplot(aes(x = Year, y = BP, color = bpClass, pch = Gender), 
       data = allData) +
  geom_jitter(alpha = 1) +
  scale_x_discrete(breaks = seq(1980,2008,2)) +
  scale_colour_brewer(palette=3) +  xlab("Years") +  ylab("Mean Blood Pressure") + ggtitle('Blood Pressure Evolution in the World Per Year\n')

The blood pressure went down slightly and stayed almost at the same level.

pairs.panels(allData)

Looking for correlations in the data with Parallel Coordinates Plots

ggparcoord(data=allData, columns = c("BMI", "CHOL", "BP"), groupColumn = "bpClass", order = "allClass", scale="uniminmax")

ggparcoord(data=allData, columns = c("BMI", "CHOL", "BP"), groupColumn = "bmiClass", order = "allClass", scale="uniminmax")

Analysing Data from CDC

library(survey)

## Warning: package 'survey' was built under R version 3.2.3

## 
## Attaching package: 'survey'
## 
## The following object is masked from 'package:graphics':
## 
##     dotchart

library(NHANES)
data(NHANESraw)

#Subsetting the initial dataset
subnhanes <- subset(NHANESraw, select=c(SDMVPSU, SDMVSTRA, Gender, Age, Weight, Height, BMI, Diabetes, BPSysAve, BPDiaAve, DirectChol, TotChol))

#Creating a factor variable for BMI classification
subnhanes$bmiClass <- cut(subnhanes$BMI, breaks = c(0,18.5,25,30,50), labels = c('Underweight', 'Healthy', 'Overweight', 'Obese'))

#Creating a factor variable for Blood Pressure classification based on two variables
subnhanes$bpClass <- ifelse((subnhanes$BPSysAve>=160 & subnhanes$BPDiaAve>=100),"Stage 2 HBP",ifelse((subnhanes$BPSysAve>=140 & subnhanes$BPDiaAve>=90),"Stage 1 HBP",ifelse((subnhanes$BPSysAve>=120 & subnhanes$BPDiaAve>=80),"Pre-HBP","Normal HBP")))
names(subnhanes)

##  [1] "SDMVPSU"    "SDMVSTRA"   "Gender"     "Age"        "Weight"    
##  [6] "Height"     "BMI"        "Diabetes"   "BPSysAve"   "BPDiaAve"  
## [11] "DirectChol" "TotChol"    "bmiClass"   "bpClass"

subnhanes$bpClass <-as.factor(subnhanes$bpClass)

#Taking care of missing values
dim(subnhanes)

## [1] 20293    14

subnhanes<-delete.na(subnhanes)
dim(subnhanes)

## [1] 13409    14

#Survey Design
bmi_design <- svydesign(id = ~ SDMVPSU, strata = ~ SDMVSTRA, nest = TRUE, weight = ~ BMI, data = subnhanes)

chol_design<- svydesign(id = ~ SDMVPSU, strata = ~ SDMVSTRA, nest = TRUE, weight = ~ TotChol, data = subnhanes)

bmi_design

## Stratified 1 - level Cluster Sampling design (with replacement)
## With (62) clusters.
## svydesign(id = ~SDMVPSU, strata = ~SDMVSTRA, nest = TRUE, weight = ~BMI, 
##     data = subnhanes)

chol_design

## Stratified 1 - level Cluster Sampling design (with replacement)
## With (62) clusters.
## svydesign(id = ~SDMVPSU, strata = ~SDMVSTRA, nest = TRUE, weight = ~TotChol, 
##     data = subnhanes)

#Predicting the probability of being diagnosed with hypertension based on age,bmi Diabetes, Weight
m1<-svyglm(bpClass~Age+bmiClass+Weight+TotChol+Diabetes, design=bmi_design, family=quasibinomial)

#Is BMI a good predictor of hypertension and hyperlipidemia?
m2<-svyglm(bmiClass~Age+TotChol+bpClass+Weight+Diabetes, design=bmi_design, family=quasibinomial)
summary(m1)

## 
## Call:
## svyglm(formula = bpClass ~ Age + bmiClass + Weight + TotChol + 
##     Diabetes, design = bmi_design, family = quasibinomial)
## 
## Survey design:
## svydesign(id = ~SDMVPSU, strata = ~SDMVSTRA, nest = TRUE, weight = ~BMI, 
##     data = subnhanes)
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -6.972084   0.301616 -23.116  < 2e-16 ***
## Age                 0.015050   0.001594   9.444 6.89e-10 ***
## bmiClassHealthy     0.592158   0.272711   2.171   0.0392 *  
## bmiClassOverweight  0.629760   0.283781   2.219   0.0354 *  
## bmiClassObese       0.517760   0.288699   1.793   0.0845 .  
## Weight              0.023768   0.001945  12.223 2.78e-12 ***
## TotChol             0.378347   0.022981  16.464 2.86e-15 ***
## DiabetesYes        -0.126422   0.098191  -1.288   0.2093    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasibinomial family taken to be 0.9155729)
## 
## Number of Fisher Scoring iterations: 6

summary(m2)

## 
## Call:
## svyglm(formula = bmiClass ~ Age + TotChol + bpClass + Weight + 
##     Diabetes, design = bmi_design, family = quasibinomial)
## 
## Survey design:
## svydesign(id = ~SDMVPSU, strata = ~SDMVSTRA, nest = TRUE, weight = ~BMI, 
##     data = subnhanes)
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -10.909339   0.380854 -28.644  < 2e-16 ***
## Age                 -0.010222   0.002473  -4.133  0.00033 ***
## TotChol              0.455146   0.070693   6.438 8.03e-07 ***
## bpClassPre-HBP       0.150515   0.379206   0.397  0.69466    
## bpClassStage 1 HBP  -0.452925   0.739555  -0.612  0.54557    
## bpClassStage 2 HBP  -2.715966   0.753124  -3.606  0.00129 ** 
## Weight               0.222438   0.005708  38.972  < 2e-16 ***
## DiabetesYes          0.772122   0.298154   2.590  0.01553 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasibinomial family taken to be 0.2325679)
## 
## Number of Fisher Scoring iterations: 9

library(party)
library(partykit)
if(require("party", quietly = TRUE)) plot(m1)

if(require("party", quietly = TRUE)) plot(m2)

According to our multivariate logistic regression analysis, the BMI seems to be significantly associated with blood pressure and the Cholesterol level.

Data from Social Media (Twitter)

We used a script for data scraping from Twitter API. We performed some preprocesing and created a text file named fromTwitter.txt. That text file can be downloaaded from GitHub. We searched Twitter for: obesity,overweight, body mass index, body fat, anti-obesity drug, appetite, weight control, abdominal obesity.

setwd("c:/data/is607")

fileUrl <- "https://raw.githubusercontent.com/fangseup88/finals/master/data_finals/fromTwitter.txt"

download.file(fileUrl, destfile="fromTwitter.txt", mode="wb")

data = read.csv("fromTwitter.txt", header=FALSE, check.names = FALSE, stringsAsFactors=F)

head(data)

##                                                                                     V1
## 1 Philly is tackling obesity with better planning Similar to our PlanHealth coalitions
## 2                                                 drops a childhood obesity diss track
## 3                     US presidential candidate warns of obesity among British Muslims
## 4                                   How obesity affects the hearts of kids as young as
## 5 AHR Sedentary behavior has dropped bybut US obesity amp diabetes have increasedhttp<U+0085>
## 6                        UTIA ampfinding RealLifeSolutions to fight childhood obesity<U+0085>

tdm <- preprocessing(data, 2)

tdm <- removeSparseTerms(tdm, 0.75)

tdm1 <- as.matrix (tdm)
freq <- colSums(tdm1)

freq <- sort(freq, decreasing = TRUE)
words <-names(freq)
library(wordcloud)

## Loading required package: RColorBrewer

# Create a WordCloud to Visualize the Text Data ---------------------------
m = as.matrix(tdm)
v = sort(rowSums(m),decreasing=TRUE)
d = data.frame(word = names(v),freq=v)
# Create the word cloud
pal = brewer.pal(9,"BuPu")
#wordcloud(words = d$word, freq = d$freq, scale = c(3,.8), random.order = F,           colors = pal)

#library(topicmodels)
#rowTotals <- apply(tdm , 1, sum) 
#tdm.new   <- tdm[rowTotals> 0, ]
#g = LDA(tdm.new,10,method = 'VEM',control=NULL,model=NULL)

Exploring the Document Term Matrix

tdm.matrix <- as.matrix(tdm)

topwords <- rowSums(tdm.matrix)
topwords <- as.numeric(topwords)
hist(topwords, breaks = 10)

findFreqTerms(tdm, lowfreq = 500) # find terms with a frequency higher than 1000

##  [1] "ageif u"               "appetit sleep"        
##  [3] "around abil"           "ask santa"            
##  [5] "avoid heart"           "behavior inhibit"     
##  [7] "bodi fat"              "boy wrote"            
##  [9] "brain lose"            "breast cancer"        
## [11] "bulli overweight"      "caus eat"             
## [13] "caus person<U+0092>"          "chang listen"         
## [15] "colleg destroy"        "control behavior"     
## [17] "control emot"          "control yr"           
## [19] "destroy appetit"       "donat bodi"           
## [21] "dont start"            "emot weight"          
## [23] "end overweight"        "fast safeti"          
## [25] "fat bodi"              "fat fast"             
## [27] "fat need"              "for u"                
## [29] "gain caus"             "gonna end"            
## [31] "guy ageif"             "heart att<U+0085>"           
## [33] "inhibit caus"          "kimk bmi"             
## [35] "letter ask"            "life onli"            
## [37] "like kimk"             "listen live"          
## [39] "live nowbesttalkradio" "lose bodi"            
## [41] "lose control"          "lose weight"          
## [43] "need lose"             "old boy"              
## [45] "onli thing"            "overweight like"      
## [47] "overweight sister"     "percept life"         
## [49] "person<U+0092> brain"         "risk sudden"          
## [51] "santa stop"            "schedul percept"      
## [53] "school <U+0085>"              "simpl step"           
## [55] "sister school"         "sleep schedul"        
## [57] "start control"         "step to"              
## [59] "stop bulli"            "stuck around"         
## [61] "thing stuck"           "to lose"              
## [63] "u dont"                "u gonna"              
## [65] "u guy"                 "weight control"       
## [67] "weight gain"           "weight loss"          
## [69] "weight u"              "weight will"          
## [71] "when control"          "will chang"           
## [73] "wish donat"            "wrote letter"         
## [75] "year old"              "yr weight"

head(sort(topwords, decreasing = TRUE))

## [1] 3714 1761 1413 1165  983  980

Conclusion

There was a significant relationship between the BMI, hypertension and high cholesterol. Finding data for this project was a little bit hard. My plan is to use the same approach (automatic data collection) for discovering how to deal with obesity in term of weight loss.