In this project we are interested in Obesity and health risk factors associated to it, which are huge. Knowing that obesity is not all about aesthetics and given all the tools and techniques we have learned in class, we will like to learn more about that topic. Our objective is to understand avaiable facts and data relevant to the topic. BMI (body mass index), Cholesterol, and hypertension are some of the variables used in risk factors datasets. We will spent some time exploring the correlations between those variables and see for example if BMI is a good predictor of hypertension and hyperlipidia. Our analysis will be based on tree differents data sources.
library(NHANES)
library(RWeka)
## Warning: package 'RWeka' was built under R version 3.2.3
library(party)
## Warning: package 'party' was built under R version 3.2.3
## Loading required package: grid
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Warning: package 'strucchange' was built under R version 3.2.3
## Loading required package: zoo
##
## Attaching package: 'zoo'
##
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
##
## Loading required package: sandwich
library(partykit)
## Warning: package 'partykit' was built under R version 3.2.3
##
## Attaching package: 'partykit'
##
## The following objects are masked from 'package:party':
##
## cforest, ctree, ctree_control, edge_simple, mob, mob_control,
## node_barplot, node_bivplot, node_boxplot, node_inner,
## node_surv, node_terminal
library(tm)
## Loading required package: NLP
library(SnowballC)
library(stringr)
##
## Attaching package: 'stringr'
##
## The following object is masked from 'package:strucchange':
##
## boundary
library(RTextTools)
## Loading required package: SparseM
##
## Attaching package: 'SparseM'
##
## The following object is masked from 'package:base':
##
## backsolve
##
##
## Attaching package: 'RTextTools'
##
## The following objects are masked from 'package:SnowballC':
##
## getStemLanguages, wordStem
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
##
## Attaching package: 'ggplot2'
##
## The following object is masked from 'package:NLP':
##
## annotate
library(GGally)
##
## Attaching package: 'GGally'
##
## The following object is masked from 'package:dplyr':
##
## nasa
#The following function will take care of the rows with NAs
delete.na <- function(dframe, n=0) {
part <- apply(dframe, 2, is.na)
partindex <- apply(part, 1, function(x) sum(x) <= n)
dframe[partindex, ]
}
#The function below will be used for social media text mining
preprocessing <- function(tweets, nbgrams)
{
#Creating the coupus from the directory files
the_corpus <- Corpus(VectorSource(tweets))
#Removing punctuation
the_corpus <- tm_map(the_corpus, removePunctuation)
#Removing numbers
the_corpus <- tm_map(the_corpus, removeNumbers)
the_corpus <- tm_map(the_corpus, stemDocument)
#Removing words that usually have no analytic value
the_corpus <- tm_map(the_corpus, removeWords, stopwords("english"))
#Removing white spaces
the_corpus <- tm_map(the_corpus, stripWhitespace)
the_corpus <- tm_map(the_corpus, content_transformer(tolower))
the_corpus <- tm_map(the_corpus, PlainTextDocument)
options(mc.cores=1)
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = nbgrams, max = nbgrams)) # create n-grams using RWEKA
tdm <- TermDocumentMatrix(the_corpus, control = list(tokenize = BigramTokenizer, weighting = weightTf)) # create tdm from n-grams
tdm
}
We spent some time trying to figure out how to do the download process automatically but it did not work. We tried two different packages: gdata and httr without success.
The data initially was manually downloaded from Gapminder http://www.gapminder.org/data/
setwd("c:/data/is607")
fileUrl1 <- "https://raw.githubusercontent.com/fangseup88/finals/master/data_finals/Indicator_BMIfemale.csv"
fileUrl2 <- "https://raw.githubusercontent.com/fangseup88/finals/master/data_finals/Indicator_BMImale.csv"
fileUrl3 <- "https://raw.githubusercontent.com/fangseup88/finals/master/data_finals/Indicator_SBPfemale.csv"
fileUrl4 <- "https://raw.githubusercontent.com/fangseup88/finals/master/data_finals/Indicator_SBPmale.csv"
fileUrl5 <- "https://raw.githubusercontent.com/fangseup88/finals/master/data_finals/Indicator_TCfemale.csv"
fileUrl6 <- "https://raw.githubusercontent.com/fangseup88/finals/master/data_finals/Indicator_TCmale.csv"
download.file(fileUrl1, destfile="Indicator_BMIfemale.csv", mode="wb")
download.file(fileUrl2, destfile="Indicator_BMImale.csv", mode="wb")
download.file(fileUrl3, destfile="Indicator_SBPfemale.csv", mode="wb")
download.file(fileUrl4, destfile="Indicator_SBPmale.csv", mode="wb")
download.file(fileUrl5, destfile="Indicator_TCfemale.csv", mode="wb")
download.file(fileUrl6, destfile="Indicator_TCmale.csv", mode="wb")
BMIdataMale = read.csv("Indicator_BMImale.csv", header=TRUE, check.names = FALSE, stringsAsFactors=F)
BMIdataFemale = read.csv("Indicator_BMIfemale.csv", header=TRUE, check.names = FALSE, stringsAsFactors=F)
BPdataMale=read.csv("Indicator_SBPmale.csv", header=TRUE, check.names = FALSE, stringsAsFactors=F)
BPdataFemale=read.csv("Indicator_SBPfemale.csv", header=TRUE, check.names = FALSE, stringsAsFactors=F)
CHOLdataMale=read.csv("Indicator_TCmale.csv", header=TRUE, check.names = FALSE, stringsAsFactors=F)
CHOLdataFemale=read.csv("Indicator_TCfemale.csv", header=TRUE, check.names = FALSE, stringsAsFactors=F)
#Transforming the data from wide format to long format
bmiM <- BMIdataMale %>% gather("Year","BMI", 2:30) %>% group_by(Year)
bmiF <- BMIdataFemale %>% gather("Year","BMI", 2:30) %>% group_by(Year)
bpM <- BPdataMale %>% gather("Year","BP", 2:30)%>% group_by(Year)
bpF <- BPdataFemale %>% gather("Year","BP", 2:30)%>% group_by(Year)
cholM <- CHOLdataMale %>% gather("Year","CHOL", 2:30)%>% group_by(Year)
cholF <- CHOLdataFemale %>% gather("Year","CHOL", 2:30)%>% group_by(Year)
#Merging the data and adding Gender information
females_init <- merge(bmiF, bpF, by=c("Year", "Country"))
females <- merge(females_init, cholF, by=c("Year", "Country"))
head(females)
## Year Country BMI BP CHOL
## 1 1980 Afghanistan 20.44348 122.0799 4.644476
## 2 1980 Albania 25.17427 132.2048 5.039529
## 3 1980 Algeria 23.67764 130.8334 4.976215
## 4 1980 Andorra 25.67324 136.6040 6.132187
## 5 1980 Angola 20.06763 130.0896 4.789354
## 6 1980 Antigua and Barbuda 24.22235 125.6712 5.013913
females$Gender <- 'F'
str(females)
## 'data.frame': 5771 obs. of 6 variables:
## $ Year : Factor w/ 29 levels "1980","1981",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Country: chr "Afghanistan" "Albania" "Algeria" "Andorra" ...
## $ BMI : num 20.4 25.2 23.7 25.7 20.1 ...
## $ BP : num 122 132 131 137 130 ...
## $ CHOL : num 4.64 5.04 4.98 6.13 4.79 ...
## $ Gender : chr "F" "F" "F" "F" ...
females$Gender <- as.factor(females$Gender)
males_init <- merge(bmiM, bpM, by=c("Year", "Country"))
males <- merge(males_init, cholM, by=c("Year", "Country"))
head(males)
## Year Country BMI BP CHOL
## 1 1980 Afghanistan 21.48678 125.1991 4.582847
## 2 1980 Albania 25.22533 132.9270 5.006371
## 3 1980 Algeria 22.25703 132.4093 4.925933
## 4 1980 Andorra 25.66652 140.8585 6.178972
## 5 1980 Angola 20.94876 135.1761 4.524107
## 6 1980 Antigua and Barbuda 23.31424 132.5722 4.868222
males$Gender <- 'M'
str(males)
## 'data.frame': 5771 obs. of 6 variables:
## $ Year : Factor w/ 29 levels "1980","1981",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Country: chr "Afghanistan" "Albania" "Algeria" "Andorra" ...
## $ BMI : num 21.5 25.2 22.3 25.7 20.9 ...
## $ BP : num 125 133 132 141 135 ...
## $ CHOL : num 4.58 5.01 4.93 6.18 4.52 ...
## $ Gender : chr "M" "M" "M" "M" ...
males$Gender <- as.factor(males$Gender)
allData<-rbind(males, females)
str(allData)
## 'data.frame': 11542 obs. of 6 variables:
## $ Year : Factor w/ 29 levels "1980","1981",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Country: chr "Afghanistan" "Albania" "Algeria" "Andorra" ...
## $ BMI : num 21.5 25.2 22.3 25.7 20.9 ...
## $ BP : num 125 133 132 141 135 ...
## $ CHOL : num 4.58 5.01 4.93 6.18 4.52 ...
## $ Gender : Factor w/ 2 levels "M","F": 1 1 1 1 1 1 1 1 1 1 ...
# Remove any empty rows
allData<-allData[!is.na(allData$BMI),]
allData<-allData[!is.na(allData$Country),]
library(psych)
##
## Attaching package: 'psych'
##
## The following object is masked from 'package:ggplot2':
##
## %+%
describe(allData)
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning
## Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning
## -Inf
## vars n mean sd median trimmed mad min max range
## Year* 1 11542 15.00 8.37 15.00 15.00 10.38 1.00 29.00 28.00
## Country* 2 11542 NaN NA NA NaN NA Inf -Inf -Inf
## BMI 3 11542 24.34 2.49 24.72 24.31 2.45 18.47 35.02 16.55
## BP 4 11542 129.11 4.51 129.39 129.18 4.72 110.34 143.12 32.79
## CHOL 5 11542 4.85 0.48 4.83 4.84 0.51 3.73 6.24 2.52
## Gender* 6 11542 1.50 0.50 1.50 1.50 0.74 1.00 2.00 1.00
## skew kurtosis se
## Year* 0.00 -1.20 0.08
## Country* NA NA NA
## BMI 0.15 0.01 0.02
## BP -0.19 -0.13 0.04
## CHOL 0.25 -0.41 0.00
## Gender* 0.00 -2.00 0.00
str(allData)
## 'data.frame': 11542 obs. of 6 variables:
## $ Year : Factor w/ 29 levels "1980","1981",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Country: chr "Afghanistan" "Albania" "Algeria" "Andorra" ...
## $ BMI : num 21.5 25.2 22.3 25.7 20.9 ...
## $ BP : num 125 133 132 141 135 ...
## $ CHOL : num 4.58 5.01 4.93 6.18 4.52 ...
## $ Gender : Factor w/ 2 levels "M","F": 1 1 1 1 1 1 1 1 1 1 ...
allData$bmiClass <- cut(allData$BMI, breaks = c(0,18.5,25,30,50), labels = c('Underweight', 'Healthy', 'Overweight', 'Obese'))
allData$bpClass <- cut(allData$BP, breaks = c(0,110,120,140,160), labels = c('Normal', 'Pre-HBP', 'Stage1', 'Stage2'))
ggplot(aes(x = Year, y = BMI, color = bmiClass, pch = Gender),
data = allData) +
geom_jitter(alpha = 1) +
scale_x_discrete(breaks = seq(1980,2008,2)) +
scale_colour_brewer(palette=1) + xlab("Years") + ylab("Mean Body Mass Index") + ggtitle('Body Mass Index Evolution in the World Per Year\n')
It is showing that the BMI is going up.
ggplot(aes(x = Year, y = BP, color = bpClass, pch = Gender),
data = allData) +
geom_jitter(alpha = 1) +
scale_x_discrete(breaks = seq(1980,2008,2)) +
scale_colour_brewer(palette=3) + xlab("Years") + ylab("Mean Blood Pressure") + ggtitle('Blood Pressure Evolution in the World Per Year\n')
The blood pressure went down slightly and stayed almost at the same level.
pairs.panels(allData)
ggparcoord(data=allData, columns = c("BMI", "CHOL", "BP"), groupColumn = "bpClass", order = "allClass", scale="uniminmax")
ggparcoord(data=allData, columns = c("BMI", "CHOL", "BP"), groupColumn = "bmiClass", order = "allClass", scale="uniminmax")
library(survey)
## Warning: package 'survey' was built under R version 3.2.3
##
## Attaching package: 'survey'
##
## The following object is masked from 'package:graphics':
##
## dotchart
library(NHANES)
data(NHANESraw)
#Subsetting the initial dataset
subnhanes <- subset(NHANESraw, select=c(SDMVPSU, SDMVSTRA, Gender, Age, Weight, Height, BMI, Diabetes, BPSysAve, BPDiaAve, DirectChol, TotChol))
#Creating a factor variable for BMI classification
subnhanes$bmiClass <- cut(subnhanes$BMI, breaks = c(0,18.5,25,30,50), labels = c('Underweight', 'Healthy', 'Overweight', 'Obese'))
#Creating a factor variable for Blood Pressure classification based on two variables
subnhanes$bpClass <- ifelse((subnhanes$BPSysAve>=160 & subnhanes$BPDiaAve>=100),"Stage 2 HBP",ifelse((subnhanes$BPSysAve>=140 & subnhanes$BPDiaAve>=90),"Stage 1 HBP",ifelse((subnhanes$BPSysAve>=120 & subnhanes$BPDiaAve>=80),"Pre-HBP","Normal HBP")))
names(subnhanes)
## [1] "SDMVPSU" "SDMVSTRA" "Gender" "Age" "Weight"
## [6] "Height" "BMI" "Diabetes" "BPSysAve" "BPDiaAve"
## [11] "DirectChol" "TotChol" "bmiClass" "bpClass"
subnhanes$bpClass <-as.factor(subnhanes$bpClass)
#Taking care of missing values
dim(subnhanes)
## [1] 20293 14
subnhanes<-delete.na(subnhanes)
dim(subnhanes)
## [1] 13409 14
#Survey Design
bmi_design <- svydesign(id = ~ SDMVPSU, strata = ~ SDMVSTRA, nest = TRUE, weight = ~ BMI, data = subnhanes)
chol_design<- svydesign(id = ~ SDMVPSU, strata = ~ SDMVSTRA, nest = TRUE, weight = ~ TotChol, data = subnhanes)
bmi_design
## Stratified 1 - level Cluster Sampling design (with replacement)
## With (62) clusters.
## svydesign(id = ~SDMVPSU, strata = ~SDMVSTRA, nest = TRUE, weight = ~BMI,
## data = subnhanes)
chol_design
## Stratified 1 - level Cluster Sampling design (with replacement)
## With (62) clusters.
## svydesign(id = ~SDMVPSU, strata = ~SDMVSTRA, nest = TRUE, weight = ~TotChol,
## data = subnhanes)
#Predicting the probability of being diagnosed with hypertension based on age,bmi Diabetes, Weight
m1<-svyglm(bpClass~Age+bmiClass+Weight+TotChol+Diabetes, design=bmi_design, family=quasibinomial)
#Is BMI a good predictor of hypertension and hyperlipidemia?
m2<-svyglm(bmiClass~Age+TotChol+bpClass+Weight+Diabetes, design=bmi_design, family=quasibinomial)
summary(m1)
##
## Call:
## svyglm(formula = bpClass ~ Age + bmiClass + Weight + TotChol +
## Diabetes, design = bmi_design, family = quasibinomial)
##
## Survey design:
## svydesign(id = ~SDMVPSU, strata = ~SDMVSTRA, nest = TRUE, weight = ~BMI,
## data = subnhanes)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.972084 0.301616 -23.116 < 2e-16 ***
## Age 0.015050 0.001594 9.444 6.89e-10 ***
## bmiClassHealthy 0.592158 0.272711 2.171 0.0392 *
## bmiClassOverweight 0.629760 0.283781 2.219 0.0354 *
## bmiClassObese 0.517760 0.288699 1.793 0.0845 .
## Weight 0.023768 0.001945 12.223 2.78e-12 ***
## TotChol 0.378347 0.022981 16.464 2.86e-15 ***
## DiabetesYes -0.126422 0.098191 -1.288 0.2093
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasibinomial family taken to be 0.9155729)
##
## Number of Fisher Scoring iterations: 6
summary(m2)
##
## Call:
## svyglm(formula = bmiClass ~ Age + TotChol + bpClass + Weight +
## Diabetes, design = bmi_design, family = quasibinomial)
##
## Survey design:
## svydesign(id = ~SDMVPSU, strata = ~SDMVSTRA, nest = TRUE, weight = ~BMI,
## data = subnhanes)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -10.909339 0.380854 -28.644 < 2e-16 ***
## Age -0.010222 0.002473 -4.133 0.00033 ***
## TotChol 0.455146 0.070693 6.438 8.03e-07 ***
## bpClassPre-HBP 0.150515 0.379206 0.397 0.69466
## bpClassStage 1 HBP -0.452925 0.739555 -0.612 0.54557
## bpClassStage 2 HBP -2.715966 0.753124 -3.606 0.00129 **
## Weight 0.222438 0.005708 38.972 < 2e-16 ***
## DiabetesYes 0.772122 0.298154 2.590 0.01553 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasibinomial family taken to be 0.2325679)
##
## Number of Fisher Scoring iterations: 9
library(party)
library(partykit)
if(require("party", quietly = TRUE)) plot(m1)
if(require("party", quietly = TRUE)) plot(m2)
According to our multivariate logistic regression analysis, the BMI seems to be significantly associated with blood pressure and the Cholesterol level.
There was a significant relationship between the BMI, hypertension and high cholesterol. Finding data for this project was a little bit hard. My plan is to use the same approach (automatic data collection) for discovering how to deal with obesity in term of weight loss.