Obesity in the United States has been increasingly cited as a major health issue in recent decades, resulting in diseases such as coronary heart disease that lead to mortality.While many industrialized countries have experienced similar increases, obesity rates in the United States are among the highest in the world.

Obesity has continued to grow within the United States. Two out of every three Americans are considered to be overweight or obese. The United States contains one of the highest percentage of obese people in the world. Obesity has led to over 120,000 preventable deaths each year in the United States. An obese person in America incurs an average of 1,429 Dollars more in medical expenses annually. Approximately 147 billion Dollars is spent in added medical expenses per year within the United States.This number is expected to increase approximately 1.24 billion Dollars per year until the year 2030.


Aim of The project

The main aim of this project is to study the states which had the most Obese Population amongst Adults and Children as well as Teens in USA.Secondly,another objective of this project is to learn how to scrape data in R from an HTML page using rvest package and generate beautiful maps using ggplot and maps package in R.


Loading the required Packages

require(rvest)
#rvest is the package to scrape Web pages in R

require(ggplot2)
require(dplyr)
require(scales)
require(maps)
require(mapproj)

Now Scraping the HTML page and converting it to a R Data frame

#Loading the Data--------------

obesity<-read_html("https://en.wikipedia.org/wiki/Obesity_in_the_United_States")


#html_nodes() to select a particular HTML element from the above page
#Converting to a R dataframe
#xpath of the Wikipedia table data
obesity = obesity %>%
  html_nodes(xpath='//*[@id="mw-content-text"]/div/table[2]') %>% 
  .[[1]] %>%
  html_table(fill=T)


head(obesity)
##   State and District of Columbia Obese adults
## 1                        Alabama        30.1%
## 2                         Alaska        27.3%
## 3                        Arizona        23.3%
## 4                       Arkansas        28.1%
## 5                     California        23.1%
## 6                       Colorado        21.0%
##   Overweight (incl. obese) adults Obese children and adolescents
## 1                           65.4%                          16.7%
## 2                           64.5%                          11.1%
## 3                           59.5%                          12.2%
## 4                           64.7%                          16.4%
## 5                           59.4%                          13.2%
## 6                           55.0%                           9.9%
##   Obesity rank
## 1            3
## 2           14
## 3           40
## 4            9
## 5           41
## 6           51
#Cleaning the Data 
str(obesity)
## 'data.frame':    51 obs. of  5 variables:
##  $ State and District of Columbia : chr  "Alabama" "Alaska" "Arizona" "Arkansas" ...
##  $ Obese adults                   : chr  "30.1%" "27.3%" "23.3%" "28.1%" ...
##  $ Overweight (incl. obese) adults: chr  "65.4%" "64.5%" "59.5%" "64.7%" ...
##  $ Obese children and adolescents : chr  "16.7%" "11.1%" "12.2%" "16.4%" ...
##  $ Obesity rank                   : int  3 14 40 9 41 51 49 43 22 39 ...
#removing the % and making the data numeric

for(i in 2:4){
  obesity[,i] = gsub("%", "", obesity[,i])
  obesity[,i] = as.numeric(obesity[,i])
}

str(obesity)
## 'data.frame':    51 obs. of  5 variables:
##  $ State and District of Columbia : chr  "Alabama" "Alaska" "Arizona" "Arkansas" ...
##  $ Obese adults                   : num  30.1 27.3 23.3 28.1 23.1 21 20.8 22.1 25.9 23.3 ...
##  $ Overweight (incl. obese) adults: num  65.4 64.5 59.5 64.7 59.4 55 58.7 55 63.9 60.8 ...
##  $ Obese children and adolescents : num  16.7 11.1 12.2 16.4 13.2 9.9 12.3 14.8 22.8 14.4 ...
##  $ Obesity rank                   : int  3 14 40 9 41 51 49 43 22 39 ...
#Fixing the names to remove spaces

names(obesity)
## [1] "State and District of Columbia"  "Obese adults"                   
## [3] "Overweight (incl. obese) adults" "Obese children and adolescents" 
## [5] "Obesity rank"
names(obesity) = make.names(names(obesity))
names(obesity)
## [1] "State.and.District.of.Columbia"  "Obese.adults"                   
## [3] "Overweight..incl..obese..adults" "Obese.children.and.adolescents" 
## [5] "Obesity.rank"

Now loading the Map data for USA and merging it with Obesity Data frame

#Loading the map data-----------------

states = map_data("state")

# create a new variable region for state
obesity$region = tolower(obesity$State.and.District.of.Columbia)

#merging the datasets
states = merge(states, obesity, by="region", all.x=T)
str(states)
## 'data.frame':    15537 obs. of  11 variables:
##  $ region                         : chr  "alabama" "alabama" "alabama" "alabama" ...
##  $ long                           : num  -87.5 -87.5 -87.5 -87.5 -87.6 ...
##  $ lat                            : num  30.4 30.4 30.4 30.3 30.3 ...
##  $ group                          : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ order                          : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ subregion                      : chr  NA NA NA NA ...
##  $ State.and.District.of.Columbia : chr  "Alabama" "Alabama" "Alabama" "Alabama" ...
##  $ Obese.adults                   : num  30.1 30.1 30.1 30.1 30.1 30.1 30.1 30.1 30.1 30.1 ...
##  $ Overweight..incl..obese..adults: num  65.4 65.4 65.4 65.4 65.4 65.4 65.4 65.4 65.4 65.4 ...
##  $ Obese.children.and.adolescents : num  16.7 16.7 16.7 16.7 16.7 16.7 16.7 16.7 16.7 16.7 ...
##  $ Obesity.rank                   : int  3 3 3 3 3 3 3 3 3 3 ...

Plotting the States with Most Obese Adults

#for adding Names to the states in the Map- making a new data frame

statenames = states %>% 
  group_by(region) %>%
  summarise(
    long = mean(range(long)), 
    lat = mean(range(lat)), 
    group = mean(group), 
    Obese.adults = mean(Obese.adults), 
    Obese.children.and.adolescents = mean(Obese.children.and.adolescents)
  )
## Warning: package 'bindrcpp' was built under R version 3.4.0
#Data frame consisting of top 10 Most Obese Adults States 
topstate = states %>% 
  group_by(region) %>%
  summarise(
    
    Obese.adults = mean(Obese.adults), 
    Obese.children.and.adolescents = mean(Obese.children.and.adolescents)
    
  ) %>%
  arrange(desc(Obese.adults)) %>%
  top_n(10)
## Selecting by Obese.children.and.adolescents
#Plotting the top 10 states 

ggplot(aes(x = reorder(region,Obese.adults), y = Obese.adults),data = topstate) + 
  geom_col(color="black",fill="#1EDBC2",alpha=0.6) +
  labs(y = "Percentage of Obese Adults",x="Top 10 States") +
  coord_flip()

The State with Highest Obese Adult Population is Mississippi.


Plotting a Map for Most Obese Adult Population

#Plotting the data on a map------------------------


#For adults

ggplot(states, aes(x = long, y = lat, group = group, fill = Obese.adults)) + 
  geom_polygon(color = "white",show.legend = T) +
  scale_fill_gradient(name = "Percent", low = "#FAB8D2", high = "#F91C74", guide = "colorbar", na.value="black", breaks = pretty_breaks(n = 5)) +
  labs(title="Obesity in Adults for USA",x = "Longitude",y = "Latitude") +
  coord_map() +
  #adding States names to the states on the map
  geom_text(data=statenames, aes(x = long, y = lat, label = region), size=3)

which.min(x = statenames$Obese.adults)
## [1] 6

Now Analyzing the Obese Children and Teen Population

#Now Analyzing the Obese Children and Teens

#Finding top 15 States with Most Obese Children and Teens
topChild = states %>%
      group_by(region) %>%
      summarise(Obese.Child.and.Teens = mean(Obese.children.and.adolescents)) %>%
      top_n(15)
## Selecting by Obese.Child.and.Teens
#Barplot

ggplot(data = topChild, aes(x = reorder(region,Obese.Child.and.Teens), y = Obese.Child.and.Teens))+
        geom_col(color="black",fill="#6EE543",alpha=0.8) +
        coord_flip()

#Map for Obesity in Children 

ggplot(states, aes(x = long, y = lat, group = group, fill = Obese.children.and.adolescents)) + 
  geom_polygon(color = "white") +
  scale_fill_gradient(name = "Percent Obese", low = "#B8D5EC", high = "#0A4B7D", guide = "colorbar", na.value="black", breaks = pretty_breaks(n = 5)) +
  labs(title="Obesity in Children and Teens", x = "Longitude",y = "latitude") +
  coord_map() + 
  #adding States names to the states on the map
  geom_text(data=statenames, aes(x = long, y = lat, label = region), size=3)

#Most Obese Children and Teens for Delaware State

#Barplot

ggplot(aes(x = reorder(region,Obese.children.and.adolescents),y = Obese.children.and.adolescents),
           data = statenames) + 
  geom_col(color="black",fill="#F43E3E",width=1)   +
  coord_flip() + 
  labs(x = "States", y ="Percentage of Obese Children and Teens",title="Barplot of Obese Children and Teens")

The State with Highest Obese Teen and Children population is Delaware.


Conclusion

In this project we firstly learned to scrape data using ‘rvest’ package from wikipedia and then analyzed and visualized the States with most Obese Adult and children population.We also learned how to create beautifull maps using ‘ggplot’ and ‘maps’ packages in R.

Hope you guys likes the article.