Obesity in the United States has been increasingly cited as a major health issue in recent decades, resulting in diseases such as coronary heart disease that lead to mortality.While many industrialized countries have experienced similar increases, obesity rates in the United States are among the highest in the world.
Obesity has continued to grow within the United States. Two out of every three Americans are considered to be overweight or obese. The United States contains one of the highest percentage of obese people in the world. Obesity has led to over 120,000 preventable deaths each year in the United States. An obese person in America incurs an average of 1,429 Dollars more in medical expenses annually. Approximately 147 billion Dollars is spent in added medical expenses per year within the United States.This number is expected to increase approximately 1.24 billion Dollars per year until the year 2030.
The main aim of this project is to study the states which had the most Obese Population amongst Adults and Children as well as Teens in USA.Secondly,another objective of this project is to learn how to scrape data in R from an HTML page using rvest package and generate beautiful maps using ggplot and maps package in R.
require(rvest)
#rvest is the package to scrape Web pages in R
require(ggplot2)
require(dplyr)
require(scales)
require(maps)
require(mapproj)
#Loading the Data--------------
obesity<-read_html("https://en.wikipedia.org/wiki/Obesity_in_the_United_States")
#html_nodes() to select a particular HTML element from the above page
#Converting to a R dataframe
#xpath of the Wikipedia table data
obesity = obesity %>%
html_nodes(xpath='//*[@id="mw-content-text"]/div/table[2]') %>%
.[[1]] %>%
html_table(fill=T)
head(obesity)
## State and District of Columbia Obese adults
## 1 Alabama 30.1%
## 2 Alaska 27.3%
## 3 Arizona 23.3%
## 4 Arkansas 28.1%
## 5 California 23.1%
## 6 Colorado 21.0%
## Overweight (incl. obese) adults Obese children and adolescents
## 1 65.4% 16.7%
## 2 64.5% 11.1%
## 3 59.5% 12.2%
## 4 64.7% 16.4%
## 5 59.4% 13.2%
## 6 55.0% 9.9%
## Obesity rank
## 1 3
## 2 14
## 3 40
## 4 9
## 5 41
## 6 51
#Cleaning the Data
str(obesity)
## 'data.frame': 51 obs. of 5 variables:
## $ State and District of Columbia : chr "Alabama" "Alaska" "Arizona" "Arkansas" ...
## $ Obese adults : chr "30.1%" "27.3%" "23.3%" "28.1%" ...
## $ Overweight (incl. obese) adults: chr "65.4%" "64.5%" "59.5%" "64.7%" ...
## $ Obese children and adolescents : chr "16.7%" "11.1%" "12.2%" "16.4%" ...
## $ Obesity rank : int 3 14 40 9 41 51 49 43 22 39 ...
#removing the % and making the data numeric
for(i in 2:4){
obesity[,i] = gsub("%", "", obesity[,i])
obesity[,i] = as.numeric(obesity[,i])
}
str(obesity)
## 'data.frame': 51 obs. of 5 variables:
## $ State and District of Columbia : chr "Alabama" "Alaska" "Arizona" "Arkansas" ...
## $ Obese adults : num 30.1 27.3 23.3 28.1 23.1 21 20.8 22.1 25.9 23.3 ...
## $ Overweight (incl. obese) adults: num 65.4 64.5 59.5 64.7 59.4 55 58.7 55 63.9 60.8 ...
## $ Obese children and adolescents : num 16.7 11.1 12.2 16.4 13.2 9.9 12.3 14.8 22.8 14.4 ...
## $ Obesity rank : int 3 14 40 9 41 51 49 43 22 39 ...
#Fixing the names to remove spaces
names(obesity)
## [1] "State and District of Columbia" "Obese adults"
## [3] "Overweight (incl. obese) adults" "Obese children and adolescents"
## [5] "Obesity rank"
names(obesity) = make.names(names(obesity))
names(obesity)
## [1] "State.and.District.of.Columbia" "Obese.adults"
## [3] "Overweight..incl..obese..adults" "Obese.children.and.adolescents"
## [5] "Obesity.rank"
#Loading the map data-----------------
states = map_data("state")
# create a new variable region for state
obesity$region = tolower(obesity$State.and.District.of.Columbia)
#merging the datasets
states = merge(states, obesity, by="region", all.x=T)
str(states)
## 'data.frame': 15537 obs. of 11 variables:
## $ region : chr "alabama" "alabama" "alabama" "alabama" ...
## $ long : num -87.5 -87.5 -87.5 -87.5 -87.6 ...
## $ lat : num 30.4 30.4 30.4 30.3 30.3 ...
## $ group : num 1 1 1 1 1 1 1 1 1 1 ...
## $ order : int 1 2 3 4 5 6 7 8 9 10 ...
## $ subregion : chr NA NA NA NA ...
## $ State.and.District.of.Columbia : chr "Alabama" "Alabama" "Alabama" "Alabama" ...
## $ Obese.adults : num 30.1 30.1 30.1 30.1 30.1 30.1 30.1 30.1 30.1 30.1 ...
## $ Overweight..incl..obese..adults: num 65.4 65.4 65.4 65.4 65.4 65.4 65.4 65.4 65.4 65.4 ...
## $ Obese.children.and.adolescents : num 16.7 16.7 16.7 16.7 16.7 16.7 16.7 16.7 16.7 16.7 ...
## $ Obesity.rank : int 3 3 3 3 3 3 3 3 3 3 ...
#for adding Names to the states in the Map- making a new data frame
statenames = states %>%
group_by(region) %>%
summarise(
long = mean(range(long)),
lat = mean(range(lat)),
group = mean(group),
Obese.adults = mean(Obese.adults),
Obese.children.and.adolescents = mean(Obese.children.and.adolescents)
)
## Warning: package 'bindrcpp' was built under R version 3.4.0
#Data frame consisting of top 10 Most Obese Adults States
topstate = states %>%
group_by(region) %>%
summarise(
Obese.adults = mean(Obese.adults),
Obese.children.and.adolescents = mean(Obese.children.and.adolescents)
) %>%
arrange(desc(Obese.adults)) %>%
top_n(10)
## Selecting by Obese.children.and.adolescents
#Plotting the top 10 states
ggplot(aes(x = reorder(region,Obese.adults), y = Obese.adults),data = topstate) +
geom_col(color="black",fill="#1EDBC2",alpha=0.6) +
labs(y = "Percentage of Obese Adults",x="Top 10 States") +
coord_flip()
The State with Highest Obese Adult Population is Mississippi.
#Plotting the data on a map------------------------
#For adults
ggplot(states, aes(x = long, y = lat, group = group, fill = Obese.adults)) +
geom_polygon(color = "white",show.legend = T) +
scale_fill_gradient(name = "Percent", low = "#FAB8D2", high = "#F91C74", guide = "colorbar", na.value="black", breaks = pretty_breaks(n = 5)) +
labs(title="Obesity in Adults for USA",x = "Longitude",y = "Latitude") +
coord_map() +
#adding States names to the states on the map
geom_text(data=statenames, aes(x = long, y = lat, label = region), size=3)
which.min(x = statenames$Obese.adults)
## [1] 6
#Now Analyzing the Obese Children and Teens
#Finding top 15 States with Most Obese Children and Teens
topChild = states %>%
group_by(region) %>%
summarise(Obese.Child.and.Teens = mean(Obese.children.and.adolescents)) %>%
top_n(15)
## Selecting by Obese.Child.and.Teens
#Barplot
ggplot(data = topChild, aes(x = reorder(region,Obese.Child.and.Teens), y = Obese.Child.and.Teens))+
geom_col(color="black",fill="#6EE543",alpha=0.8) +
coord_flip()
#Map for Obesity in Children
ggplot(states, aes(x = long, y = lat, group = group, fill = Obese.children.and.adolescents)) +
geom_polygon(color = "white") +
scale_fill_gradient(name = "Percent Obese", low = "#B8D5EC", high = "#0A4B7D", guide = "colorbar", na.value="black", breaks = pretty_breaks(n = 5)) +
labs(title="Obesity in Children and Teens", x = "Longitude",y = "latitude") +
coord_map() +
#adding States names to the states on the map
geom_text(data=statenames, aes(x = long, y = lat, label = region), size=3)
#Most Obese Children and Teens for Delaware State
#Barplot
ggplot(aes(x = reorder(region,Obese.children.and.adolescents),y = Obese.children.and.adolescents),
data = statenames) +
geom_col(color="black",fill="#F43E3E",width=1) +
coord_flip() +
labs(x = "States", y ="Percentage of Obese Children and Teens",title="Barplot of Obese Children and Teens")
The State with Highest Obese Teen and Children population is Delaware.
In this project we firstly learned to scrape data using ‘rvest’ package from wikipedia and then analyzed and visualized the States with most Obese Adult and children population.We also learned how to create beautifull maps using ‘ggplot’ and ‘maps’ packages in R.
Hope you guys likes the article.