title: “Nutrition, Physical Activity and Obesity Data Analysis using R” output: html_document ——————–
This report analyzes the Nutrition, Physical Activity and Obesity dataset using R programming. The purpose of this analysis is to inspect the dataset, clean and manipulate data, create new variables, calculate summary statistics, and visualize relationships between selected variables.
nutrition <- readxl::read_excel(
"D:/George_Brown_2/programming/a1/Nutrition.xlsx"
)
str(nutrition)
## tibble [88,629 × 33] (S3: tbl_df/tbl/data.frame)
## $ YearStart : num [1:88629] 2020 2014 2013 2013 2015 ...
## $ YearEnd : num [1:88629] 2020 2014 2013 2013 2015 ...
## $ LocationAbbr : chr [1:88629] "US" "GU" "US" "US" ...
## $ LocationDesc : chr [1:88629] "National" "Guam" "National" "National" ...
## $ Datasource : chr [1:88629] "Behavioral Risk Factor Surveillance System" "Behavioral Risk Factor Surveillance System" "Behavioral Risk Factor Surveillance System" "Behavioral Risk Factor Surveillance System" ...
## $ Class : chr [1:88629] "Physical Activity" "Obesity / Weight Status" "Obesity / Weight Status" "Obesity / Weight Status" ...
## $ Topic : chr [1:88629] "Physical Activity - Behavior" "Obesity / Weight Status" "Obesity / Weight Status" "Obesity / Weight Status" ...
## $ Question : chr [1:88629] "Percent of adults who engage in no leisure-time physical activity" "Percent of adults aged 18 years and older who have obesity" "Percent of adults aged 18 years and older who have obesity" "Percent of adults aged 18 years and older who have an overweight classification" ...
## $ Data_Value_Unit : logi [1:88629] NA NA NA NA NA NA ...
## $ Data_Value_Type : chr [1:88629] "Value" "Value" "Value" "Value" ...
## $ Data_Value : num [1:88629] 30.6 29.3 28.8 32.7 26.6 27.4 48.5 31.6 38.1 35.2 ...
## $ Data_Value_Alt : num [1:88629] 30.6 29.3 28.8 32.7 26.6 27.4 48.5 31.6 38.1 35.2 ...
## $ Data_Value_Footnote_Symbol: chr [1:88629] NA NA NA NA ...
## $ Data_Value_Footnote : chr [1:88629] NA NA NA NA ...
## $ Low_Confidence_Limit : num [1:88629] 29.4 25.7 28.1 31.9 25.6 18.6 32.3 24 32.6 30.7 ...
## $ High_Confidence_Limit : num [1:88629] 31.8 33.3 29.5 33.5 27.6 38.5 64.9 40.4 43.8 40 ...
## $ Sample_Size : num [1:88629] 31255 842 62562 60069 30904 ...
## $ Total : chr [1:88629] NA NA NA NA ...
## $ Age(years) : chr [1:88629] NA NA NA NA ...
## $ Education : chr [1:88629] NA "High school graduate" NA NA ...
## $ Gender : chr [1:88629] NA NA NA NA ...
## $ Income : chr [1:88629] NA NA "$50,000 - $74,999" "Data not reported" ...
## $ Race/Ethnicity : chr [1:88629] "Hispanic" NA NA NA ...
## $ GeoLocation : chr [1:88629] NA "(13.444304, 144.793731)" NA NA ...
## $ ClassID : chr [1:88629] "PA" "OWS" "OWS" "OWS" ...
## $ TopicID : chr [1:88629] "PA1" "OWS1" "OWS1" "OWS1" ...
## $ QuestionID : chr [1:88629] "Q047" "Q036" "Q036" "Q037" ...
## $ DataValueTypeID : chr [1:88629] "VALUE" "VALUE" "VALUE" "VALUE" ...
## $ LocationID : num [1:88629] 59 66 59 59 59 66 56 11 72 1 ...
## $ StratificationCategory1 : chr [1:88629] "Race/Ethnicity" "Education" "Income" "Income" ...
## $ Stratification1 : chr [1:88629] "Hispanic" "High school graduate" "$50,000 - $74,999" "Data not reported" ...
## $ StratificationCategoryId1 : chr [1:88629] "RACE" "EDU" "INC" "INC" ...
## $ StratificationID1 : chr [1:88629] "RACEHIS" "EDUHSGRAD" "INC5075" "INCNR" ...
names(nutrition)
## [1] "YearStart" "YearEnd"
## [3] "LocationAbbr" "LocationDesc"
## [5] "Datasource" "Class"
## [7] "Topic" "Question"
## [9] "Data_Value_Unit" "Data_Value_Type"
## [11] "Data_Value" "Data_Value_Alt"
## [13] "Data_Value_Footnote_Symbol" "Data_Value_Footnote"
## [15] "Low_Confidence_Limit" "High_Confidence_Limit"
## [17] "Sample_Size" "Total"
## [19] "Age(years)" "Education"
## [21] "Gender" "Income"
## [23] "Race/Ethnicity" "GeoLocation"
## [25] "ClassID" "TopicID"
## [27] "QuestionID" "DataValueTypeID"
## [29] "LocationID" "StratificationCategory1"
## [31] "Stratification1" "StratificationCategoryId1"
## [33] "StratificationID1"
head(nutrition, 15)
## # A tibble: 15 × 33
## YearStart YearEnd LocationAbbr LocationDesc Datasource Class Topic Question
## <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 2020 2020 US National Behaviora… Phys… Phys… Percent…
## 2 2014 2014 GU Guam Behaviora… Obes… Obes… Percent…
## 3 2013 2013 US National Behaviora… Obes… Obes… Percent…
## 4 2013 2013 US National Behaviora… Obes… Obes… Percent…
## 5 2015 2015 US National Behaviora… Phys… Phys… Percent…
## 6 2015 2015 GU Guam Behaviora… Phys… Phys… Percent…
## 7 2012 2012 WY Wyoming Behaviora… Obes… Obes… Percent…
## 8 2012 2012 DC District of C… Behaviora… Obes… Obes… Percent…
## 9 2015 2015 PR Puerto Rico Behaviora… Phys… Phys… Percent…
## 10 2011 2011 AL Alabama Behaviora… Obes… Obes… Percent…
## 11 2015 2015 GU Guam Behaviora… Phys… Phys… Percent…
## 12 2015 2015 RI Rhode Island Behaviora… Obes… Obes… Percent…
## 13 2011 2011 US National Behaviora… Obes… Obes… Percent…
## 14 2012 2012 WY Wyoming Behaviora… Phys… Phys… Percent…
## 15 2020 2020 DE Delaware Behaviora… Phys… Phys… Percent…
## # ℹ 25 more variables: Data_Value_Unit <lgl>, Data_Value_Type <chr>,
## # Data_Value <dbl>, Data_Value_Alt <dbl>, Data_Value_Footnote_Symbol <chr>,
## # Data_Value_Footnote <chr>, Low_Confidence_Limit <dbl>,
## # High_Confidence_Limit <dbl>, Sample_Size <dbl>, Total <chr>,
## # `Age(years)` <chr>, Education <chr>, Gender <chr>, Income <chr>,
## # `Race/Ethnicity` <chr>, GeoLocation <chr>, ClassID <chr>, TopicID <chr>,
## # QuestionID <chr>, DataValueTypeID <chr>, LocationID <dbl>, …
Some column names were renamed to make them easier to use in R.
nutrition_clean <- nutrition %>%
rename(
year = YearStart,
location = LocationDesc,
data_value = Data_Value,
low_ci = Low_Confidence_Limit,
high_ci = High_Confidence_Limit,
sample_size = Sample_Size,
age_group = `Age(years)`,
race_ethnicity = `Race/Ethnicity`
)
names(nutrition_clean)
## [1] "year" "YearEnd"
## [3] "LocationAbbr" "location"
## [5] "Datasource" "Class"
## [7] "Topic" "Question"
## [9] "Data_Value_Unit" "Data_Value_Type"
## [11] "data_value" "Data_Value_Alt"
## [13] "Data_Value_Footnote_Symbol" "Data_Value_Footnote"
## [15] "low_ci" "high_ci"
## [17] "sample_size" "Total"
## [19] "age_group" "Education"
## [21] "Gender" "Income"
## [23] "race_ethnicity" "GeoLocation"
## [25] "ClassID" "TopicID"
## [27] "QuestionID" "DataValueTypeID"
## [29] "LocationID" "StratificationCategory1"
## [31] "Stratification1" "StratificationCategoryId1"
## [33] "StratificationID1"
The following function classifies numerical values into low, medium, and high categories.
classify_value <- function(x) {
if (x < 25) {
return("Low")
}
if (x < 40) {
return("Medium")
}
return("High")
}
classify_value(35)
## [1] "Medium"
sum(is.na(nutrition_clean))
## [1] 737063
colSums(is.na(nutrition_clean))
## year YearEnd
## 0 0
## LocationAbbr location
## 0 0
## Datasource Class
## 0 0
## Topic Question
## 0 0
## Data_Value_Unit Data_Value_Type
## 88629 0
## data_value Data_Value_Alt
## 8778 8778
## Data_Value_Footnote_Symbol Data_Value_Footnote
## 79851 79851
## low_ci high_ci
## 8778 8778
## sample_size Total
## 8778 85464
## age_group Education
## 69639 75969
## Gender Income
## 82299 66474
## race_ethnicity GeoLocation
## 63309 1652
## ClassID TopicID
## 0 0
## QuestionID DataValueTypeID
## 0 0
## LocationID StratificationCategory1
## 0 9
## Stratification1 StratificationCategoryId1
## 9 9
## StratificationID1
## 9
Complete-case removal was tested, but it removed all rows because each record contains at least one missing value in some field.
nutrition_no_na <- na.omit(nutrition_clean)
nrow(nutrition_clean)
## [1] 88629
nrow(nutrition_no_na)
## [1] 0
For the main analysis, only rows with missing data_value
were removed.
analysis_df <- nutrition_clean %>%
filter(!is.na(data_value))
nrow(analysis_df)
## [1] 79851
sum(is.na(analysis_df$data_value))
## [1] 0
sum(duplicated(nutrition_clean))
## [1] 0
No duplicated records were identified in the dataset.
Rows with data_value greater than 50 were filtered.
high_values <- analysis_df %>%
filter(data_value > 50)
nrow(high_values)
## [1] 3812
head(high_values)
## # A tibble: 6 × 33
## year YearEnd LocationAbbr location Datasource Class Topic Question
## <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 2015 2015 PR Puerto Rico Behaviora… Phys… Phys… Percent…
## 2 2015 2015 US National Behaviora… Phys… Phys… Percent…
## 3 2017 2017 WA Washington Behaviora… Phys… Phys… Percent…
## 4 2013 2013 MN Minnesota Behaviora… Phys… Phys… Percent…
## 5 2015 2015 PR Puerto Rico Behaviora… Phys… Phys… Percent…
## 6 2019 2019 DC District of Columb… Behaviora… Phys… Phys… Percent…
## # ℹ 25 more variables: Data_Value_Unit <lgl>, Data_Value_Type <chr>,
## # data_value <dbl>, Data_Value_Alt <dbl>, Data_Value_Footnote_Symbol <chr>,
## # Data_Value_Footnote <chr>, low_ci <dbl>, high_ci <dbl>, sample_size <dbl>,
## # Total <chr>, age_group <chr>, Education <chr>, Gender <chr>, Income <chr>,
## # race_ethnicity <chr>, GeoLocation <chr>, ClassID <chr>, TopicID <chr>,
## # QuestionID <chr>, DataValueTypeID <chr>, LocationID <dbl>,
## # StratificationCategory1 <chr>, Stratification1 <chr>, …
The dependent variable selected for this analysis is
data_value.
Independent variables include year,
age_group, Education, Gender,
Income, and race_ethnicity.
model_df <- analysis_df %>%
select(
data_value,
year,
age_group,
Education,
Gender,
Income,
race_ethnicity
)
head(model_df)
## # A tibble: 6 × 7
## data_value year age_group Education Gender Income race_ethnicity
## <dbl> <dbl> <chr> <chr> <chr> <chr> <chr>
## 1 30.6 2020 <NA> <NA> <NA> <NA> Hispanic
## 2 29.3 2014 <NA> High school graduate <NA> <NA> <NA>
## 3 28.8 2013 <NA> <NA> <NA> $50,000… <NA>
## 4 32.7 2013 <NA> <NA> <NA> Data no… <NA>
## 5 26.6 2015 <NA> <NA> <NA> Less th… <NA>
## 6 27.4 2015 <NA> <NA> <NA> <NA> Hispanic
sorted_df <- analysis_df %>%
arrange(desc(data_value))
head(sorted_df)
## # A tibble: 6 × 33
## year YearEnd LocationAbbr location Datasource Class Topic Question
## <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 2015 2015 OR Oregon Behavioral Risk Fa… Phys… Phys… Percent…
## 2 2011 2011 MT Montana Behavioral Risk Fa… Phys… Phys… Percent…
## 3 2011 2011 ID Idaho Behavioral Risk Fa… Phys… Phys… Percent…
## 4 2013 2013 NM New Mexico Behavioral Risk Fa… Phys… Phys… Percent…
## 5 2019 2019 WY Wyoming Behavioral Risk Fa… Phys… Phys… Percent…
## 6 2017 2017 NM New Mexico Behavioral Risk Fa… Phys… Phys… Percent…
## # ℹ 25 more variables: Data_Value_Unit <lgl>, Data_Value_Type <chr>,
## # data_value <dbl>, Data_Value_Alt <dbl>, Data_Value_Footnote_Symbol <chr>,
## # Data_Value_Footnote <chr>, low_ci <dbl>, high_ci <dbl>, sample_size <dbl>,
## # Total <chr>, age_group <chr>, Education <chr>, Gender <chr>, Income <chr>,
## # race_ethnicity <chr>, GeoLocation <chr>, ClassID <chr>, TopicID <chr>,
## # QuestionID <chr>, DataValueTypeID <chr>, LocationID <dbl>,
## # StratificationCategory1 <chr>, Stratification1 <chr>, …
A new variable was created by multiplying data_value by
2.
analysis_df <- analysis_df %>%
mutate(data_value_double = data_value * 2)
head(
analysis_df %>%
select(data_value, data_value_double)
)
## # A tibble: 6 × 2
## data_value data_value_double
## <dbl> <dbl>
## 1 30.6 61.2
## 2 29.3 58.6
## 3 28.8 57.6
## 4 32.7 65.4
## 5 26.6 53.2
## 6 27.4 54.8
A 70/30 split was created using a random number generator.
set.seed(123)
train_index <- sample(
seq_len(nrow(analysis_df)),
size = 0.7 * nrow(analysis_df)
)
train <- analysis_df[train_index, ]
test <- analysis_df[-train_index, ]
nrow(train)
## [1] 55895
nrow(test)
## [1] 23956
summary(analysis_df)
## year YearEnd LocationAbbr location
## Min. :2011 Min. :2011 Length :79851 Length :79851
## 1st Qu.:2013 1st Qu.:2013 N.unique : 55 N.unique : 55
## Median :2016 Median :2016 N.blank : 0 N.blank : 0
## Mean :2016 Mean :2016 Min.nchar: 2 Min.nchar: 4
## 3rd Qu.:2019 3rd Qu.:2019 Max.nchar: 2 Max.nchar: 20
## Max. :2021 Max. :2021
## Datasource Class Topic Question
## Length :79851 Length :79851 Length :79851 Length :79851
## N.unique : 1 N.unique : 3 N.unique : 3 N.unique : 9
## N.blank : 0 N.blank : 0 N.blank : 0 N.blank : 0
## Min.nchar: 42 Min.nchar: 17 Min.nchar: 23 Min.nchar: 58
## Max.nchar: 42 Max.nchar: 23 Max.nchar: 32 Max.nchar: 245
##
## Data_Value_Unit Data_Value_Type data_value Data_Value_Alt
## Mode:logical Length :79851 Min. : 0.90 Min. : 0.90
## NAs :79851 N.unique : 1 1st Qu.:24.30 1st Qu.:24.30
## N.blank : 0 Median :31.20 Median :31.20
## Min.nchar: 5 Mean :31.24 Mean :31.24
## Max.nchar: 5 3rd Qu.:37.00 3rd Qu.:37.00
## Max. :77.60 Max. :77.60
## Data_Value_Footnote_Symbol Data_Value_Footnote low_ci high_ci
## Length :79851 Length :79851 Min. : 0.30 Min. : 3.00
## N.unique : 0 N.unique : 0 1st Qu.:20.00 1st Qu.:28.60
## N.blank : 0 N.blank : 0 Median :26.80 Median :35.90
## Min.nchar: NA Min.nchar: NA Mean :26.91 Mean :36.14
## Max.nchar: NA Max.nchar: NA 3rd Qu.:32.90 3rd Qu.:42.20
## NAs :79851 NAs :79851 Max. :70.20 Max. :87.70
## sample_size Total age_group Education
## Min. : 50 Length :79851 Length :79851 Length :79851
## 1st Qu.: 516 N.unique : 1 N.unique : 6 N.unique : 4
## Median : 1109 N.blank : 0 N.blank : 0 N.blank : 0
## Mean : 3657 Min.nchar: 5 Min.nchar: 7 Min.nchar: 16
## 3rd Qu.: 2408 Max.nchar: 5 Max.nchar: 11 Max.nchar: 32
## Max. :476876 NAs :76703 NAs :60963 NAs :67259
## Gender Income race_ethnicity GeoLocation
## Length :79851 Length :79851 Length :79851 Length :79851
## N.unique : 2 N.unique : 7 N.unique : 8 N.unique : 105
## N.blank : 0 N.blank : 0 N.blank : 0 N.blank : 0
## Min.nchar: 4 Min.nchar: 17 Min.nchar: 5 Min.nchar: 23
## Max.nchar: 6 Max.nchar: 18 Max.nchar: 29 Max.nchar: 41
## NAs :73555 NAs :57820 NAs :62955 NAs : 1652
## ClassID TopicID QuestionID DataValueTypeID
## Length :79851 Length :79851 Length :79851 Length :79851
## N.unique : 3 N.unique : 3 N.unique : 9 N.unique : 1
## N.blank : 0 N.blank : 0 N.blank : 0 N.blank : 0
## Min.nchar: 2 Min.nchar: 3 Min.nchar: 4 Min.nchar: 5
## Max.nchar: 3 Max.nchar: 4 Max.nchar: 4 Max.nchar: 5
##
## LocationID StratificationCategory1 Stratification1
## Min. : 1.00 Length :79851 Length :79851
## 1st Qu.:17.00 N.unique : 6 N.unique : 28
## Median :30.00 N.blank : 0 N.blank : 0
## Mean :30.59 Min.nchar: 5 Min.nchar: 4
## 3rd Qu.:45.00 Max.nchar: 14 Max.nchar: 32
## Max. :78.00
## StratificationCategoryId1 StratificationID1 data_value_double
## Length :79851 Length :79851 Min. : 1.80
## N.unique : 6 N.unique : 28 1st Qu.: 48.60
## N.blank : 0 N.blank : 0 Median : 62.40
## Min.nchar: 3 Min.nchar: 4 Mean : 62.47
## Max.nchar: 5 Max.nchar: 11 3rd Qu.: 74.00
## Max. :155.20
mean(analysis_df$data_value)
## [1] 31.23649
median(analysis_df$data_value)
## [1] 31.2
getmode <- function(v) {
uniqv <- unique(v)
uniqv[which.max(tabulate(match(v, uniqv)))]
}
getmode(analysis_df$data_value)
## [1] 35.4
range(analysis_df$data_value)
## [1] 0.9 77.6
plot(
analysis_df$year,
analysis_df$data_value,
main = "Year vs Data Value",
xlab = "Year",
ylab = "Data Value"
)
year_avg <- aggregate(
data_value ~ year,
data = analysis_df,
mean
)
barplot(
year_avg$data_value,
names.arg = year_avg$year,
main = "Average Data Value by Year",
xlab = "Year",
ylab = "Average Data Value"
)
cor(
analysis_df$year,
analysis_df$data_value,
use = "complete.obs",
method = "pearson"
)
## [1] 0.02204241
The Pearson correlation coefficient between year and data value is very close to zero, which indicates an extremely weak positive linear relationship.
This analysis used R programming to inspect, clean, manipulate,
summarize, and visualize the Nutrition, Physical Activity and Obesity
dataset. The dataset contained 88,629 original records, and 79,851
records had valid data_value observations for analysis. The
average data_value was approximately 31.24, with values
ranging from 0.9 to 77.6. The correlation between year and data value
was very weak, suggesting that there is no strong linear relationship
between these two variables in this dataset.