title: “Nutrition, Physical Activity and Obesity Data Analysis using R” output: html_document ——————–

Introduction

This report analyzes the Nutrition, Physical Activity and Obesity dataset using R programming. The purpose of this analysis is to inspect the dataset, clean and manipulate data, create new variables, calculate summary statistics, and visualize relationships between selected variables.

Load Dataset

nutrition <- readxl::read_excel(
  "D:/George_Brown_2/programming/a1/Nutrition.xlsx"
)

Dataset Structure

str(nutrition)

## tibble [88,629 × 33] (S3: tbl_df/tbl/data.frame)
##  $ YearStart                 : num [1:88629] 2020 2014 2013 2013 2015 ...
##  $ YearEnd                   : num [1:88629] 2020 2014 2013 2013 2015 ...
##  $ LocationAbbr              : chr [1:88629] "US" "GU" "US" "US" ...
##  $ LocationDesc              : chr [1:88629] "National" "Guam" "National" "National" ...
##  $ Datasource                : chr [1:88629] "Behavioral Risk Factor Surveillance System" "Behavioral Risk Factor Surveillance System" "Behavioral Risk Factor Surveillance System" "Behavioral Risk Factor Surveillance System" ...
##  $ Class                     : chr [1:88629] "Physical Activity" "Obesity / Weight Status" "Obesity / Weight Status" "Obesity / Weight Status" ...
##  $ Topic                     : chr [1:88629] "Physical Activity - Behavior" "Obesity / Weight Status" "Obesity / Weight Status" "Obesity / Weight Status" ...
##  $ Question                  : chr [1:88629] "Percent of adults who engage in no leisure-time physical activity" "Percent of adults aged 18 years and older who have obesity" "Percent of adults aged 18 years and older who have obesity" "Percent of adults aged 18 years and older who have an overweight classification" ...
##  $ Data_Value_Unit           : logi [1:88629] NA NA NA NA NA NA ...
##  $ Data_Value_Type           : chr [1:88629] "Value" "Value" "Value" "Value" ...
##  $ Data_Value                : num [1:88629] 30.6 29.3 28.8 32.7 26.6 27.4 48.5 31.6 38.1 35.2 ...
##  $ Data_Value_Alt            : num [1:88629] 30.6 29.3 28.8 32.7 26.6 27.4 48.5 31.6 38.1 35.2 ...
##  $ Data_Value_Footnote_Symbol: chr [1:88629] NA NA NA NA ...
##  $ Data_Value_Footnote       : chr [1:88629] NA NA NA NA ...
##  $ Low_Confidence_Limit      : num [1:88629] 29.4 25.7 28.1 31.9 25.6 18.6 32.3 24 32.6 30.7 ...
##  $ High_Confidence_Limit     : num [1:88629] 31.8 33.3 29.5 33.5 27.6 38.5 64.9 40.4 43.8 40 ...
##  $ Sample_Size               : num [1:88629] 31255 842 62562 60069 30904 ...
##  $ Total                     : chr [1:88629] NA NA NA NA ...
##  $ Age(years)                : chr [1:88629] NA NA NA NA ...
##  $ Education                 : chr [1:88629] NA "High school graduate" NA NA ...
##  $ Gender                    : chr [1:88629] NA NA NA NA ...
##  $ Income                    : chr [1:88629] NA NA "$50,000 - $74,999" "Data not reported" ...
##  $ Race/Ethnicity            : chr [1:88629] "Hispanic" NA NA NA ...
##  $ GeoLocation               : chr [1:88629] NA "(13.444304, 144.793731)" NA NA ...
##  $ ClassID                   : chr [1:88629] "PA" "OWS" "OWS" "OWS" ...
##  $ TopicID                   : chr [1:88629] "PA1" "OWS1" "OWS1" "OWS1" ...
##  $ QuestionID                : chr [1:88629] "Q047" "Q036" "Q036" "Q037" ...
##  $ DataValueTypeID           : chr [1:88629] "VALUE" "VALUE" "VALUE" "VALUE" ...
##  $ LocationID                : num [1:88629] 59 66 59 59 59 66 56 11 72 1 ...
##  $ StratificationCategory1   : chr [1:88629] "Race/Ethnicity" "Education" "Income" "Income" ...
##  $ Stratification1           : chr [1:88629] "Hispanic" "High school graduate" "$50,000 - $74,999" "Data not reported" ...
##  $ StratificationCategoryId1 : chr [1:88629] "RACE" "EDU" "INC" "INC" ...
##  $ StratificationID1         : chr [1:88629] "RACEHIS" "EDUHSGRAD" "INC5075" "INCNR" ...

Variables in Dataset

names(nutrition)

##  [1] "YearStart"                  "YearEnd"                   
##  [3] "LocationAbbr"               "LocationDesc"              
##  [5] "Datasource"                 "Class"                     
##  [7] "Topic"                      "Question"                  
##  [9] "Data_Value_Unit"            "Data_Value_Type"           
## [11] "Data_Value"                 "Data_Value_Alt"            
## [13] "Data_Value_Footnote_Symbol" "Data_Value_Footnote"       
## [15] "Low_Confidence_Limit"       "High_Confidence_Limit"     
## [17] "Sample_Size"                "Total"                     
## [19] "Age(years)"                 "Education"                 
## [21] "Gender"                     "Income"                    
## [23] "Race/Ethnicity"             "GeoLocation"               
## [25] "ClassID"                    "TopicID"                   
## [27] "QuestionID"                 "DataValueTypeID"           
## [29] "LocationID"                 "StratificationCategory1"   
## [31] "Stratification1"            "StratificationCategoryId1" 
## [33] "StratificationID1"

Top 15 Rows

head(nutrition, 15)

## # A tibble: 15 × 33
##    YearStart YearEnd LocationAbbr LocationDesc   Datasource Class Topic Question
##        <dbl>   <dbl> <chr>        <chr>          <chr>      <chr> <chr> <chr>   
##  1      2020    2020 US           National       Behaviora… Phys… Phys… Percent…
##  2      2014    2014 GU           Guam           Behaviora… Obes… Obes… Percent…
##  3      2013    2013 US           National       Behaviora… Obes… Obes… Percent…
##  4      2013    2013 US           National       Behaviora… Obes… Obes… Percent…
##  5      2015    2015 US           National       Behaviora… Phys… Phys… Percent…
##  6      2015    2015 GU           Guam           Behaviora… Phys… Phys… Percent…
##  7      2012    2012 WY           Wyoming        Behaviora… Obes… Obes… Percent…
##  8      2012    2012 DC           District of C… Behaviora… Obes… Obes… Percent…
##  9      2015    2015 PR           Puerto Rico    Behaviora… Phys… Phys… Percent…
## 10      2011    2011 AL           Alabama        Behaviora… Obes… Obes… Percent…
## 11      2015    2015 GU           Guam           Behaviora… Phys… Phys… Percent…
## 12      2015    2015 RI           Rhode Island   Behaviora… Obes… Obes… Percent…
## 13      2011    2011 US           National       Behaviora… Obes… Obes… Percent…
## 14      2012    2012 WY           Wyoming        Behaviora… Phys… Phys… Percent…
## 15      2020    2020 DE           Delaware       Behaviora… Phys… Phys… Percent…
## # ℹ 25 more variables: Data_Value_Unit <lgl>, Data_Value_Type <chr>,
## #   Data_Value <dbl>, Data_Value_Alt <dbl>, Data_Value_Footnote_Symbol <chr>,
## #   Data_Value_Footnote <chr>, Low_Confidence_Limit <dbl>,
## #   High_Confidence_Limit <dbl>, Sample_Size <dbl>, Total <chr>,
## #   `Age(years)` <chr>, Education <chr>, Gender <chr>, Income <chr>,
## #   `Race/Ethnicity` <chr>, GeoLocation <chr>, ClassID <chr>, TopicID <chr>,
## #   QuestionID <chr>, DataValueTypeID <chr>, LocationID <dbl>, …

Rename Columns

Some column names were renamed to make them easier to use in R.

nutrition_clean <- nutrition %>%
  rename(
    year = YearStart,
    location = LocationDesc,
    data_value = Data_Value,
    low_ci = Low_Confidence_Limit,
    high_ci = High_Confidence_Limit,
    sample_size = Sample_Size,
    age_group = `Age(years)`,
    race_ethnicity = `Race/Ethnicity`
  )

names(nutrition_clean)

##  [1] "year"                       "YearEnd"                   
##  [3] "LocationAbbr"               "location"                  
##  [5] "Datasource"                 "Class"                     
##  [7] "Topic"                      "Question"                  
##  [9] "Data_Value_Unit"            "Data_Value_Type"           
## [11] "data_value"                 "Data_Value_Alt"            
## [13] "Data_Value_Footnote_Symbol" "Data_Value_Footnote"       
## [15] "low_ci"                     "high_ci"                   
## [17] "sample_size"                "Total"                     
## [19] "age_group"                  "Education"                 
## [21] "Gender"                     "Income"                    
## [23] "race_ethnicity"             "GeoLocation"               
## [25] "ClassID"                    "TopicID"                   
## [27] "QuestionID"                 "DataValueTypeID"           
## [29] "LocationID"                 "StratificationCategory1"   
## [31] "Stratification1"            "StratificationCategoryId1" 
## [33] "StratificationID1"

User Defined Function

The following function classifies numerical values into low, medium, and high categories.

classify_value <- function(x) {
  if (x < 25) {
    return("Low")
  }
  
  if (x < 40) {
    return("Medium")
  }
  
  return("High")
}

classify_value(35)

## [1] "Medium"

Missing Values

sum(is.na(nutrition_clean))

## [1] 737063

colSums(is.na(nutrition_clean))

##                       year                    YearEnd 
##                          0                          0 
##               LocationAbbr                   location 
##                          0                          0 
##                 Datasource                      Class 
##                          0                          0 
##                      Topic                   Question 
##                          0                          0 
##            Data_Value_Unit            Data_Value_Type 
##                      88629                          0 
##                 data_value             Data_Value_Alt 
##                       8778                       8778 
## Data_Value_Footnote_Symbol        Data_Value_Footnote 
##                      79851                      79851 
##                     low_ci                    high_ci 
##                       8778                       8778 
##                sample_size                      Total 
##                       8778                      85464 
##                  age_group                  Education 
##                      69639                      75969 
##                     Gender                     Income 
##                      82299                      66474 
##             race_ethnicity                GeoLocation 
##                      63309                       1652 
##                    ClassID                    TopicID 
##                          0                          0 
##                 QuestionID            DataValueTypeID 
##                          0                          0 
##                 LocationID    StratificationCategory1 
##                          0                          9 
##            Stratification1  StratificationCategoryId1 
##                          9                          9 
##          StratificationID1 
##                          9

Complete-case removal was tested, but it removed all rows because each record contains at least one missing value in some field.

nutrition_no_na <- na.omit(nutrition_clean)

nrow(nutrition_clean)

## [1] 88629

nrow(nutrition_no_na)

## [1] 0

For the main analysis, only rows with missing data_value were removed.

analysis_df <- nutrition_clean %>%
  filter(!is.na(data_value))

nrow(analysis_df)

## [1] 79851

sum(is.na(analysis_df$data_value))

## [1] 0

Duplicated Data

sum(duplicated(nutrition_clean))

## [1] 0

No duplicated records were identified in the dataset.

Filter Rows

Rows with data_value greater than 50 were filtered.

high_values <- analysis_df %>%
  filter(data_value > 50)

nrow(high_values)

## [1] 3812

head(high_values)

## # A tibble: 6 × 33
##    year YearEnd LocationAbbr location            Datasource Class Topic Question
##   <dbl>   <dbl> <chr>        <chr>               <chr>      <chr> <chr> <chr>   
## 1  2015    2015 PR           Puerto Rico         Behaviora… Phys… Phys… Percent…
## 2  2015    2015 US           National            Behaviora… Phys… Phys… Percent…
## 3  2017    2017 WA           Washington          Behaviora… Phys… Phys… Percent…
## 4  2013    2013 MN           Minnesota           Behaviora… Phys… Phys… Percent…
## 5  2015    2015 PR           Puerto Rico         Behaviora… Phys… Phys… Percent…
## 6  2019    2019 DC           District of Columb… Behaviora… Phys… Phys… Percent…
## # ℹ 25 more variables: Data_Value_Unit <lgl>, Data_Value_Type <chr>,
## #   data_value <dbl>, Data_Value_Alt <dbl>, Data_Value_Footnote_Symbol <chr>,
## #   Data_Value_Footnote <chr>, low_ci <dbl>, high_ci <dbl>, sample_size <dbl>,
## #   Total <chr>, age_group <chr>, Education <chr>, Gender <chr>, Income <chr>,
## #   race_ethnicity <chr>, GeoLocation <chr>, ClassID <chr>, TopicID <chr>,
## #   QuestionID <chr>, DataValueTypeID <chr>, LocationID <dbl>,
## #   StratificationCategory1 <chr>, Stratification1 <chr>, …

Dependent and Independent Variables

The dependent variable selected for this analysis is data_value.

Independent variables include year, age_group, Education, Gender, Income, and race_ethnicity.

model_df <- analysis_df %>%
  select(
    data_value,
    year,
    age_group,
    Education,
    Gender,
    Income,
    race_ethnicity
  )

head(model_df)

## # A tibble: 6 × 7
##   data_value  year age_group Education            Gender Income   race_ethnicity
##        <dbl> <dbl> <chr>     <chr>                <chr>  <chr>    <chr>         
## 1       30.6  2020 <NA>      <NA>                 <NA>   <NA>     Hispanic      
## 2       29.3  2014 <NA>      High school graduate <NA>   <NA>     <NA>          
## 3       28.8  2013 <NA>      <NA>                 <NA>   $50,000… <NA>          
## 4       32.7  2013 <NA>      <NA>                 <NA>   Data no… <NA>          
## 5       26.6  2015 <NA>      <NA>                 <NA>   Less th… <NA>          
## 6       27.4  2015 <NA>      <NA>                 <NA>   <NA>     Hispanic

Reorder Rows in Descending Order

sorted_df <- analysis_df %>%
  arrange(desc(data_value))

head(sorted_df)

## # A tibble: 6 × 33
##    year YearEnd LocationAbbr location   Datasource          Class Topic Question
##   <dbl>   <dbl> <chr>        <chr>      <chr>               <chr> <chr> <chr>   
## 1  2015    2015 OR           Oregon     Behavioral Risk Fa… Phys… Phys… Percent…
## 2  2011    2011 MT           Montana    Behavioral Risk Fa… Phys… Phys… Percent…
## 3  2011    2011 ID           Idaho      Behavioral Risk Fa… Phys… Phys… Percent…
## 4  2013    2013 NM           New Mexico Behavioral Risk Fa… Phys… Phys… Percent…
## 5  2019    2019 WY           Wyoming    Behavioral Risk Fa… Phys… Phys… Percent…
## 6  2017    2017 NM           New Mexico Behavioral Risk Fa… Phys… Phys… Percent…
## # ℹ 25 more variables: Data_Value_Unit <lgl>, Data_Value_Type <chr>,
## #   data_value <dbl>, Data_Value_Alt <dbl>, Data_Value_Footnote_Symbol <chr>,
## #   Data_Value_Footnote <chr>, low_ci <dbl>, high_ci <dbl>, sample_size <dbl>,
## #   Total <chr>, age_group <chr>, Education <chr>, Gender <chr>, Income <chr>,
## #   race_ethnicity <chr>, GeoLocation <chr>, ClassID <chr>, TopicID <chr>,
## #   QuestionID <chr>, DataValueTypeID <chr>, LocationID <dbl>,
## #   StratificationCategory1 <chr>, Stratification1 <chr>, …

Add New Variable

A new variable was created by multiplying data_value by 2.

analysis_df <- analysis_df %>%
  mutate(data_value_double = data_value * 2)

head(
  analysis_df %>%
    select(data_value, data_value_double)
)

## # A tibble: 6 × 2
##   data_value data_value_double
##        <dbl>             <dbl>
## 1       30.6              61.2
## 2       29.3              58.6
## 3       28.8              57.6
## 4       32.7              65.4
## 5       26.6              53.2
## 6       27.4              54.8

Training and Testing Sets

A 70/30 split was created using a random number generator.

set.seed(123)

train_index <- sample(
  seq_len(nrow(analysis_df)),
  size = 0.7 * nrow(analysis_df)
)

train <- analysis_df[train_index, ]
test <- analysis_df[-train_index, ]

nrow(train)

## [1] 55895

nrow(test)

## [1] 23956

Summary Statistics

summary(analysis_df)

##       year         YearEnd        LocationAbbr        location    
##  Min.   :2011   Min.   :2011   Length   :79851   Length   :79851  
##  1st Qu.:2013   1st Qu.:2013   N.unique :   55   N.unique :   55  
##  Median :2016   Median :2016   N.blank  :    0   N.blank  :    0  
##  Mean   :2016   Mean   :2016   Min.nchar:    2   Min.nchar:    4  
##  3rd Qu.:2019   3rd Qu.:2019   Max.nchar:    2   Max.nchar:   20  
##  Max.   :2021   Max.   :2021                                      
##      Datasource          Class             Topic            Question    
##  Length   :79851   Length   :79851   Length   :79851   Length   :79851  
##  N.unique :    1   N.unique :    3   N.unique :    3   N.unique :    9  
##  N.blank  :    0   N.blank  :    0   N.blank  :    0   N.blank  :    0  
##  Min.nchar:   42   Min.nchar:   17   Min.nchar:   23   Min.nchar:   58  
##  Max.nchar:   42   Max.nchar:   23   Max.nchar:   32   Max.nchar:  245  
##                                                                         
##  Data_Value_Unit  Data_Value_Type    data_value    Data_Value_Alt 
##  Mode:logical    Length   :79851   Min.   : 0.90   Min.   : 0.90  
##  NAs :79851      N.unique :    1   1st Qu.:24.30   1st Qu.:24.30  
##                  N.blank  :    0   Median :31.20   Median :31.20  
##                  Min.nchar:    5   Mean   :31.24   Mean   :31.24  
##                  Max.nchar:    5   3rd Qu.:37.00   3rd Qu.:37.00  
##                                    Max.   :77.60   Max.   :77.60  
##  Data_Value_Footnote_Symbol Data_Value_Footnote     low_ci         high_ci     
##  Length   :79851            Length   :79851     Min.   : 0.30   Min.   : 3.00  
##  N.unique :    0            N.unique :    0     1st Qu.:20.00   1st Qu.:28.60  
##  N.blank  :    0            N.blank  :    0     Median :26.80   Median :35.90  
##  Min.nchar:   NA            Min.nchar:   NA     Mean   :26.91   Mean   :36.14  
##  Max.nchar:   NA            Max.nchar:   NA     3rd Qu.:32.90   3rd Qu.:42.20  
##  NAs      :79851            NAs      :79851     Max.   :70.20   Max.   :87.70  
##   sample_size           Total           age_group         Education    
##  Min.   :    50   Length   :79851   Length   :79851   Length   :79851  
##  1st Qu.:   516   N.unique :    1   N.unique :    6   N.unique :    4  
##  Median :  1109   N.blank  :    0   N.blank  :    0   N.blank  :    0  
##  Mean   :  3657   Min.nchar:    5   Min.nchar:    7   Min.nchar:   16  
##  3rd Qu.:  2408   Max.nchar:    5   Max.nchar:   11   Max.nchar:   32  
##  Max.   :476876   NAs      :76703   NAs      :60963   NAs      :67259  
##        Gender            Income        race_ethnicity     GeoLocation   
##  Length   :79851   Length   :79851   Length   :79851   Length   :79851  
##  N.unique :    2   N.unique :    7   N.unique :    8   N.unique :  105  
##  N.blank  :    0   N.blank  :    0   N.blank  :    0   N.blank  :    0  
##  Min.nchar:    4   Min.nchar:   17   Min.nchar:    5   Min.nchar:   23  
##  Max.nchar:    6   Max.nchar:   18   Max.nchar:   29   Max.nchar:   41  
##  NAs      :73555   NAs      :57820   NAs      :62955   NAs      : 1652  
##       ClassID           TopicID          QuestionID     DataValueTypeID 
##  Length   :79851   Length   :79851   Length   :79851   Length   :79851  
##  N.unique :    3   N.unique :    3   N.unique :    9   N.unique :    1  
##  N.blank  :    0   N.blank  :    0   N.blank  :    0   N.blank  :    0  
##  Min.nchar:    2   Min.nchar:    3   Min.nchar:    4   Min.nchar:    5  
##  Max.nchar:    3   Max.nchar:    4   Max.nchar:    4   Max.nchar:    5  
##                                                                         
##    LocationID    StratificationCategory1  Stratification1 
##  Min.   : 1.00   Length   :79851         Length   :79851  
##  1st Qu.:17.00   N.unique :    6         N.unique :   28  
##  Median :30.00   N.blank  :    0         N.blank  :    0  
##  Mean   :30.59   Min.nchar:    5         Min.nchar:    4  
##  3rd Qu.:45.00   Max.nchar:   14         Max.nchar:   32  
##  Max.   :78.00                                            
##  StratificationCategoryId1 StratificationID1 data_value_double
##  Length   :79851           Length   :79851   Min.   :  1.80   
##  N.unique :    6           N.unique :   28   1st Qu.: 48.60   
##  N.blank  :    0           N.blank  :    0   Median : 62.40   
##  Min.nchar:    3           Min.nchar:    4   Mean   : 62.47   
##  Max.nchar:    5           Max.nchar:   11   3rd Qu.: 74.00   
##                                              Max.   :155.20

Mean, Median, Mode and Range

mean(analysis_df$data_value)

## [1] 31.23649

median(analysis_df$data_value)

## [1] 31.2

getmode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}

getmode(analysis_df$data_value)

## [1] 35.4

range(analysis_df$data_value)

## [1]  0.9 77.6

Scatter Plot

plot(
  analysis_df$year,
  analysis_df$data_value,
  main = "Year vs Data Value",
  xlab = "Year",
  ylab = "Data Value"
)

Bar Plot

year_avg <- aggregate(
  data_value ~ year,
  data = analysis_df,
  mean
)

barplot(
  year_avg$data_value,
  names.arg = year_avg$year,
  main = "Average Data Value by Year",
  xlab = "Year",
  ylab = "Average Data Value"
)

Pearson Correlation

cor(
  analysis_df$year,
  analysis_df$data_value,
  use = "complete.obs",
  method = "pearson"
)

## [1] 0.02204241

The Pearson correlation coefficient between year and data value is very close to zero, which indicates an extremely weak positive linear relationship.

Conclusion

This analysis used R programming to inspect, clean, manipulate, summarize, and visualize the Nutrition, Physical Activity and Obesity dataset. The dataset contained 88,629 original records, and 79,851 records had valid data_value observations for analysis. The average data_value was approximately 31.24, with values ranging from 0.9 to 77.6. The correlation between year and data value was very weak, suggesting that there is no strong linear relationship between these two variables in this dataset.