Exploratory data analysis

The objective

Come up with three research questions that you want to answer using these data. You should phrase your research questions in a way that matches up with the scope of inference your dataset allows for. Make sure that at least two of these questions involve at least three variables. You are welcome to create new variables based on existing ones. Along with each research question, include a brief discussion (1-2 sentences) as to why this question is of interest to you and/or your audience. Perform exploratory data analysis (EDA) that addresses each of the three research questions you outlined above. Your EDA should contain numerical summaries and visualizations. Each R output and plot should be accompanied by a brief interpretation.

Note that each analysis is presented in a separate document.

Setup

Loading packages

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.2.1

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.7     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1

## Warning: package 'readr' was built under R version 4.2.1

## Warning: package 'forcats' was built under R version 4.2.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(Hmisc)

## Warning: package 'Hmisc' was built under R version 4.2.1

## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## 
## Attaching package: 'Hmisc'
## 
## The following objects are masked from 'package:dplyr':
## 
##     src, summarize
## 
## The following objects are masked from 'package:base':
## 
##     format.pval, units

Loading data

brfss <- read.csv("brfss2013.csv")

Research question:

The objective

It is hypothesized (for illustration, on the basis of previous research) that individually perceived health might influence sleep pattern. The objective of the following exploratory analysis is therefore to investigate whether self-reported sleep pattern (hours of sleep) differs by self-reported general health in males and females in the BRFSS 2013 dataset.

Method:

For this evaluation, the 2013 Behavioral Risk Factor Surveillance System dataset and the Behavioral Risk Factor Surveillance System 2013 Codebook Report (Land-Line and Cell-Phone data, October 24, 2014) were used. Gender and self-reported general health (BRFSS question: “Would you say that in general your health is excellent, very good, good, fair, poor?”) are categorical variables, while self-reported sleep pattern is a numeric variable (BRFSS question: “On average, how many hours of sleep do you get in a 24-hour period?”).

First, the data are described to examine their distribution, potential outliers and missing data. The data are then cleaned for the exploratory analysis: missing values are removed (they are assumed to be randomly distributed), and only values within the interval 0-24 hours are considered relevant for the sleep pattern analysis.

Exploratory analysis: to visualize potential differences in the distribution of the sleep pattern variable, a box plot is used, with the categorical variables (gender, self-reported general health) used to disaggregate the data.

R software is used for the data description and analysis (R Core Team (2022). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/).
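The cleaning rule described above can be sketched on a small, made-up data frame. The `toy` data and the helper names (`keep`, `clean`, `n_missing`, `n_out_of_range`) are assumptions for illustration only, not part of the BRFSS data or of the analysis that follows. Base R is used here so the sketch runs without extra packages.

```r
# Hypothetical toy data standing in for the three BRFSS columns used here.
toy <- data.frame(
  sex      = c("Female", "Male", "Female", "Male", "Female"),
  genhlth  = c("Good", "Poor", "Excellent", "Fair", "Good"),
  sleptim1 = c(7, 450, NA, 6, 103)
)

# Keep rows whose sleep value is present and lies within 0-24 hours.
# is.na() is checked explicitly so that missing values are dropped on purpose.
keep  <- !is.na(toy$sleptim1) & toy$sleptim1 >= 0 & toy$sleptim1 <= 24
clean <- toy[keep, ]

# Count how many rows each rule removed, so the exclusions can be reported.
n_missing      <- sum(is.na(toy$sleptim1))
n_out_of_range <- sum(!is.na(toy$sleptim1) & !keep)
print(c(kept = nrow(clean), missing = n_missing, out_of_range = n_out_of_range))
```

Counting the two kinds of exclusion separately makes it easy to state how much data each rule discards.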

Results

Descriptive statistics

describe(brfss$sex)

## brfss$sex 
##        n  missing distinct 
##   491768        7        2 
##                         
## Value      Female   Male
## Frequency  290455 201313
## Proportion  0.591  0.409

describe(brfss$genhlth)

## brfss$genhlth 
##        n  missing distinct 
##   489790     1985        5 
## 
## lowest : Excellent Fair      Good      Poor      Very good
## highest: Excellent Fair      Good      Poor      Very good
##                                                             
## Value      Excellent      Fair      Good      Poor Very good
## Frequency      85482     66726    150555     27951    159076
## Proportion     0.175     0.136     0.307     0.057     0.325

describe(brfss$sleptim1)

## brfss$sleptim1 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##   484388     7387       27    0.939    7.052    1.512        5        5 
##      .25      .50      .75      .90      .95 
##        6        7        8        8        9 
## 
## lowest :   0   1   2   3   4, highest:  22  23  24 103 450

median(brfss$sleptim1, na.rm=TRUE)

## [1] 7

There were 491 775 respondents in total. Data on gender were available for 491 768 respondents, 59% of them female; as can be seen above, this variable contains 7 missing observations. For self-reported general health, data were available for 489 790 respondents, with 1 985 missing values. Most respondents rated their general health as “Very good” (32.5%) or “Good” (30.7%). Data on self-reported sleep pattern were available for 484 388 respondents, who reported between 0 and 450 hours of sleep a day. For the exploratory analysis (presented below), values outside 0-24 hours are excluded, since they cannot be correct answers to the question being asked. The following histogram presents the distribution of self-reported sleep pattern.

brfss %>%
  filter(sleptim1 %in% 0:24) %>%
  ggplot(aes(x = sleptim1)) +
  geom_histogram(colour = 4, fill = "lightblue", binwidth = 1) +
  theme_bw()

We can see that the distribution of the sleep pattern variable is nearly normal: symmetric, unimodal and centered around 7 hours of sleep per day.

Descriptive statistics for the self-reported sleep pattern variable (taking into account only values in the range 0-24 hours) are as follows:

brfss %>%
  filter(sleptim1 %in% 0:24) %>%
  summarise(
    mean_sleep = mean(sleptim1), median_sleep = median(sleptim1),
    iqr_sleep = IQR(sleptim1), sd_sleep = sd(sleptim1)
  )

##   mean_sleep median_sleep iqr_sleep sd_sleep
## 1   7.050986            7         2 1.465987

The median and mean are very close, which indicates a symmetric distribution and is consistent with the nearly normal shape seen in the histogram.
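Agreement between mean and median speaks to symmetry rather than to normality as such, which is why the histogram is needed as well. The sketch below uses a made-up bimodal sample (the vector `x` is hypothetical, not BRFSS data) in which the two statistics coincide although the shape is clearly not bell-like.

```r
# Hypothetical bimodal sample: half the values are 4, half are 10.
x <- c(rep(4, 50), rep(10, 50))

mean_x   <- mean(x)    # 7
median_x <- median(x)  # 7, the average of the two middle values (4 and 10)

# The two statistics agree exactly, yet no observation lies near 7.
print(c(mean = mean_x, median = median_x))
print(table(x))
```

For the real data, a normal quantile plot (for example with `qqnorm()`) would be one more direct check of the shape.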

Lastly, the exploratory analysis is presented.

Exploratory analysis

To see the individual values of the categorical variables (sex and general health evaluation):

unique(brfss$sex)

## [1] "Female" "Male"   ""

unique(brfss$genhlth)

## [1] "Fair"      "Good"      "Very good" "Excellent" "Poor"      ""
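The empty string `""` in the output above marks respondents with no recorded answer. One possible alternative to converting these values later, sketched here on a made-up three-row extract (`csv_text` and `toy` are hypothetical), is to declare blanks as missing at import time through the `na.strings` argument of `read.csv()`.

```r
# Hypothetical three-row extract; the second and third rows each contain one blank field.
csv_text <- "sex,genhlth\nFemale,Good\n,Fair\nMale,\n"

# na.strings = c("", "NA") tells read.csv to treat empty fields (and "NA") as missing.
toy <- read.csv(text = csv_text, na.strings = c("", "NA"))

# Count the missing values per column.
n_na <- colSums(is.na(toy))
print(n_na)
```

With this approach, functions such as `is.na()` and `drop_na()` see the blanks directly, without a separate conversion step.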

The following box plot shows the sleep pattern in the individual general health categories in males and females.

brfss %>%
  select(sex, sleptim1, genhlth) %>%
  filter(sleptim1 %in% 0:24) %>%
  mutate(sex = na_if(sex, "")) %>%
  mutate(genhlth = na_if(genhlth, "")) %>%
  drop_na(sex, genhlth) %>%
  ggplot(aes(genhlth, sleptim1, fill = sex)) +
  geom_boxplot() +
  facet_wrap(~sex, ncol = 2)
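A numerical summary by group can accompany the box plot. The sketch below is a minimal illustration on made-up data (`toy`, `health_levels` and `group_summary` are hypothetical names, and the values are not BRFSS results). It also orders the health categories from best to worst, which is an assumption about the preferred display order rather than something done in the analysis above.

```r
library(dplyr)

# Hypothetical toy data; the real analysis would use the cleaned brfss rows.
toy <- data.frame(
  sex      = c("Female", "Female", "Female", "Male", "Male", "Male"),
  genhlth  = c("Poor", "Poor", "Good", "Good", "Good", "Excellent"),
  sleptim1 = c(5, 7, 8, 6, 8, 7)
)

# Order the health categories from best to worst instead of alphabetically.
health_levels <- c("Excellent", "Very good", "Good", "Fair", "Poor")

# Group size, median and IQR of sleep time for each sex-by-health combination.
group_summary <- toy %>%
  mutate(genhlth = factor(genhlth, levels = health_levels)) %>%
  group_by(sex, genhlth) %>%
  summarise(
    n            = n(),
    median_sleep = median(sleptim1),
    iqr_sleep    = IQR(sleptim1),
    .groups      = "drop"
  )

print(group_summary)
```

Passing the same ordered factor to `ggplot()` would arrange the boxes from “Excellent” to “Poor” as well.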

Conclusion:

As can be seen, the self-reported sleep pattern variable appears nearly normally distributed in females rating their general health as “Very good”, “Good” or “Fair”. The distribution is right-skewed in those reporting either extreme category, “Excellent” or “Poor” general health. In the “Poor” general health category, the median is lower than in the other general health categories. In males, the distribution is symmetric in all general health categories except “Poor”, and the median is the same in all general health categories.

Data Analysis Project; Introduction to Probability and Data with R

Petra Matoulková

2022-08-07
