Assignment 14 Correction

This is a markdown for the correction of assignment 14. As the original assignment was meant do be done in STATA, I’ll try to translate the commands to R to the best of my ability. I don’t currently have a STATA license and I’m also not too proficient with it, so I’m going to show you how things would be done, in a lot of detail, in R.

First, let’s begin by initiating the libraries that will be necessary for the assignment

## this library is used by almost everyone that uses R and it introduces the notion of pipes (%>%)
library(tidyverse)
## this is used to read the excel sheet that was provided
library(readxl)
## this is used to summarize the continuous data in a more neat way than base R does
library(skimr)

Then, let’s read in the data. This data contains multiple entries of “.” to represent missing values. In order to correctly code this, we pass the argument na = "." to the function. We then use -> to assign the data that was read to the variable df.

read_excel("PPCR_2005_2006.xlsx", na = ".") -> df

Now, as you can see below, there are 19 observations in which all variables equal NA.

## this command takes the data that is stored in `df` and provides it as the 
## first argument for the `filter()` function, which, in this situation
## is evaluating if all (`if_all()` function) observations, across all the
## variables (`everything()` function) are equal to `NA` (`is.na()` function) 
df %>% filter(if_all(everything(), ~ is.na(.)))

## # A tibble: 19 × 136
##     SEQN Gender Age at Screening Adjud…¹ Race/Ethnicity - Rec…² `Marital Status`
##    <dbl>  <dbl>                    <dbl>                  <dbl> <chr>           
##  1    NA     NA                       NA                     NA <NA>            
##  2    NA     NA                       NA                     NA <NA>            
##  3    NA     NA                       NA                     NA <NA>            
##  4    NA     NA                       NA                     NA <NA>            
##  5    NA     NA                       NA                     NA <NA>            
##  6    NA     NA                       NA                     NA <NA>            
##  7    NA     NA                       NA                     NA <NA>            
##  8    NA     NA                       NA                     NA <NA>            
##  9    NA     NA                       NA                     NA <NA>            
## 10    NA     NA                       NA                     NA <NA>            
## 11    NA     NA                       NA                     NA <NA>            
## 12    NA     NA                       NA                     NA <NA>            
## 13    NA     NA                       NA                     NA <NA>            
## 14    NA     NA                       NA                     NA <NA>            
## 15    NA     NA                       NA                     NA <NA>            
## 16    NA     NA                       NA                     NA <NA>            
## 17    NA     NA                       NA                     NA <NA>            
## 18    NA     NA                       NA                     NA <NA>            
## 19    NA     NA                       NA                     NA <NA>            
## # ℹ abbreviated names: ¹`Age at Screening Adjudicated - Recode`,
## #   ²`Race/Ethnicity - Recode`
## # ℹ 131 more variables: `Education Level - Adults 20+` <chr>,
## #   `Annual Family Income` <chr>, `Family PIR` <chr>, `Energy (kcal)` <chr>,
## #   `Protein (gm)` <chr>, `Carbohydrate (gm)` <chr>, `Total sugars (gm)` <chr>,
## #   `Dietary fiber (gm)` <chr>, `Total fat (gm)` <chr>, `Vitamin C (mg)` <chr>,
## #   `Vitamin K (mcg)` <chr>, `Calcium (mg)` <chr>, `Phosphorus (mg)` <chr>, …

Let’s drop those observations.

## this command filters across all columns only those observations in which at least one value isn't NA
df %>% filter(if_any(everything(), ~!is.na(.))) -> df


## if we now repeat the same command as above, we can see that there are no more fully NA observations
df %>% filter(if_all(everything(), ~ is.na(.)))

## # A tibble: 0 × 136
## # ℹ 136 variables: SEQN <dbl>, Gender <dbl>,
## #   Age at Screening Adjudicated - Recode <dbl>, Race/Ethnicity - Recode <dbl>,
## #   Marital Status <chr>, Education Level - Adults 20+ <chr>,
## #   Annual Family Income <chr>, Family PIR <chr>, Energy (kcal) <chr>,
## #   Protein (gm) <chr>, Carbohydrate (gm) <chr>, Total sugars (gm) <chr>,
## #   Dietary fiber (gm) <chr>, Total fat (gm) <chr>, Vitamin C (mg) <chr>,
## #   Vitamin K (mcg) <chr>, Calcium (mg) <chr>, Phosphorus (mg) <chr>, …

Maria Sarquis used the command Summarize Protein mg in Stata. I will reproduce it in R now. In order to do so, I will need to convert this variable to a numeric variable, as the data was imported as chr.

Other students used different variables, I’ll reproduce the same steps for different continuous variables that were analysed by the students.

df %>% mutate(`Protein (gm)` = as.numeric(`Protein (gm)`),
          `Room Temperature (F)`= as.numeric(`Room Temperature (F)`),
          `Energy (kcal)` = as.numeric(`Energy (kcal)`),
          `Vitamin C (mg)` = as.numeric(`Vitamin C (mg)`),
          `How much sleep do you get (hours)?` = as.numeric(`How much sleep do you get (hours)?`),
          `Annual Family Income` = as.numeric(`Annual Family Income`),
          `Room Humidity (%)` = as.numeric(`Room Humidity (%)`)) -> df

Now we can use skimr::skim to evaluate these continuous variables.

df %>% select(`Protein (gm)`, `Room Temperature (F)`, `Energy (kcal)`, `Vitamin C (mg)`, `How much sleep do you get (hours)?`, `Age at Screening Adjudicated - Recode`, `Annual Family Income`, `Room Humidity (%)`) %>% skim()

Data summary
Name	Piped data
Number of rows	10348
Number of columns	8
_______________________
Column type frequency:
numeric	8
________________________
Group variables	None

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
Protein (gm)	1160	0.89	5.15	8.62	0.0	0.36	2.19	6.24	119.61	▇▁▁▁▁
Room Temperature (F)	3388	0.67	73.97	5.48	50.1	70.50	73.90	77.30	95.60	▁▂▇▃▁
Energy (kcal)	1160	0.89	150.90	188.33	0.0	10.00	113.00	202.00	3209.00	▇▁▁▁▁
Vitamin C (mg)	1160	0.89	7.48	32.15	0.0	0.00	0.00	3.40	1317.10	▇▁▁▁▁
How much sleep do you get (hours)?	2377	0.77	7.13	3.88	1.0	6.00	7.00	8.00	99.00	▇▁▁▁▁
Age at Screening Adjudicated - Recode	0	1.00	28.00	24.08	0.0	9.00	19.00	45.00	85.00	▇▃▂▂▂
Annual Family Income	110	0.99	8.73	12.59	1.0	5.00	7.00	10.00	99.00	▇▁▁▁▁
Room Humidity (%)	3387	0.67	52.27	12.47	9.1	44.40	52.50	60.80	92.50	▁▃▇▅▁

The variable How much sleep do you get (hours)? and Annual Family Income are special cases, and we need to be careful with them. Review the p100 above for these variables. Both are 99. Take a look at the documentation for each of these variables in the database (Here and here).

How much sleep do you get (hours)? Annual Family Income Codebook

We can see that, while How much sleep do you get (hours)? is a continuous variable, we need to remove the observations that have values equal to 77 and 99, as those are not the actual values for these observations.

## this checks if `How much sleep do you get (hours)?` is 77 or 99, if it is either of those values, it replaces it with NA. If it's not, it keeps the value.
df %>% mutate(`How much sleep do you get (hours)?` = case_when(
        (`How much sleep do you get (hours)?` == 77) | (`How much sleep do you get (hours)?` == 99) ~ NA,
        .default = `How much sleep do you get (hours)?`)) -> df

df %>% select(`How much sleep do you get (hours)?`) %>% skim()

Data summary
Name	Piped data
Number of rows	10348
Number of columns	1
_______________________
Column type frequency:
numeric	1
________________________
Group variables	None

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
How much sleep do you get (hours)?	2390	0.77	6.98	1.48	1	6	7	8	12	▁▂▇▆▁

With this, we added another 13 missing values to this variable, and we now do not have improper outliers that skew the data.

On the other hand, Annual Family Income is not a continuous variable, and we need to code it as a categorical variable (in R, these are called factors).

df %>% mutate(`Annual Family Income` = as_factor(`Annual Family Income`)) -> df

We can also add the labels to this factor.

df %>% mutate(`Annual Family Income` = case_when(`Annual Family Income` == 1 ~ "$ 0 to $ 4,999",
                                                  `Annual Family Income` == 2 ~ "$ 5,000 to $ 9,999",
                                                  `Annual Family Income` == 3 ~ "$10,000 to $14,999",
                                                  `Annual Family Income` == 4 ~ "$15,000 to $19,999",
                                                  `Annual Family Income` == 5 ~ "$20,000 to $24,999",
                                                  `Annual Family Income` == 6 ~ "$25,000 to $34,999",
                                                  `Annual Family Income` == 7 ~ "$35,000 to $44,999",
                                                  `Annual Family Income` == 8 ~ "$45,000 to $54,999",
                                                  `Annual Family Income` == 9 ~ "$55,000 to $64,999",
                                                  `Annual Family Income` == 10 ~ "$65,000 to $74,999",
                                                  `Annual Family Income` == 11 ~ "$75,000 and Over",
                                                  `Annual Family Income` == 12 ~ "Over $20,000",
                                                  `Annual Family Income` == 13 ~ "Under $20,000",
                                                  `Annual Family Income` == 77 ~ "Refused",
                                                  `Annual Family Income` == 99 ~ "Don't know"
                                                  )) %>% 
        mutate(`Annual Family Income` = as_factor(`Annual Family Income`)) -> df

Now, we can look at the distribution for this variable.

df %>% select(`Annual Family Income`) %>% pull() %>% summary()

## $15,000 to $19,999 $20,000 to $24,999 $65,000 to $74,999   $75,000 and Over 
##                834                856                525               2080 
##       Over $20,000 $35,000 to $44,999     $ 0 to $ 4,999 $10,000 to $14,999 
##                136               1011                347                864 
## $45,000 to $54,999 $25,000 to $34,999 $55,000 to $64,999 $ 5,000 to $ 9,999 
##                896               1338                575                488 
##      Under $20,000         Don't know            Refused               NA's 
##                 73                141                 74                110

Other categorical variables were examined by the students. Tatiana Soares looked at the Bed Surface Vacuumed, Ever told by doctor have sleep disorder? and Type of Floor Covering variables.

Let’s look at the documentation for these variables (Here, here and here)

Bed Surface Vacuumed Ever told by doctor have sleep disorder? Type of Floor Covering Documentation

Knowing what the values mean, let’s take a look at the variables themselves.

df %>% select(`Bed Surface Vacuumed`) %>% pull() %>% as_factor() %>% summary()

##    4    1    2    3    9 NA's 
##  957 5415  110  447    6 3413

df %>% select(`Ever told by doctor have sleep disorder?`) %>% pull() %>% as_factor() %>% summary()

##    2    1    9    7 NA's 
## 7477  486    6    2 2377

df %>% select(`Type of Floor Covering`) %>% pull() %>% as_factor() %>% summary()

##    1    2    3    9    4 NA's 
## 5084  706  815  144  198 3401

With this, we can see there are missing data.

To underscore the importance of cleaning data, take a look at the histogram for the How much sleep do you get (hours)? after we clean the data.

## we're going to use ggplot to create a plot in R
## there are simpler ways to plot in R, but ggplot, when understood
## is very powerful and can create beautiful, detailed and custom plots

df %>% ggplot(aes(x = `How much sleep do you get (hours)?`)) +
        geom_histogram(bins = 12)

I hope the above has been useful to you all. Remember, it always pays off to carefully examine the documentation for the database before proceeding with an analysis, as data needs to be properly understood and cleaned before it can be analysed, even if for descriptive purposes.

Assignment 14 Correction

Pedro Henrique Brant

2024-09-12