This is a markdown for the correction of assignment 14. As the original assignment was meant do be done in STATA, I’ll try to translate the commands to R to the best of my ability. I don’t currently have a STATA license and I’m also not too proficient with it, so I’m going to show you how things would be done, in a lot of detail, in R.
First, let’s begin by initiating the libraries that will be necessary for the assignment
## this library is used by almost everyone that uses R and it introduces the notion of pipes (%>%)
library(tidyverse)
## this is used to read the excel sheet that was provided
library(readxl)
## this is used to summarize the continuous data in a more neat way than base R does
library(skimr)
Then, let’s read in the data. This data contains multiple entries of
“.” to represent missing values. In order to correctly code this, we
pass the argument na = "." to the function. We then use
-> to assign the data that was read to the variable
df.
read_excel("PPCR_2005_2006.xlsx", na = ".") -> df
Now, as you can see below, there are 19 observations in which all
variables equal NA.
## this command takes the data that is stored in `df` and provides it as the
## first argument for the `filter()` function, which, in this situation
## is evaluating if all (`if_all()` function) observations, across all the
## variables (`everything()` function) are equal to `NA` (`is.na()` function)
df %>% filter(if_all(everything(), ~ is.na(.)))
## # A tibble: 19 × 136
## SEQN Gender Age at Screening Adjud…¹ Race/Ethnicity - Rec…² `Marital Status`
## <dbl> <dbl> <dbl> <dbl> <chr>
## 1 NA NA NA NA <NA>
## 2 NA NA NA NA <NA>
## 3 NA NA NA NA <NA>
## 4 NA NA NA NA <NA>
## 5 NA NA NA NA <NA>
## 6 NA NA NA NA <NA>
## 7 NA NA NA NA <NA>
## 8 NA NA NA NA <NA>
## 9 NA NA NA NA <NA>
## 10 NA NA NA NA <NA>
## 11 NA NA NA NA <NA>
## 12 NA NA NA NA <NA>
## 13 NA NA NA NA <NA>
## 14 NA NA NA NA <NA>
## 15 NA NA NA NA <NA>
## 16 NA NA NA NA <NA>
## 17 NA NA NA NA <NA>
## 18 NA NA NA NA <NA>
## 19 NA NA NA NA <NA>
## # ℹ abbreviated names: ¹`Age at Screening Adjudicated - Recode`,
## # ²`Race/Ethnicity - Recode`
## # ℹ 131 more variables: `Education Level - Adults 20+` <chr>,
## # `Annual Family Income` <chr>, `Family PIR` <chr>, `Energy (kcal)` <chr>,
## # `Protein (gm)` <chr>, `Carbohydrate (gm)` <chr>, `Total sugars (gm)` <chr>,
## # `Dietary fiber (gm)` <chr>, `Total fat (gm)` <chr>, `Vitamin C (mg)` <chr>,
## # `Vitamin K (mcg)` <chr>, `Calcium (mg)` <chr>, `Phosphorus (mg)` <chr>, …
Let’s drop those observations.
## this command filters across all columns only those observations in which at least one value isn't NA
df %>% filter(if_any(everything(), ~!is.na(.))) -> df
## if we now repeat the same command as above, we can see that there are no more fully NA observations
df %>% filter(if_all(everything(), ~ is.na(.)))
## # A tibble: 0 × 136
## # ℹ 136 variables: SEQN <dbl>, Gender <dbl>,
## # Age at Screening Adjudicated - Recode <dbl>, Race/Ethnicity - Recode <dbl>,
## # Marital Status <chr>, Education Level - Adults 20+ <chr>,
## # Annual Family Income <chr>, Family PIR <chr>, Energy (kcal) <chr>,
## # Protein (gm) <chr>, Carbohydrate (gm) <chr>, Total sugars (gm) <chr>,
## # Dietary fiber (gm) <chr>, Total fat (gm) <chr>, Vitamin C (mg) <chr>,
## # Vitamin K (mcg) <chr>, Calcium (mg) <chr>, Phosphorus (mg) <chr>, …
Maria Sarquis used the command Summarize Protein mg in
Stata. I will reproduce it in R now. In order to do so, I will need to
convert this variable to a numeric variable, as the data was imported as
chr.
Other students used different variables, I’ll reproduce the same steps for different continuous variables that were analysed by the students.
df %>% mutate(`Protein (gm)` = as.numeric(`Protein (gm)`),
`Room Temperature (F)`= as.numeric(`Room Temperature (F)`),
`Energy (kcal)` = as.numeric(`Energy (kcal)`),
`Vitamin C (mg)` = as.numeric(`Vitamin C (mg)`),
`How much sleep do you get (hours)?` = as.numeric(`How much sleep do you get (hours)?`),
`Annual Family Income` = as.numeric(`Annual Family Income`),
`Room Humidity (%)` = as.numeric(`Room Humidity (%)`)) -> df
Now we can use skimr::skim to evaluate these continuous
variables.
df %>% select(`Protein (gm)`, `Room Temperature (F)`, `Energy (kcal)`, `Vitamin C (mg)`, `How much sleep do you get (hours)?`, `Age at Screening Adjudicated - Recode`, `Annual Family Income`, `Room Humidity (%)`) %>% skim()
| Name | Piped data |
| Number of rows | 10348 |
| Number of columns | 8 |
| _______________________ | |
| Column type frequency: | |
| numeric | 8 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Protein (gm) | 1160 | 0.89 | 5.15 | 8.62 | 0.0 | 0.36 | 2.19 | 6.24 | 119.61 | ▇▁▁▁▁ |
| Room Temperature (F) | 3388 | 0.67 | 73.97 | 5.48 | 50.1 | 70.50 | 73.90 | 77.30 | 95.60 | ▁▂▇▃▁ |
| Energy (kcal) | 1160 | 0.89 | 150.90 | 188.33 | 0.0 | 10.00 | 113.00 | 202.00 | 3209.00 | ▇▁▁▁▁ |
| Vitamin C (mg) | 1160 | 0.89 | 7.48 | 32.15 | 0.0 | 0.00 | 0.00 | 3.40 | 1317.10 | ▇▁▁▁▁ |
| How much sleep do you get (hours)? | 2377 | 0.77 | 7.13 | 3.88 | 1.0 | 6.00 | 7.00 | 8.00 | 99.00 | ▇▁▁▁▁ |
| Age at Screening Adjudicated - Recode | 0 | 1.00 | 28.00 | 24.08 | 0.0 | 9.00 | 19.00 | 45.00 | 85.00 | ▇▃▂▂▂ |
| Annual Family Income | 110 | 0.99 | 8.73 | 12.59 | 1.0 | 5.00 | 7.00 | 10.00 | 99.00 | ▇▁▁▁▁ |
| Room Humidity (%) | 3387 | 0.67 | 52.27 | 12.47 | 9.1 | 44.40 | 52.50 | 60.80 | 92.50 | ▁▃▇▅▁ |
The variable How much sleep do you get (hours)? and
Annual Family Income are special cases, and we need to be
careful with them. Review the p100 above for these
variables. Both are 99. Take a look at the documentation
for each of these variables in the database (Here and
here).
We can see that, while
How much sleep do you get (hours)? is a continuous
variable, we need to remove the observations that have values equal to
77 and 99, as those are not the actual values
for these observations.
## this checks if `How much sleep do you get (hours)?` is 77 or 99, if it is either of those values, it replaces it with NA. If it's not, it keeps the value.
df %>% mutate(`How much sleep do you get (hours)?` = case_when(
(`How much sleep do you get (hours)?` == 77) | (`How much sleep do you get (hours)?` == 99) ~ NA,
.default = `How much sleep do you get (hours)?`)) -> df
df %>% select(`How much sleep do you get (hours)?`) %>% skim()
| Name | Piped data |
| Number of rows | 10348 |
| Number of columns | 1 |
| _______________________ | |
| Column type frequency: | |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| How much sleep do you get (hours)? | 2390 | 0.77 | 6.98 | 1.48 | 1 | 6 | 7 | 8 | 12 | ▁▂▇▆▁ |
With this, we added another 13 missing values to this variable, and we now do not have improper outliers that skew the data.
On the other hand, Annual Family Income is not
a continuous variable, and we need to code it as a categorical variable
(in R, these are called factors).
df %>% mutate(`Annual Family Income` = as_factor(`Annual Family Income`)) -> df
We can also add the labels to this factor.
df %>% mutate(`Annual Family Income` = case_when(`Annual Family Income` == 1 ~ "$ 0 to $ 4,999",
`Annual Family Income` == 2 ~ "$ 5,000 to $ 9,999",
`Annual Family Income` == 3 ~ "$10,000 to $14,999",
`Annual Family Income` == 4 ~ "$15,000 to $19,999",
`Annual Family Income` == 5 ~ "$20,000 to $24,999",
`Annual Family Income` == 6 ~ "$25,000 to $34,999",
`Annual Family Income` == 7 ~ "$35,000 to $44,999",
`Annual Family Income` == 8 ~ "$45,000 to $54,999",
`Annual Family Income` == 9 ~ "$55,000 to $64,999",
`Annual Family Income` == 10 ~ "$65,000 to $74,999",
`Annual Family Income` == 11 ~ "$75,000 and Over",
`Annual Family Income` == 12 ~ "Over $20,000",
`Annual Family Income` == 13 ~ "Under $20,000",
`Annual Family Income` == 77 ~ "Refused",
`Annual Family Income` == 99 ~ "Don't know"
)) %>%
mutate(`Annual Family Income` = as_factor(`Annual Family Income`)) -> df
Now, we can look at the distribution for this variable.
df %>% select(`Annual Family Income`) %>% pull() %>% summary()
## $15,000 to $19,999 $20,000 to $24,999 $65,000 to $74,999 $75,000 and Over
## 834 856 525 2080
## Over $20,000 $35,000 to $44,999 $ 0 to $ 4,999 $10,000 to $14,999
## 136 1011 347 864
## $45,000 to $54,999 $25,000 to $34,999 $55,000 to $64,999 $ 5,000 to $ 9,999
## 896 1338 575 488
## Under $20,000 Don't know Refused NA's
## 73 141 74 110
Other categorical variables were examined by the students. Tatiana
Soares looked at the Bed Surface Vacuumed,
Ever told by doctor have sleep disorder? and
Type of Floor Covering variables.
Let’s look at the documentation for these variables (Here, here and here)
Knowing what the values mean, let’s take a look at the variables themselves.
df %>% select(`Bed Surface Vacuumed`) %>% pull() %>% as_factor() %>% summary()
## 4 1 2 3 9 NA's
## 957 5415 110 447 6 3413
df %>% select(`Ever told by doctor have sleep disorder?`) %>% pull() %>% as_factor() %>% summary()
## 2 1 9 7 NA's
## 7477 486 6 2 2377
df %>% select(`Type of Floor Covering`) %>% pull() %>% as_factor() %>% summary()
## 1 2 3 9 4 NA's
## 5084 706 815 144 198 3401
With this, we can see there are missing data.
To underscore the importance of cleaning data, take a look at the
histogram for the How much sleep do you get (hours)? after
we clean the data.
## we're going to use ggplot to create a plot in R
## there are simpler ways to plot in R, but ggplot, when understood
## is very powerful and can create beautiful, detailed and custom plots
df %>% ggplot(aes(x = `How much sleep do you get (hours)?`)) +
geom_histogram(bins = 12)
I hope the above has been useful to you all. Remember, it always pays off to carefully examine the documentation for the database before proceeding with an analysis, as data needs to be properly understood and cleaned before it can be analysed, even if for descriptive purposes.