Assignment 6 - Project 2

I chose to use my own dataset for this assignment.

Let’s simulate the data:

set.seed(123)

# Create sample data
df <- data.frame(
  id = 1:5,
  gender = sample(c("Male", "Female"), 5, replace = TRUE),
  age = sample(18:65, 5, replace = TRUE),
  weight = sample(120:220, 5, replace = TRUE),
  height = sample(150:200, 5, replace = TRUE),
  systolic_1 = sample(100:140, 5, replace = TRUE),
  diastolic_1 = sample(60:90, 5, replace = TRUE),
  systolic_2 = sample(100:140, 5, replace = TRUE),
  diastolic_2 = sample(60:90, 5, replace = TRUE),
  smoking_status = sample(c("Never smoked", "Current smoker", "Former smoker"), 5, replace = TRUE)
)

# Make some of the data missing:
# columns 5:8 are height, systolic_1, diastolic_1, and systolic_2
df[sample(1:5, 3), 5:8] <- NA
# column 10 is smoking_status
df[sample(1:5, 2), 10] <- NA

# write the data to a csv file:
write.csv(
  x = df,
  file = "blood_pressure_and_demographics.csv",
  row.names = FALSE
)

Read the data into R:

blood_pressure_and_demographics <- read.csv(
  file = "blood_pressure_and_demographics.csv"
)

# print the data (there are only 5 rows, so show all of them):
blood_pressure_and_demographics

##   id gender age weight height systolic_1 diastolic_1 systolic_2 diastolic_2
## 1  1   Male  59    209     NA         NA          NA         NA          68
## 2  2   Male  60    210     NA         NA          NA         NA          69
## 3  3   Male  54    188    178        108          70        131          82
## 4  4 Female  31    210    184        118          66        106          86
## 5  5   Male  42    176     NA         NA          NA         NA          87
##   smoking_status
## 1   Never smoked
## 2  Former smoker
## 3   Never smoked
## 4           <NA>
## 5           <NA>

Time to tidy and transform the data.

Load the libraries to use:

library(dplyr)
library(tidyr)

First, reshape the dataset into long format:

df_long <- blood_pressure_and_demographics %>%
  pivot_longer(cols = c(systolic_1, diastolic_1, systolic_2, diastolic_2),
               names_to = c(".value", "visit_number"),
               names_sep = "_") %>%
  mutate(visit_number = as.numeric(visit_number)) %>%
  arrange(id, visit_number)

df_long

## # A tibble: 10 × 9
##       id gender   age weight height smoking_status visit_number systolic diast…¹
##    <int> <chr>  <int>  <int>  <int> <chr>                 <dbl>    <int>   <int>
##  1     1 Male      59    209     NA Never smoked              1       NA      NA
##  2     1 Male      59    209     NA Never smoked              2       NA      68
##  3     2 Male      60    210     NA Former smoker             1       NA      NA
##  4     2 Male      60    210     NA Former smoker             2       NA      69
##  5     3 Male      54    188    178 Never smoked              1      108      70
##  6     3 Male      54    188    178 Never smoked              2      131      82
##  7     4 Female    31    210    184 <NA>                      1      118      66
##  8     4 Female    31    210    184 <NA>                      2      106      86
##  9     5 Male      42    176     NA <NA>                      1       NA      NA
## 10     5 Male      42    176     NA <NA>                      2       NA      87
## # … with abbreviated variable name ¹diastolic
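
The ".value" entry in names_to tells pivot_longer() to take the first part of each column name (systolic or diastolic) as an output column, while the part after the underscore goes into visit_number. A minimal sketch of that behaviour, assuming a hypothetical data frame named toy with made-up readings (not part of the assignment data):

```r
library(tidyr)

# Hypothetical toy data: two people, two visits each (values are made up)
toy <- data.frame(
  id = 1:2,
  systolic_1 = c(110, 120),
  diastolic_1 = c(70, 80),
  systolic_2 = c(115, 125),
  diastolic_2 = c(75, 85)
)

# Split each column name at "_": the first part becomes a column name,
# the second part becomes the value of visit_number
toy_long <- pivot_longer(
  toy,
  cols = -id,
  names_to = c(".value", "visit_number"),
  names_sep = "_"
)

# One row per id and visit, with systolic and diastolic as separate columns
toy_long
```

Note that visit_number comes out as a character column, which is why the pipeline above converts it with as.numeric().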

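Calculate the mean systolic and diastolic blood pressure for each individual and visit.

To do that, group the data frame by id and visit_number, summarize each group with mean(), and drop the grouping with .groups = "drop".

Each id and visit_number pair has exactly one row in this dataset (n_obs counts the rows per group), so each mean equals the single reading, and a missing reading stays NA.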

bp_means <- df_long %>%
  group_by(id, visit_number) %>%
  summarize(
    mean_systolic = mean(systolic),
    mean_diastolic = mean(diastolic),
    n_obs = n(),
    .groups = "drop"
  )

bp_means

## # A tibble: 10 × 5
##       id visit_number mean_systolic mean_diastolic n_obs
##    <int>        <dbl>         <dbl>          <dbl> <int>
##  1     1            1            NA             NA     1
##  2     1            2            NA             68     1
##  3     2            1            NA             NA     1
##  4     2            2            NA             69     1
##  5     3            1           108             70     1
##  6     3            2           131             82     1
##  7     4            1           118             66     1
##  8     4            2           106             86     1
##  9     5            1            NA             NA     1
## 10     5            2            NA             87     1
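
Because every group above holds a single row, the means simply repeat the readings. If the goal were one average per individual across visits, one option is to group by id only and skip missing readings with na.rm = TRUE. The sketch below assumes a hypothetical data frame named readings with made-up values; it is an illustration, not the method used in this assignment:

```r
library(dplyr)

# Hypothetical long-format readings (values are made up); NA marks a missing reading
readings <- data.frame(
  id = c(1, 1, 2, 2),
  visit_number = c(1, 2, 1, 2),
  systolic = c(NA, 120, 108, 131),
  diastolic = c(NA, 68, 70, 82)
)

per_person <- readings %>%
  group_by(id) %>%
  summarize(
    mean_systolic = mean(systolic, na.rm = TRUE),
    mean_diastolic = mean(diastolic, na.rm = TRUE),
    n_systolic = sum(!is.na(systolic)),  # readings that contributed to the mean
    .groups = "drop"
  )

per_person
```

With na.rm = TRUE, an individual whose readings are all missing gets NaN rather than NA, so the count column is useful for spotting means built on few or no readings.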

Join the blood pressure means back to the long-format dataset (df_long) by id and visit_number, replacing the raw systolic and diastolic columns:

df_clean <- df_long %>%
  select(-c(systolic, diastolic)) %>%
  left_join(bp_means, by = c("id", "visit_number"))

df_clean

## # A tibble: 10 × 10
##       id gender   age weight height smoking_status visit…¹ mean_…² mean_…³ n_obs
##    <int> <chr>  <int>  <int>  <int> <chr>            <dbl>   <dbl>   <dbl> <int>
##  1     1 Male      59    209     NA Never smoked         1      NA      NA     1
##  2     1 Male      59    209     NA Never smoked         2      NA      68     1
##  3     2 Male      60    210     NA Former smoker        1      NA      NA     1
##  4     2 Male      60    210     NA Former smoker        2      NA      69     1
##  5     3 Male      54    188    178 Never smoked         1     108      70     1
##  6     3 Male      54    188    178 Never smoked         2     131      82     1
##  7     4 Female    31    210    184 <NA>                 1     118      66     1
##  8     4 Female    31    210    184 <NA>                 2     106      86     1
##  9     5 Male      42    176     NA <NA>                 1      NA      NA     1
## 10     5 Male      42    176     NA <NA>                 2      NA      87     1
## # … with abbreviated variable names ¹visit_number, ²mean_systolic,
## #   ³mean_diastolic

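The final output is a tidied, long-format dataset that is ready for downstream analysis. The missing values introduced during simulation are still present and would need to be handled before modeling.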
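
Before any downstream analysis, it can help to check how many values are missing in each column. A minimal sketch using base R, assuming a hypothetical stand-in data frame named example_clean with made-up values (the same call would work on df_clean):

```r
# Hypothetical stand-in for df_clean (values are made up)
example_clean <- data.frame(
  id = c(1, 1, 2, 2),
  height = c(NA, NA, 178, 178),
  mean_systolic = c(NA, NA, 108, 131),
  mean_diastolic = c(NA, 68, 70, 82)
)

# Number of missing values in each column
na_counts <- colSums(is.na(example_clean))
na_counts
```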

This week’s discussion

In this week’s discussion, you’re asked to find, discuss, and cite another reference that shows how data can be used to provide insights into improving collaboration or another “soft skill” that is relevant to data scientists.

Solution

Data can provide insights into improving collaboration within an organization. Decentralized decision-makers who are empowered with data insights and local knowledge can collaborate with different stakeholders, including customers and even competitors, to create the best outcome for the business.

A collaborative culture enabled by data insights allows employees to share insights and best practices across the organization and to make the best decisions at the right moment.

Therefore, promoting the use of advanced analytics across the business can strengthen collaboration, a soft skill relevant to data scientists.

Reference

Van Rijmenam, M. (2019, July 26). How Big Data will Drive Collaboration and Empowerment. Medium. Retrieved March 6, 2023, from https://medium.com/swlh/how-big-data-will-drive-collaboration-and-empowerment-d3bd7bebbcbc#

Mohammed Rahman

2023-03-06
