I chose to use my own dataset for this assignment.
Let’s simulate the data:
set.seed(123)
# Create sample data
df <- data.frame(
  id = 1:5,
  gender = sample(c("Male", "Female"), 5, replace = TRUE),
  age = sample(18:65, 5, replace = TRUE),
  weight = sample(120:220, 5, replace = TRUE),
  height = sample(150:200, 5, replace = TRUE),
  systolic_1 = sample(100:140, 5, replace = TRUE),
  diastolic_1 = sample(60:90, 5, replace = TRUE),
  systolic_2 = sample(100:140, 5, replace = TRUE),
  diastolic_2 = sample(60:90, 5, replace = TRUE),
  smoking_status = sample(c("Never smoked", "Current smoker", "Former smoker"), 5, replace = TRUE)
)
# Make some of the data missing:
# columns 5:8 are height, systolic_1, diastolic_1, and systolic_2; column 10 is smoking_status
df[sample(1:5, 3), 5:8] <- NA
df[sample(1:5, 2), 10] <- NA
# write the data to a csv file:
write.csv(
  x = df,
  file = "blood_pressure_and_demographics.csv",
  row.names = FALSE
)
Read the data into R:
blood_pressure_and_demographics <- read.csv(
  file = "blood_pressure_and_demographics.csv"
)
# print the data (it has only 5 rows, so all of them are shown):
blood_pressure_and_demographics
## id gender age weight height systolic_1 diastolic_1 systolic_2 diastolic_2
## 1 1 Male 59 209 NA NA NA NA 68
## 2 2 Male 60 210 NA NA NA NA 69
## 3 3 Male 54 188 178 108 70 131 82
## 4 4 Female 31 210 184 118 66 106 86
## 5 5 Male 42 176 NA NA NA NA 87
## smoking_status
## 1 Never smoked
## 2 Former smoker
## 3 Never smoked
## 4 <NA>
## 5 <NA>
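A quick way to confirm where the missing values landed is to count the NAs in each column. The sketch below rebuilds a few columns by hand from the printed output above so that it runs on its own; `toy` and `na_counts` are illustrative names, not part of the original analysis.

```r
# Toy copy of a few columns from the printed data above (illustrative only)
toy <- data.frame(
  id = 1:5,
  height = c(NA, NA, 178, 184, NA),
  systolic_1 = c(NA, NA, 108, 118, NA),
  diastolic_2 = c(68, 69, 82, 86, 87),
  smoking_status = c("Never smoked", "Former smoker", "Never smoked", NA, NA)
)

# Count the missing values in each column
na_counts <- colSums(is.na(toy))
na_counts
```

Since the index range `5:8` stops at systolic_2, diastolic_2 keeps all of its values while height is blanked along with the first three blood pressure columns.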
Time to tidy and transform the data.
Load the libraries to use:
library(dplyr)
library(tidyr)
First, reshape the dataset into long format:
df_long <- blood_pressure_and_demographics %>%
  pivot_longer(
    cols = c(systolic_1, diastolic_1, systolic_2, diastolic_2),
    names_to = c(".value", "visit_number"),
    names_sep = "_"
  ) %>%
  mutate(visit_number = as.numeric(visit_number)) %>%
  arrange(id, visit_number)
df_long
## # A tibble: 10 × 9
## id gender age weight height smoking_status visit_number systolic diast…¹
## <int> <chr> <int> <int> <int> <chr> <dbl> <int> <int>
## 1 1 Male 59 209 NA Never smoked 1 NA NA
## 2 1 Male 59 209 NA Never smoked 2 NA 68
## 3 2 Male 60 210 NA Former smoker 1 NA NA
## 4 2 Male 60 210 NA Former smoker 2 NA 69
## 5 3 Male 54 188 178 Never smoked 1 108 70
## 6 3 Male 54 188 178 Never smoked 2 131 82
## 7 4 Female 31 210 184 <NA> 1 118 66
## 8 4 Female 31 210 184 <NA> 2 106 86
## 9 5 Male 42 176 NA <NA> 1 NA NA
## 10 5 Male 42 176 NA <NA> 2 NA 87
## # … with abbreviated variable name ¹diastolic
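The special `".value"` entry in `names_to` tells `pivot_longer()` to use the first part of each column name (the text before the `_`) as an output column name, while the second part becomes the `visit_number` value. A minimal sketch on a one-row data frame with made-up readings (the names `toy_wide` and `toy_long` are hypothetical):

```r
library(tidyr)

# One person, two visits, in wide format (made-up values)
toy_wide <- data.frame(
  id = 1,
  systolic_1 = 120, diastolic_1 = 80,
  systolic_2 = 125, diastolic_2 = 82
)

# "systolic_1" is split into the column name "systolic" and visit_number "1"
toy_long <- pivot_longer(
  toy_wide,
  cols = -id,
  names_to = c(".value", "visit_number"),
  names_sep = "_"
)
toy_long
```

The result has one row per visit, with `systolic` and `diastolic` as separate columns. `visit_number` comes back as character, which is why the pipeline above converts it with `as.numeric()`.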
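Calculate the mean systolic and diastolic blood pressure for each individual and visit.
To do that, group the data frame by id and visit_number, summarize each group with mean(), and drop the grouping with .groups = "drop".
Because mean() is called without na.rm = TRUE, any group with a missing reading returns NA.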
bp_means <- df_long %>%
  group_by(id, visit_number) %>%
  summarize(
    mean_systolic = mean(systolic),
    mean_diastolic = mean(diastolic),
    n_obs = n(),
    .groups = "drop"
  )
bp_means
## # A tibble: 10 × 5
## id visit_number mean_systolic mean_diastolic n_obs
## <int> <dbl> <dbl> <dbl> <int>
## 1 1 1 NA NA 1
## 2 1 2 NA 68 1
## 3 2 1 NA NA 1
## 4 2 2 NA 69 1
## 5 3 1 108 70 1
## 6 3 2 131 82 1
## 7 4 1 118 66 1
## 8 4 2 106 86 1
## 9 5 1 NA NA 1
## 10 5 2 NA 87 1
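Each id and visit_number pair holds exactly one row here (n_obs is 1 throughout), so the grouped means simply repeat the observed readings. If the goal were one average per individual across visits, one possible variant groups by id only and passes `na.rm = TRUE`. This is an assumption about intent, not the original analysis. The toy values are copied from ids 1 and 3 above, and `per_person` and `n_readings` are hypothetical names.

```r
library(dplyr)

# Long-format rows for ids 1 and 3, copied from the output above
toy_long <- data.frame(
  id = c(1, 1, 3, 3),
  visit_number = c(1, 2, 1, 2),
  systolic = c(NA, NA, 108, 131),
  diastolic = c(NA, 68, 70, 82)
)

per_person <- toy_long %>%
  group_by(id) %>%
  summarize(
    # na.rm = TRUE skips missing readings; the result is NaN when all are missing
    mean_systolic = mean(systolic, na.rm = TRUE),
    mean_diastolic = mean(diastolic, na.rm = TRUE),
    # number of non-missing systolic readings behind the mean
    n_readings = sum(!is.na(systolic)),
    .groups = "drop"
  )
per_person
```

Keeping a count such as `n_readings` next to each mean shows how many readings the average rests on, which matters when some visits are missing.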
Drop the raw systolic and diastolic columns from the long-format data, then join the blood pressure means back to it by id and visit_number:
df_clean <- df_long %>%
  select(-c(systolic, diastolic)) %>%
  left_join(bp_means, by = c("id", "visit_number"))
df_clean
## # A tibble: 10 × 10
## id gender age weight height smoking_status visit…¹ mean_…² mean_…³ n_obs
## <int> <chr> <int> <int> <int> <chr> <dbl> <dbl> <dbl> <int>
## 1 1 Male 59 209 NA Never smoked 1 NA NA 1
## 2 1 Male 59 209 NA Never smoked 2 NA 68 1
## 3 2 Male 60 210 NA Former smoker 1 NA NA 1
## 4 2 Male 60 210 NA Former smoker 2 NA 69 1
## 5 3 Male 54 188 178 Never smoked 1 108 70 1
## 6 3 Male 54 188 178 Never smoked 2 131 82 1
## 7 4 Female 31 210 184 <NA> 1 118 66 1
## 8 4 Female 31 210 184 <NA> 2 106 86 1
## 9 5 Male 42 176 NA <NA> 1 NA NA 1
## 10 5 Male 42 176 NA <NA> 2 NA 87 1
## # … with abbreviated variable names ¹visit_number, ²mean_systolic,
## # ³mean_diastolic
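After a join like this, it can help to check that no rows were gained or lost. The sketch below is a minimal, self-contained check on toy data copied from ids 3 and 4 above; `visits`, `means`, and `joined` are hypothetical names.

```r
library(dplyr)

# Left table: one row per id and visit
visits <- data.frame(
  id = c(3, 3, 4, 4),
  visit_number = c(1, 2, 1, 2),
  gender = c("Male", "Male", "Female", "Female")
)

# Right table: one summary row per id and visit
means <- data.frame(
  id = c(3, 3, 4, 4),
  visit_number = c(1, 2, 1, 2),
  mean_systolic = c(108, 131, 118, 106)
)

joined <- left_join(visits, means, by = c("id", "visit_number"))

# With a unique key on the right side, the row count must not change
stopifnot(nrow(joined) == nrow(visits))
joined
```

If the right-hand table had duplicate id and visit_number pairs, the join would add rows and the `stopifnot()` check would fail.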
The final output is a cleaned and transformed dataset that is ready for downstream analysis.
In this week’s discussion, you’re asked to find, discuss and cite another reference that shows how data can be used to provide insights to improving collaboration or another “soft skill” that is relevant to data scientists.
Data can provide insights into improving collaboration within an organization. Decentralized decision-makers empowered with data insights and local knowledge can collaborate with different stakeholders, including customers and even competitors, to create the best outcome for the business (Van Rijmenam, 2019).
A collaborative culture enabled by data insights lets employees share findings and best practices across the organization and make well-informed decisions at the right moment.
Therefore, promoting the use of advanced analytics across the business and giving employees shared access to data insights can strengthen collaboration, a soft skill relevant to data scientists.
Van Rijmenam, M. (2019, July 26). How Big Data will Drive Collaboration and Empowerment. Medium. Retrieved March 6, 2023, from https://medium.com/swlh/how-big-data-will-drive-collaboration-and-empowerment-d3bd7bebbcbc#