Disclaimer: I wrote this document to remember the lines of R code that were used in a project and to better understand how they work. Above all, it exists so that I can recall the code I wrote for dealing with a particular situation 😅. In future I will make sure to use reprex and produce reproducible examples of all the errors I get along the way, no matter how silly they are.
Having said that, feel free to use any piece of code presented here and make as many mistakes as you can.
Libraries
library(dplyr)
library(tidyr)
library(ggplot2)
Data wrangling: Here we have a simple IVF data-set with patient IDs, age, total number of oocytes and total number of matured oocytes. We need to group age, which is a discrete variable, into age groups. This is a common operation when dealing with raw data.
This solution can be used to form groups from raw data for any kind of discrete variable. Since we want several groups here, we need several conditions, so we use case_when:
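data_new <- data |>
  select(Age) |>
  mutate(age_group = case_when(
    Age <= 30 ~ "26-30",  # conditions are checked in order; the first match wins
    Age <= 35 ~ "31-35",
    Age <= 40 ~ "36-40",
    Age <= 45 ~ "41-45"   # ages above 45 match no condition and become NA
  ))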
The select() function is used to keep particular columns of the data-set. It can select one column with select(col_name) or several columns with select(col_name_1, col_name_2, ...).
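As a small illustration of select(), here is a sketch on a made-up data frame. The object `ivf`, its column names and all of its values are assumptions invented for this example, not the real data-set.

```r
library(dplyr)

# Hypothetical toy data; names and values are invented for illustration
ivf <- data.frame(
  patient_id      = 1:4,
  Age             = c(27, 33, 38, 41),
  total_oocytes   = c(12, 9, 7, 4),
  matured_oocytes = c(10, 7, 4, 2)
)

one_col  <- ivf |> select(Age)                 # keep a single column
two_cols <- ivf |> select(Age, total_oocytes)  # keep several columns

names(one_col)
names(two_cols)
```

The result is still a data frame in both cases, even when only one column is kept.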
The mutate() function is used to create new columns (or overwrite existing ones) from an expression, which can be a condition or any other calculation on the existing columns. mutate is used in this manner:
mutate(new_variable = expression)
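A minimal sketch of mutate() on its own, assuming a toy data frame. The column names `total_oocytes` and `matured_oocytes` and the derived `maturation_rate` are hypothetical names chosen for the example.

```r
library(dplyr)

# Hypothetical toy data; values are invented
ivf <- data.frame(
  total_oocytes   = c(12, 9, 7, 4),
  matured_oocytes = c(10, 7, 4, 2)
)

# Add a new column computed row by row from two existing columns
ivf_new <- ivf |>
  mutate(maturation_rate = matured_oocytes / total_oocytes)

ivf_new
```

The original columns are kept and the new one is appended at the end.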
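case_when works in the following manner. The conditions are checked from top to bottom, the first one that is TRUE decides the value, and a row that matches no condition gets NA:
case_when(
  condition on original_variable ~ "value of new variable",
  . ,
  . ,
  .
)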
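To see the order of evaluation and the NA behaviour in practice, here is a sketch on invented ages. The data frame `ages`, the `"over 45"` label and the `TRUE ~ ...` catch-all line are my own additions for illustration and are not part of the project code.

```r
library(dplyr)

# Hypothetical ages, including boundary values and one age above 45
ages <- data.frame(Age = c(26, 30, 31, 40, 45, 47))

grouped <- ages |>
  mutate(
    # Same conditions as in the project code: no match gives NA
    age_group = case_when(
      Age <= 30 ~ "26-30",
      Age <= 35 ~ "31-35",
      Age <= 40 ~ "36-40",
      Age <= 45 ~ "41-45"
    ),
    # Same conditions plus a catch-all as the last line
    age_group_fallback = case_when(
      Age <= 30 ~ "26-30",
      Age <= 35 ~ "31-35",
      Age <= 40 ~ "36-40",
      Age <= 45 ~ "41-45",
      TRUE      ~ "over 45"
    )
  )

grouped
```

Because the first matching condition wins, an age of 30 lands in "26-30" even though it also satisfies Age <= 35. Note also that Age <= 30 would catch any age below 26, so the "26-30" label assumes no such patients are in the data.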
If we want to form only two groups, we can use if_else as the condition inside mutate().
data_new <- data |>
  select(Age) |>
  mutate(age_group = if_else(
    Age <= 30, "less than", "greater than"
  ))
if_else works in the following manner:
if_else(condition on original_variable, VALUE_when_true, VALUE_when_false)
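A short sketch of if_else() on invented ages, including a missing value. The data frame `ages` and the two labels are hypothetical and chosen only to make the groups explicit.

```r
library(dplyr)

# Hypothetical ages; the last one is missing
ages <- data.frame(Age = c(28, 30, 36, NA))

two_groups <- ages |>
  mutate(age_group = if_else(Age <= 30, "30 or younger", "older than 30"))

two_groups
```

When the condition itself is NA, the result is NA as well. The two values also need to be of compatible types (here both are character strings).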
Data-visualization: Now we want to visualize some variables of the data-set as a scatter plot along with a smoothed trend curve.
For this we use the ggplot2 package, and the code is as follows:
ggplot(data = dataset,
       aes(x = x_var, y = y_var)) +
  geom_point(shape = "diamond filled", fill = "colour_code", size = 2, stroke = 2) +
  geom_smooth(method = "auto", fill = "colour_code", colour = "colour_code for line") +
  labs(title = ".....",
       x = ".....",
       y = ".....") +
  scale_x_continuous(limits = c(26, 42), breaks = seq(26, 42, by = 2)) +
  theme(plot.title = element_text(hjust = 0.5))
Here, in geom_point(), shape sets the shape of the points. The shape argument accepts many values (a name or an integer from 0 to 25), but fill only has an effect on the filled shapes: “circle filled”, “square filled”, “diamond filled”, “triangle filled” and “triangle down filled”. fill sets the colour inside the point, size sets the size of the point, and stroke sets the width of the border of the shape.
The other thing of interest here is that scale_x_continuous() lets us change the range of the x-axis (limits) and the spacing of its tick marks (breaks), and theme(plot.title = element_text(hjust=0.5)) is used to center the title of the graph.
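Here is a runnable version of the same plot on simulated data, as a sketch only. The data frame `toy`, its values, the hex colour codes and the axis labels are all invented for illustration. I also set method = "loess" explicitly, which as far as I know is what method = "auto" picks for a small data set like this one.

```r
library(ggplot2)

# Hypothetical data: ages 26 to 42 with invented oocyte counts
set.seed(1)
toy <- data.frame(Age = rep(26:42, each = 3))
toy$total_oocytes <- round(20 - 0.4 * (toy$Age - 26) + rnorm(nrow(toy), sd = 2))

p <- ggplot(data = toy, aes(x = Age, y = total_oocytes)) +
  # Filled diamond: fill colours the inside, stroke sets the border width
  geom_point(shape = "diamond filled", fill = "#1b9e77", size = 2, stroke = 1) +
  # Smoothed trend line with its confidence band
  geom_smooth(method = "loess", formula = y ~ x,
              fill = "#d95f02", colour = "#7570b3") +
  labs(title = "Total oocytes by age (toy data)",
       x = "Age (years)",
       y = "Total oocytes") +
  # x-axis from 26 to 42 with a tick every 2 years
  scale_x_continuous(limits = c(26, 42), breaks = seq(26, 42, by = 2)) +
  # Center the title
  theme(plot.title = element_text(hjust = 0.5))

p  # print the plot
```

One thing to keep in mind: limits in scale_x_continuous() drops observations outside the range before the smoother is fitted, so it is not just a zoom.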
Now we need to summarize the basic attributes of a particular variable. This can be done with the summarise function from dplyr.
summarise(dataset, avg = mean(variable_name)) # for mean
The same pattern can be used to find the standard deviation (sd), minimum (min), maximum (max), median (median) and variance (var), as well as the number of values (n) and the number of distinct values in a vector (n_distinct). Note that n() is called without any argument.
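Several of these summaries can be computed in a single summarise() call. The sketch below assumes a toy data frame; `ivf`, its values and the names of the output columns are hypothetical.

```r
library(dplyr)

# Hypothetical toy data; values are invented
ivf <- data.frame(
  Age           = c(27, 27, 33, 38, 41),
  total_oocytes = c(12, 10, 9, 7, 4)
)

stats <- ivf |>
  summarise(
    avg_oocytes    = mean(total_oocytes),
    sd_oocytes     = sd(total_oocytes),
    min_oocytes    = min(total_oocytes),
    max_oocytes    = max(total_oocytes),
    median_oocytes = median(total_oocytes),
    var_oocytes    = var(total_oocytes),
    n_rows         = n(),              # number of rows, no argument needed
    n_ages         = n_distinct(Age)   # number of distinct ages
  )

stats
```

The result is a data frame with one row and one column per summary.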