Part-1

Disclaimer: This document is a record of the R code I used in a project, written so that I can understand its implementation better. Above all, it exists so that I can remember the code I wrote for dealing with a particular situation 😅. In future I will use reprex to produce reproducible examples of every error I get along the way, no matter how silly it is.

Having said that, feel free to use any piece of code presented here and make as many mistakes as you can.

Libraries

library(dplyr)
library(tidyr)
library(ggplot2)

Problem-1

Data wrangling: We have a simple IVF dataset with patient IDs, age, total number of oocytes, and total number of matured oocytes. We need to group age, which is a discrete variable, into age groups. This is a common operation when dealing with raw data.

Solution-1

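This solution can be used to form groups from raw data for any kind of discrete variable. Since we want several groups, we need several conditions, so we use case_when(). The conditions are checked in order and each row gets the label of the first condition it satisfies, so Age <= 35 only catches ages 31 to 35 because Age <= 30 has already been handled.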

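data_new <- data |>
  select(Age) |>                 # keep only the Age column
  mutate(age_group = case_when(  # conditions are tested top to bottom
    Age <= 30 ~ "26-30",         # also catches any age below 26
    Age <= 35 ~ "31-35",
    Age <= 40 ~ "36-40",
    Age <= 45 ~ "41-45"          # ages above 45 match nothing and become NA
  ))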

Code explanation

select() picks particular columns from the dataset. It can select one column, select(col_name), or several, select(col_name_1, col_name_2, ...). mutate() creates a new column (or overwrites an existing one) from an expression evaluated on the existing columns. mutate() is used in this manner:

mutate(new_variable = expression)
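As a minimal sketch of both functions together, assuming a small made-up data frame (the names `toy`, `ID`, `Oocytes`, and `age_in_months` and all values are invented for illustration and are not the project data):

```r
library(dplyr)

# Hypothetical toy data, not the project's IVF dataset
toy <- data.frame(
  ID      = 1:4,
  Age     = c(27, 33, 38, 41),
  Oocytes = c(12, 9, 7, 4)
)

# Keep two columns, then add a column computed from an existing one
toy_small <- toy |>
  select(ID, Age) |>
  mutate(age_in_months = Age * 12)

names(toy_small)
```

The Oocytes column is gone after select(), which is worth remembering if later steps still need it.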

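case_when() works in the following manner:

case_when(
  condition_on_original_variable ~ "value for the new variable",
  . ,
  . ,
  .
)

Each line pairs a logical condition (left of ~) with the value to assign when that condition is the first one that is TRUE. Rows that match no condition get NA.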
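To see the ordering and the NA behaviour in action, here is a small sketch on invented ages. The "under 26" and "over 45" groups are my own additions for illustration; they are not part of the original solution.

```r
library(dplyr)

# Hypothetical ages chosen to hit every branch, including out-of-range values
toy <- data.frame(Age = c(25, 28, 30, 31, 40, 44, 50))

toy_grouped <- toy |>
  mutate(age_group = case_when(
    Age < 26  ~ "under 26",  # assumed extra group, not in the original code
    Age <= 30 ~ "26-30",     # only reached when Age >= 26
    Age <= 35 ~ "31-35",
    Age <= 40 ~ "36-40",
    Age <= 45 ~ "41-45",
    TRUE      ~ "over 45"    # catch-all so that no row is left as NA
  ))

toy_grouped$age_group
```

Recent versions of dplyr (1.1.0 and later) also accept `.default = "over 45"` in place of the `TRUE ~` line.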

Solution-2

If we only want to form two groups, we can use if_else() inside mutate().

data_new <- data |>
  select(Age) |>
  mutate(age_group = if_else(
    Age <= 30, "30 or younger", "older than 30"
  ))

Code explanation

if_else() works in the following manner:

if_else(condition_on_original_variable, value_when_true, value_when_false)

The two values must be of compatible types (for example, both character strings).
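A short sketch on invented ages, including a missing value. By default a missing condition gives NA in the result; the `missing` argument of if_else() lets us pick a label instead. The labels and the `toy` data are assumptions for illustration.

```r
library(dplyr)

# Hypothetical ages; the last one is missing on purpose
toy <- data.frame(Age = c(28, 30, 36, NA))

toy_two <- toy |>
  mutate(age_group = if_else(
    Age <= 30, "30 or younger", "older than 30",
    missing = "unknown"  # used when the condition is NA
  ))

toy_two$age_group
```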

Problem-2

Data visualization: Now we want to visualize some variables in the dataset as a scatter plot along with a smoothed trend curve.

Solution-1

To do this we use the ggplot2 package. The code is as follows:

ggplot(data = dataset,
       aes(x = x_var, y = y_var)) +
  geom_point(shape = "diamond filled", fill = "colour_code", size = 2, stroke = 2) +
  geom_smooth(method = "auto", fill = "colour_code", colour = "colour_code for line") +
  labs(title = ".....",
       x = ".....",
       y = ".....") +
  scale_x_continuous(limits = c(26, 42), breaks = seq(26, 42, by = 2)) +
  theme(plot.title = element_text(hjust = 0.5))

Here, in geom_point(), shape sets the shape of the points. It accepts an integer code from 0 to 25 or a shape name. Only the filled shapes (codes 21 to 25) use the fill aesthetic: "circle filled", "square filled", "diamond filled", "triangle filled", and "triangle down filled". fill sets the colour inside the point, size sets the size of the point, and stroke sets the width of the border of the shape.

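The other thing of interest here is scale_x_continuous(), which sets the range of the x-axis (limits) and the spacing of the tick marks (breaks). Note that observations outside limits are dropped before the trend curve is fitted. theme(plot.title = element_text(hjust = 0.5)) is used to center the title of the graph.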
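For a version that runs as-is, here is a sketch on simulated data. Everything specific in it is an assumption for illustration: the `toy` data frame, the downward trend, the colours, the labels, and the choice of method = "loess" for the smoother.

```r
library(ggplot2)

# Hypothetical data: ages 26 to 42 with a made-up downward trend plus noise
set.seed(1)
toy <- data.frame(Age = rep(26:42, each = 3))
toy$Oocytes <- 20 - 0.4 * (toy$Age - 26) + rnorm(nrow(toy), sd = 2)

p <- ggplot(data = toy, aes(x = Age, y = Oocytes)) +
  # filled diamond: border drawn with stroke, inside coloured with fill
  geom_point(shape = "diamond filled", fill = "steelblue", size = 2, stroke = 1) +
  # smoothed trend; fill colours the confidence band, colour the line
  geom_smooth(method = "loess", formula = y ~ x,
              fill = "grey80", colour = "darkblue") +
  labs(title = "Oocytes by age (simulated data)",
       x = "Age (years)",
       y = "Number of oocytes") +
  scale_x_continuous(limits = c(26, 42), breaks = seq(26, 42, by = 2)) +
  theme(plot.title = element_text(hjust = 0.5))

p  # printing the object draws the plot
```

Storing the plot in `p` first makes it easy to add or swap layers later without retyping the whole call.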

Problem-3

Now we need to compute basic summary statistics for a particular variable.

Solution-1

This can be done with the summarise() function from dplyr.

summarise(dataset, avg = mean(variable_name)) # for mean

The same pattern can be used to find the standard deviation (sd()), minimum (min()), maximum (max()), median (median()), variance (var()), number of rows (n(), which takes no arguments), and number of distinct values in a vector (n_distinct()).
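Several statistics can be computed in a single summarise() call. A minimal sketch on invented ages (the `toy` data and the output column names are assumptions for illustration):

```r
library(dplyr)

# Hypothetical ages, not the project data
toy <- data.frame(Age = c(28, 30, 30, 36, 41))

toy_summary <- toy |>
  summarise(
    avg      = mean(Age),
    std_dev  = sd(Age),
    youngest = min(Age),
    oldest   = max(Age),
    middle   = median(Age),
    n_rows   = n(),            # n() counts rows and takes no arguments
    n_unique = n_distinct(Age) # 30 appears twice, so it is counted once
  )

toy_summary
```

If the column contains missing values, functions such as mean() and sd() return NA unless na.rm = TRUE is passed, for example mean(Age, na.rm = TRUE).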

Exploring my way through forest of data analysis - a beginner’s guide to simple data wrangling and data visualisation.

Utpal Shetty

2024-08-14
