DataWrangling

Jakub Flizikowski, Katarzyna Batlinska, Kacper Dziurgot

2024-03-14

Report overview

This is report of the data on tidy_week1. The report presents solutions to tasks from “tidy_week1”. Each task is also presented as a Python solution.

Exercises

Exercise 1 - gather()

Transforming data “student_data” to long format using gather() function.

##   student_id     name exam_type results
## 1          1 Student1  midterm1      96
## 2          2 Student2  midterm1      70
## 3          3 Student3  midterm1      91
## 4          4 Student4  midterm1      98
## 5          5 Student5  midterm1      78
## 6          6 Student6  midterm1      70

Providing both python and R language codes with the same output for comparison:
R:

student_data_long <- student_data %>% 
    gather(key = exam_type, value = results, -name, -student_id)

Python:

student_data_long = student_data.melt(
  id_vars=['student_id', 'name'], 
  var_name='exam_type', 
  value_name='results')

Exercise 2 - separate()

Transforming data “student_data” using separate() function, and reshaping it to long format

## # A tibble: 150 × 5
##    student_id name  age   exam     score
##         <int> <chr> <chr> <chr>    <chr>
##  1          1 John  21    midterm1 80   
##  2          1 John  21    midterm2 85   
##  3          1 John  21    final    75   
##  4          2 Alice 20    midterm1 75   
##  5          2 Alice 20    midterm2 78   
##  6          2 Alice 20    final    80   
##  7          3 Bob   22    midterm1 82   
##  8          3 Bob   22    midterm2 80   
##  9          3 Bob   22    final    85   
## 10          4 Emily 23    midterm1 88   
## # ℹ 140 more rows

Providing both python and R language codes with the same output for comparison:
R:

student_data <- separate(student_data, name_age, into=c("name","age"), sep="_")%>%
  mutate(row = row_number()) %>%
  separate_rows(exam_scores, sep = ",") %>%
  separate(exam_scores, into = c("exam", "score"), sep = "_")

Python:

student_data[['name', 'age']] = student_data['name_age'].str.split('_', expand=True)
student_data[['midterm1', 'midterm2', 'final']] = student_data['exam_scores'].str.split(',', expand=True)

Exercise 3 - complete()

Compliting data “student_data” using complete() function, and tidying it.

## # A tibble: 20 × 6
##    name    age   student_id midterm1    midterm2    final   
##    <chr>   <chr>      <int> <chr>       <chr>       <chr>   
##  1 John    21             1 midterm1_80 midterm2_85 final_75
##  2 Alice   20             2 midterm1_75 midterm2_78 final_80
##  3 Bob     22             3 midterm1_82 midterm2_80 final_85
##  4 Emily   23             4 midterm1_88 midterm2_90 final_92
##  5 Michael 22             5 midterm1_85 midterm2_86 final_88
##  6 Alice   21             6 0           0           0       
##  7 Alice   22             7 0           0           0       
##  8 Alice   23             8 0           0           0       
##  9 Bob     20             9 0           0           0       
## 10 Bob     21            10 0           0           0       
## 11 Bob     23            11 0           0           0       
## 12 Emily   20            12 0           0           0       
## 13 Emily   21            13 0           0           0       
## 14 Emily   22            14 0           0           0       
## 15 John    20            15 0           0           0       
## 16 John    22            16 0           0           0       
## 17 John    23            17 0           0           0       
## 18 Michael 20            18 0           0           0       
## 19 Michael 21            19 0           0           0       
## 20 Michael 23            20 0           0           0

Providing both python and R language codes with the same output for comparison:
R:

student_data <- student_data %>% 
  complete(name, age, fill = list(midterm1 = "0", midterm2 = "0", final = "0"), explicit = TRUE)
collapsed_data <- distinct(student_data, name, age, midterm1, midterm2, final, .keep_all = TRUE)
collapsed_data <- collapsed_data %>% arrange(student_id)

collapsed_data <- collapsed_data %>%
  mutate(student_id = seq_along(student_id))

Python:

all_combinations = pd.DataFrame({
    'name': student_data['name'].unique(),
    'age': [str(age) for age in range(20, 24)] 
})

student_data = all_combinations.merge(student_data, on=['name', 'age'], how='left')
student_data = student_data.fillna('0')
collapsed_data = student_data.drop_duplicates()
collapsed_data = collapsed_data.sort_values(by='student_id')
collapsed_data['student_id'] = range(1, len(collapsed_data) + 1)

Exercise 4 - spread()

Transform “student_data” from long format to wide format using spread() (in our case function called pivot.wider)

## # A tibble: 50 × 5
##    student_id name_age   midterm1 midterm2 final
##         <int> <chr>      <chr>    <chr>    <chr>
##  1          1 John_21    80       85       75   
##  2          2 Alice_20   75       78       80   
##  3          3 Bob_22     82       80       85   
##  4          4 Emily_23   88       90       92   
##  5          5 Michael_22 85       86       88   
##  6          6 John_21    80       85       75   
##  7          7 Alice_20   75       78       80   
##  8          8 Bob_22     82       80       85   
##  9          9 Emily_23   88       90       92   
## 10         10 Michael_22 85       86       88   
## # ℹ 40 more rows

Providing both python and R language codes with the same output for comparison:
R:

student_data <- student_data %>% 
pivot_wider(names_from = c("midterm1", "midterm2", "final"), values_from = c("result1", "result2", "result3"))

Python:

student_data = student_data.pivot_table(index=['student_id', 'name', 'age'],
                                        values=['midterm1', 'midterm2', 'final'],
                                        aggfunc='first').reset_index()

Exercise 5 - unite()

Uniting columns in data frame called “student_data” using unite() function.

##   student_id   name_age                      exam_scores
## 1          1    John_21 midterm1_80,midterm2_85,final_75
## 2          2   Alice_20 midterm1_75,midterm2_78,final_80
## 3          3     Bob_22 midterm1_82,midterm2_80,final_85
## 4          4   Emily_23 midterm1_88,midterm2_90,final_92
## 5          5 Michael_22 midterm1_85,midterm2_86,final_88
## 6          6    John_21 midterm1_80,midterm2_85,final_75

Providing both python and R language codes with the same output for comparison:
R:

student_data <- student_data %>% unite(name_age, name, age, sep = "_")

Python:

student_data["name_age"] = student_data["name"].astype(str) + "_" + student_data["age"].astype(str)

Report summary

This report was created to show how to solve the tasks from tidy_week1 report. Moreover, it shows the difference between R and Python language in the context of the above tasks.