Report overview
This is report of the data on tidy_week1. The report presents solutions to tasks from “tidy_week1”. Each task is also presented as a Python solution.
Exercises
Exercise 1 - gather()
Transforming data “student_data” to long format using gather() function.
## student_id name exam_type results
## 1 1 Student1 midterm1 96
## 2 2 Student2 midterm1 70
## 3 3 Student3 midterm1 91
## 4 4 Student4 midterm1 98
## 5 5 Student5 midterm1 78
## 6 6 Student6 midterm1 70
Providing both python and R language codes with the same output for
comparison:
R:
student_data_long <- student_data %>%
gather(key = exam_type, value = results, -name, -student_id)
Python:
student_data_long = student_data.melt(
id_vars=['student_id', 'name'],
var_name='exam_type',
value_name='results')
Exercise 2 - separate()
Transforming data “student_data” using separate() function, and reshaping it to long format
## # A tibble: 150 × 5
## student_id name age exam score
## <int> <chr> <chr> <chr> <chr>
## 1 1 John 21 midterm1 80
## 2 1 John 21 midterm2 85
## 3 1 John 21 final 75
## 4 2 Alice 20 midterm1 75
## 5 2 Alice 20 midterm2 78
## 6 2 Alice 20 final 80
## 7 3 Bob 22 midterm1 82
## 8 3 Bob 22 midterm2 80
## 9 3 Bob 22 final 85
## 10 4 Emily 23 midterm1 88
## # ℹ 140 more rows
Providing both python and R language codes with the same output for
comparison:
R:
student_data <- separate(student_data, name_age, into=c("name","age"), sep="_")%>%
mutate(row = row_number()) %>%
separate_rows(exam_scores, sep = ",") %>%
separate(exam_scores, into = c("exam", "score"), sep = "_")
Python:
student_data[['name', 'age']] = student_data['name_age'].str.split('_', expand=True)
student_data[['midterm1', 'midterm2', 'final']] = student_data['exam_scores'].str.split(',', expand=True)
Exercise 3 - complete()
Compliting data “student_data” using complete() function, and tidying it.
## # A tibble: 20 × 6
## name age student_id midterm1 midterm2 final
## <chr> <chr> <int> <chr> <chr> <chr>
## 1 John 21 1 midterm1_80 midterm2_85 final_75
## 2 Alice 20 2 midterm1_75 midterm2_78 final_80
## 3 Bob 22 3 midterm1_82 midterm2_80 final_85
## 4 Emily 23 4 midterm1_88 midterm2_90 final_92
## 5 Michael 22 5 midterm1_85 midterm2_86 final_88
## 6 Alice 21 6 0 0 0
## 7 Alice 22 7 0 0 0
## 8 Alice 23 8 0 0 0
## 9 Bob 20 9 0 0 0
## 10 Bob 21 10 0 0 0
## 11 Bob 23 11 0 0 0
## 12 Emily 20 12 0 0 0
## 13 Emily 21 13 0 0 0
## 14 Emily 22 14 0 0 0
## 15 John 20 15 0 0 0
## 16 John 22 16 0 0 0
## 17 John 23 17 0 0 0
## 18 Michael 20 18 0 0 0
## 19 Michael 21 19 0 0 0
## 20 Michael 23 20 0 0 0
Providing both python and R language codes with the same output for
comparison:
R:
student_data <- student_data %>%
complete(name, age, fill = list(midterm1 = "0", midterm2 = "0", final = "0"), explicit = TRUE)
collapsed_data <- distinct(student_data, name, age, midterm1, midterm2, final, .keep_all = TRUE)
collapsed_data <- collapsed_data %>% arrange(student_id)
collapsed_data <- collapsed_data %>%
mutate(student_id = seq_along(student_id))
Python:
all_combinations = pd.DataFrame({
'name': student_data['name'].unique(),
'age': [str(age) for age in range(20, 24)]
})
student_data = all_combinations.merge(student_data, on=['name', 'age'], how='left')
student_data = student_data.fillna('0')
collapsed_data = student_data.drop_duplicates()
collapsed_data = collapsed_data.sort_values(by='student_id')
collapsed_data['student_id'] = range(1, len(collapsed_data) + 1)
Exercise 4 - spread()
Transform “student_data” from long format to wide format using spread() (in our case function called pivot.wider)
## # A tibble: 50 × 5
## student_id name_age midterm1 midterm2 final
## <int> <chr> <chr> <chr> <chr>
## 1 1 John_21 80 85 75
## 2 2 Alice_20 75 78 80
## 3 3 Bob_22 82 80 85
## 4 4 Emily_23 88 90 92
## 5 5 Michael_22 85 86 88
## 6 6 John_21 80 85 75
## 7 7 Alice_20 75 78 80
## 8 8 Bob_22 82 80 85
## 9 9 Emily_23 88 90 92
## 10 10 Michael_22 85 86 88
## # ℹ 40 more rows
Providing both python and R language codes with the same output for
comparison:
R:
student_data <- student_data %>%
pivot_wider(names_from = c("midterm1", "midterm2", "final"), values_from = c("result1", "result2", "result3"))
Python:
student_data = student_data.pivot_table(index=['student_id', 'name', 'age'],
values=['midterm1', 'midterm2', 'final'],
aggfunc='first').reset_index()
Exercise 5 - unite()
Uniting columns in data frame called “student_data” using unite() function.
## student_id name_age exam_scores
## 1 1 John_21 midterm1_80,midterm2_85,final_75
## 2 2 Alice_20 midterm1_75,midterm2_78,final_80
## 3 3 Bob_22 midterm1_82,midterm2_80,final_85
## 4 4 Emily_23 midterm1_88,midterm2_90,final_92
## 5 5 Michael_22 midterm1_85,midterm2_86,final_88
## 6 6 John_21 midterm1_80,midterm2_85,final_75
Providing both python and R language codes with the same output for
comparison:
R:
student_data <- student_data %>% unite(name_age, name, age, sep = "_")
Python:
student_data["name_age"] = student_data["name"].astype(str) + "_" + student_data["age"].astype(str)
Report summary
This report was created to show how to solve the tasks from tidy_week1 report. Moreover, it shows the difference between R and Python language in the context of the above tasks.