Data Science Group Assignment
Loading Packages
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.2 ✔ purrr 1.0.1
## ✔ tibble 3.2.1 ✔ dplyr 1.1.2
## ✔ tidyr 1.3.0 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## Warning: package 'ggplot2' was built under R version 4.2.3
## Warning: package 'tibble' was built under R version 4.2.3
## Warning: package 'tidyr' was built under R version 4.2.3
## Warning: package 'purrr' was built under R version 4.2.3
## Warning: package 'dplyr' was built under R version 4.2.3
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
Data
Global Country Information Dataset
The Global Country Information Dataset is a comprehensive dataset that provides a wide range of information about every country in the world, covering many indicators and attributes. It includes demographic statistics, economic indicators, environmental factors, healthcare metrics, education statistics, and more. Because every country is represented, the dataset offers a complete global perspective on many aspects of a country and supports in-depth analysis and cross-country comparison.
The columns in the data are described below:
- Country: Name of the country.
- Density (P/Km2): Population density measured in persons per square kilometer.
- Abbreviation: Abbreviation or code representing the country.
- Agricultural Land (%): Percentage of land area used for agricultural purposes.
- Land Area (Km2): Total land area of the country in square kilometers.
- Armed Forces Size: Size of the armed forces in the country.
- Birth Rate: Number of births per 1,000 population per year.
- Calling Code: International calling code for the country.
- Capital/Major City: Name of the capital or major city.
- CO2 Emissions: Carbon dioxide emissions in tons.
- CPI: Consumer Price Index, a measure of inflation and purchasing power.
- CPI Change (%): Percentage change in the Consumer Price Index compared to the previous year.
- Currency_Code: Currency code used in the country.
- Fertility Rate: Average number of children born to a woman during her lifetime.
- Forested Area (%): Percentage of land area covered by forests.
- Gasoline_Price: Price of gasoline per liter in local currency.
- GDP: Gross Domestic Product, the total value of goods and services produced in the country.
- Gross Primary Education Enrollment (%): Gross enrollment ratio for primary education.
- Gross Tertiary Education Enrollment (%): Gross enrollment ratio for tertiary education.
- Infant Mortality: Number of deaths per 1,000 live births before reaching one year of age.
- Largest City: Name of the country’s largest city.
- Life Expectancy: Average number of years a newborn is expected to live.
- Maternal Mortality Ratio: Number of maternal deaths per 100,000 live births.
- Minimum Wage: Minimum wage level in local currency.
- Official Language: Official language(s) spoken in the country.
- Out of Pocket Health Expenditure (%): Percentage of total health expenditure paid out-of-pocket by individuals.
- Physicians per Thousand: Number of physicians per thousand people.
- Population: Total population of the country.
- Population: Labor Force Participation (%): Percentage of the population that is part of the labor force.
- Tax Revenue (%): Tax revenue as a percentage of GDP.
- Total Tax Rate: Overall tax burden as a percentage of commercial profits.
- Unemployment Rate: Percentage of the labor force that is unemployed.
- Urban Population: Percentage of the population living in urban areas.
- Latitude: Latitude coordinate of the country’s location.
- Longitude: Longitude coordinate of the country’s location.
Import Data
country_data <- read_csv("C:/Users/user/Downloads/archive(1)/world-data-2023.csv", show_col_types = FALSE)
GROUP ASSIGNMENT
1. In R, missing data is usually marked with NA. Count how many missing values the dataset contains!
missing_counts <- country_data %>%
  summarise(across(everything(), ~ sum(is.na(.))))
print(missing_counts)
## # A tibble: 1 × 35
##   Country `Density\n(P/Km2)` Abbreviation `Agricultural Land( %)`
##     <int>              <int>        <int>                   <int>
## 1       0                  0            7                       7
## # ℹ 31 more variables: `Land Area(Km2)` <int>, `Armed Forces size` <int>,
## # `Birth Rate` <int>, `Calling Code` <int>, `Capital/Major City` <int>,
## # `Co2-Emissions` <int>, CPI <int>, `CPI Change (%)` <int>,
## # `Currency-Code` <int>, `Fertility Rate` <int>, `Forested Area (%)` <int>,
## # `Gasoline Price` <int>, GDP <int>,
## # `Gross primary education enrollment (%)` <int>,
## # `Gross tertiary education enrollment (%)` <int>, …
total_missing <- sum(is.na(country_data))
cat("Total missing values in the dataset:", total_missing, "\n")
## Total missing values in the dataset: 337
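The tibble printout above shows only the first four of the 35 columns. As a minimal sketch on a hypothetical toy tibble (`toy` and its values are illustrative, not taken from the real dataset), the same per-column count can also be obtained as a named vector with base R's `colSums()`, which prints every column at once:

```r
library(dplyr)
library(tibble)

# Hypothetical toy data, not taken from the real dataset
toy <- tibble(
  Country = c("A", "B", "C"),
  GDP     = c("$1,000", NA, "$3,000"),
  CPI     = c(NA, NA, "110.5")
)

# Missing values per column, as a one-row tibble
na_per_column <- toy %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

# The same counts as a named numeric vector (base R)
na_base <- colSums(is.na(toy))

# Total number of missing cells in the whole table
na_total <- sum(is.na(toy))

print(na_base)
```

Applied to the real data, `colSums(is.na(country_data))` would list the count for all 35 columns without truncation.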
2. Remove all missing data and show that all missing data has been removed!
dataset_clean <- country_data %>%
  drop_na()
# Check whether all missing data has been removed
if (sum(is.na(dataset_clean)) == 0) {
  cat("All missing data has been removed.\n")
} else {
  cat("The dataset still contains missing data.\n")
}
## All missing data has been removed.
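Because `drop_na()` discards every row that has at least one NA in any column, it can be useful to check how many rows the cleaning step costs. The sketch below uses a hypothetical toy tibble (all names and values are illustrative) and also shows the option of restricting `drop_na()` to selected columns:

```r
library(tidyr)
library(tibble)

# Hypothetical toy data, not taken from the real dataset
toy <- tibble(
  Country = c("A", "B", "C", "D"),
  GDP     = c(1, NA, 3, 4),
  CPI     = c(10, 20, NA, 40)
)

# Drop every row that has at least one NA in any column
toy_clean <- drop_na(toy)

# How many rows were lost, and is the result really complete?
rows_dropped <- nrow(toy) - nrow(toy_clean)
all_complete <- !anyNA(toy_clean)

# drop_na() can also be limited to chosen columns:
# only rows with a missing GDP are removed here
toy_gdp_only <- drop_na(toy, GDP)
```

Restricting the check to the columns an analysis actually uses keeps more rows, at the cost of leaving NA values in the other columns.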
3. The dataset contains several columns holding percentages and US dollar amounts that are stored as chr. Use the mutate and across functions to convert these columns to dbl. Show the result with the glimpse function!
character_set <- dataset_clean %>%
  mutate(
    # Strip the dollar sign and thousands separators, then parse as numeric
    Gasoline.Price = as.numeric(str_replace_all(`Gasoline Price`, "[$,]", "")),
    GDP = as.numeric(str_replace_all(GDP, "[$,]", "")),
    Minimum.wage = as.numeric(str_replace_all(`Minimum wage`, "[$,]", ""))
  ) %>%
  select(Gasoline.Price, GDP, Minimum.wage) %>%
  na.omit()
glimpse(character_set)
## Rows: 110
## Columns: 3
## $ Gasoline.Price <dbl> 0.70, 1.36, 0.28, 0.97, 1.10, 0.77, 0.93, 0.56, 1.12, 1…
## $ GDP <dbl> 1.910135e+10, 1.527808e+10, 1.699882e+11, 9.463542e+10,…
## $ Minimum.wage <dbl> 0.43, 1.12, 0.95, 0.71, 3.35, 0.66, 13.59, 0.47, 0.51, …
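The task statement asks for `mutate` combined with `across`. A minimal sketch of that pattern follows, on a hypothetical toy tibble whose column names mirror the dataset; the helper `to_number` is an illustrative name, not part of any package. It converts dollar and percentage columns in one call:

```r
library(dplyr)
library(stringr)
library(tibble)

# Hypothetical toy data; column names mirror the dataset, values are illustrative
toy <- tibble(
  Country             = c("A", "B"),
  `Gasoline Price`    = c("$0.70", "$1.36"),
  GDP                 = c("$19,101,353,833", "$15,278,077,447"),
  `Unemployment rate` = c("11.12%", "12.33%")
)

# Hypothetical helper: remove "$", "," and "%", then parse as double
to_number <- function(x) as.numeric(str_remove_all(x, "[$,%]"))

# Apply the same conversion to several columns at once
converted <- toy %>%
  mutate(across(c(`Gasoline Price`, GDP, `Unemployment rate`), to_number))

glimpse(converted)
```

Unlike the code above, this keeps the original column names and overwrites the columns in place; `across()` also accepts a `.names` argument if new columns are preferred.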
4. Convert the currency values from US dollars to Rupiah, stored as dbl!
# Exchange rate: Rupiah per US dollar
nilai_tukar <- 15324
dataset <- character_set %>%
  mutate(
    Gasoline.Price_Rupiah = Gasoline.Price * nilai_tukar,
    GDP_Rupiah = GDP * nilai_tukar
  )
dataset
## # A tibble: 110 × 5
##    Gasoline.Price           GDP Minimum.wage Gasoline.Price_Rupiah GDP_Rupiah
##             <dbl>         <dbl>        <dbl>                 <dbl>      <dbl>
##  1           0.7    19101353833         0.43                10727.    2.93e14
##  2           1.36   15278077447         1.12                20841.    2.34e14
##  3           0.28  169988236398         0.95                 4291.    2.60e15
##  4           0.97   94635415870         0.71                14864.    1.45e15
##  5           1.1   449663446954         3.35                16856.    6.89e15
##  6           0.77   13672802158         0.66                11799.    2.10e14
##  7           0.93 1392680589329        13.6                 14251.    2.13e16
##  8           0.56   39207000000         0.47                 8581.    6.01e14
##  9           1.12  302571254131         0.51                17163.    4.64e15
## 10           1.81    5209000000         3.13                27736.    7.98e13
## # ℹ 100 more rows
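The conversion above handles two of the three dollar columns. As a sketch, assuming that Minimum.wage should be converted as well (an assumption, since the output above leaves it in dollars), `across()` with the `.names` argument can create a `_Rupiah` column for each input column in one step. The toy values are the first two rows of the output above:

```r
library(dplyr)
library(tibble)

# Exchange rate used in the text above (Rupiah per US dollar)
nilai_tukar <- 15324

# Toy data: first two rows of the output shown above
toy <- tibble(
  Gasoline.Price = c(0.70, 1.36),
  GDP            = c(19101353833, 15278077447),
  Minimum.wage   = c(0.43, 1.12)
)

# Create one new <name>_Rupiah column per dollar column
toy_rupiah <- toy %>%
  mutate(across(
    c(Gasoline.Price, GDP, Minimum.wage),
    ~ .x * nilai_tukar,
    .names = "{.col}_Rupiah"
  ))

glimpse(toy_rupiah)
```

The original dollar columns stay untouched, so the result has six columns, all of type dbl.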
5. Create a country_status column containing the categories "rich", "developing", and "poor". The categories are assigned as follows: if Unemployment rate is less than 1%, the country is "rich"; if Unemployment rate is greater than 0.999999% and less than 5%, the country is "developing"; and if Unemployment rate is greater than 4.999999%, the country is "poor".
library(tidyverse)
data <- dataset_clean %>%
  mutate(
    # Strip the percent sign, then parse as numeric
    dc_umprate = as.numeric(str_replace_all(`Unemployment rate`, "\\%", ""))
  ) %>%
  select(Country, dc_umprate) %>%
  na.omit()
# Create the "country_status" column based on the unemployment rate
new_rate_dt <- data %>%
  mutate(
    country_status = case_when(
      dc_umprate < 1.0 ~ "rich",
      dc_umprate >= 1.0 & dc_umprate < 5.0 ~ "developing",
      dc_umprate >= 5.0 ~ "poor"
    )
  )
# Display the result
new_rate_dt
## # A tibble: 110 × 3
##    Country     dc_umprate country_status
##    <chr>            <dbl> <chr>
##  1 Afghanistan      11.1  poor
##  2 Albania          12.3  poor
##  3 Algeria          11.7  poor
##  4 Angola            6.89 poor
##  5 Argentina         9.79 poor
##  6 Armenia          17.0  poor
##  7 Australia         5.27 poor
##  8 Azerbaijan        5.51 poor
##  9 Bangladesh        4.19 developing
## 10 Barbados         10.3  poor
## # ℹ 100 more rows
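The thresholds in the task are written as 0.999999% and 4.999999%, which in practice means the boundaries 1% and 5% belong to the higher category. The sketch below checks this on hypothetical boundary values (the toy rows are illustrative) and compares `case_when()` with base R's `cut()` as an alternative way to express the same bins:

```r
library(dplyr)
library(tibble)

# Hypothetical boundary cases, not taken from the real dataset
toy <- tibble(
  Country    = c("A", "B", "C", "D", "E"),
  dc_umprate = c(0.5, 1.0, 4.99, 5.0, 11.12)
)

toy_status <- toy %>%
  mutate(
    # Conditions are checked in order, so the second one is only
    # reached when dc_umprate is at least 1
    country_status = case_when(
      dc_umprate < 1  ~ "rich",
      dc_umprate < 5  ~ "developing",
      dc_umprate >= 5 ~ "poor"
    ),
    # Same bins with cut(); right = FALSE gives intervals [lower, upper)
    status_cut = as.character(cut(
      dc_umprate,
      breaks = c(-Inf, 1, 5, Inf),
      labels = c("rich", "developing", "poor"),
      right  = FALSE
    ))
  )

print(toy_status)
```

Both columns agree on every row here; in either version a missing unemployment rate stays NA rather than being assigned a category.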
6. Compute the mean and maximum of GDP and CPI by country status! Then also count how many countries fall into each of the categories "rich", "developing", and "poor".
combined_data <- dataset_clean %>%
  mutate(
    dc_umprate = as.numeric(str_replace_all(`Unemployment rate`, "\\%", "")),
    # cpi.c holds CPI Change (%), which is what the CPI summaries below use
    cpi.c = as.numeric(str_replace_all(`CPI Change (%)`, "\\%", "")),
    GDP = as.numeric(str_replace_all(GDP, "[$,]", ""))
  ) %>%
  select(Country, dc_umprate, cpi.c, GDP) %>%
  # Join only the status column to avoid a duplicated dc_umprate column
  left_join(select(new_rate_dt, Country, country_status), by = "Country")
# Compute the mean and maximum of GDP and CPI by country_status
summary_data <- combined_data %>%
  group_by(country_status) %>%
  summarise(
    Avg_GDP = mean(GDP, na.rm = TRUE),
    Max_GDP = max(GDP, na.rm = TRUE),
    Avg_CPI = mean(cpi.c, na.rm = TRUE),
    Max_CPI = max(cpi.c, na.rm = TRUE),
    Count = n()
  )
# Display the results
print(summary_data)
## # A tibble: 3 × 6
##   country_status       Avg_GDP Max_GDP Avg_CPI Max_CPI Count
##   <chr>                  <dbl>   <dbl>   <dbl>   <dbl> <int>
## 1 developing     820523892191. 1.99e13    3.22    14.8    46
## 2 poor           647889907874. 2.14e13    6.24    53.5    61
## 3 rich           191583986805. 5.44e11    0.5      3.3     3
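When the same statistics are needed for several columns, `across()` with a named list of functions can replace the repeated `mean()` and `max()` calls. This is a sketch on hypothetical toy values (the column names follow the code above, the numbers are illustrative):

```r
library(dplyr)
library(tibble)

# Hypothetical toy data, not taken from the real dataset
toy <- tibble(
  country_status = c("poor", "poor", "developing", "rich"),
  GDP            = c(100, 300, 500, 50),
  cpi.c          = c(2, 6, 3, 0.5)
)

toy_summary <- toy %>%
  group_by(country_status) %>%
  summarise(
    # One output column per (function, column) pair, e.g. Avg_GDP
    across(
      c(GDP, cpi.c),
      list(
        Avg = ~ mean(.x, na.rm = TRUE),
        Max = ~ max(.x, na.rm = TRUE)
      ),
      .names = "{.fn}_{.col}"
    ),
    # Number of countries in each category
    Count = n()
  )

print(toy_summary)
```

The `.names` template controls the output names, so adding another statistic only needs one more entry in the list.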
7. Display the result of number 6 in long format!
summary_data_long <- summary_data %>%
  pivot_longer(
    cols = c(Avg_GDP, Max_GDP, Avg_CPI, Max_CPI),
    names_to = "Variable",
    values_to = "Value"
  )
# Display the result in long format
print(summary_data_long)
## # A tibble: 12 × 4
##    country_status Count Variable    Value
##    <chr>          <int> <chr>       <dbl>
##  1 developing        46 Avg_GDP  8.21e+11
##  2 developing        46 Max_GDP  1.99e+13
##  3 developing        46 Avg_CPI  3.22e+ 0
##  4 developing        46 Max_CPI  1.48e+ 1
##  5 poor              61 Avg_GDP  6.48e+11
##  6 poor              61 Max_GDP  2.14e+13
##  7 poor              61 Avg_CPI  6.24e+ 0
##  8 poor              61 Max_CPI  5.35e+ 1
##  9 rich               3 Avg_GDP  1.92e+11
## 10 rich               3 Max_GDP  5.44e+11
## 11 rich               3 Avg_CPI  5   e- 1
## 12 rich               3 Max_CPI  3.3 e+ 0
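Because the summary column names follow a `Statistic_Variable` pattern, `pivot_longer()` can also split them into two key columns through `names_sep`. The sketch below uses a hypothetical toy summary with the same column names (the values are illustrative, not the results above):

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical toy summary with the same column layout as summary_data
toy_summary <- tibble(
  country_status = c("developing", "poor"),
  Avg_GDP        = c(500, 200),
  Max_GDP        = c(500, 300),
  Avg_CPI        = c(3, 4),
  Max_CPI        = c(3, 6),
  Count          = c(1L, 2L)
)

# Split names such as "Avg_GDP" into Statistic = "Avg" and Variable = "GDP"
toy_long <- toy_summary %>%
  pivot_longer(
    cols      = -c(country_status, Count),
    names_to  = c("Statistic", "Variable"),
    names_sep = "_",
    values_to = "Value"
  )

print(toy_long)
```

Keeping the statistic and the variable in separate columns makes later filtering or plotting by either one simpler than parsing a combined label.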