Data Science Group Assignment

Loading Packages

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.2     ✔ purrr   1.0.1
## ✔ tibble  3.2.1     ✔ dplyr   1.1.2
## ✔ tidyr   1.3.0     ✔ stringr 1.5.0
## ✔ readr   2.1.3     ✔ forcats 0.5.2
## Warning: package 'ggplot2' was built under R version 4.2.3
## Warning: package 'tibble' was built under R version 4.2.3
## Warning: package 'tidyr' was built under R version 4.2.3
## Warning: package 'purrr' was built under R version 4.2.3
## Warning: package 'dplyr' was built under R version 4.2.3
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

Data

Global Country Information Dataset

The Global Country Information Dataset is a comprehensive collection of country-level data covering a wide range of indicators and attributes: demographic statistics, economic indicators, environmental factors, healthcare metrics, education statistics, and more. Because every country is represented, the dataset supports in-depth analysis and cross-country comparison.

The columns in the data are described below:

  1. Country: Name of the country.
  2. Density (P/Km2): Population density measured in persons per square kilometer.
  3. Abbreviation: Abbreviation or code representing the country.
  4. Agricultural Land (%): Percentage of land area used for agricultural purposes.
  5. Land Area (Km2): Total land area of the country in square kilometers.
  6. Armed Forces Size: Size of the armed forces in the country.
  7. Birth Rate: Number of births per 1,000 population per year.
  8. Calling Code: International calling code for the country.
  9. Capital/Major City: Name of the capital or major city.
  10. CO2 Emissions: Carbon dioxide emissions in tons.
  11. CPI: Consumer Price Index, a measure of inflation and purchasing power.
  12. CPI Change (%): Percentage change in the Consumer Price Index compared to the previous year.
  13. Currency_Code: Currency code used in the country.
  14. Fertility Rate: Average number of children born to a woman during her lifetime.
  15. Forested Area (%): Percentage of land area covered by forests.
  16. Gasoline_Price: Price of gasoline per liter in local currency.
  17. GDP: Gross Domestic Product, the total value of goods and services produced in the country.
  18. Gross Primary Education Enrollment (%): Gross enrollment ratio for primary education.
  19. Gross Tertiary Education Enrollment (%): Gross enrollment ratio for tertiary education.
  20. Infant Mortality: Number of deaths per 1,000 live births before reaching one year of age.
  21. Largest City: Name of the country’s largest city.
  22. Life Expectancy: Average number of years a newborn is expected to live.
  23. Maternal Mortality Ratio: Number of maternal deaths per 100,000 live births.
  24. Minimum Wage: Minimum wage level in local currency.
  25. Official Language: Official language(s) spoken in the country.
  26. Out of Pocket Health Expenditure (%): Percentage of total health expenditure paid out-of-pocket by individuals.
  27. Physicians per Thousand: Number of physicians per thousand people.
  28. Population: Total population of the country.
  29. Population: Labor Force Participation (%): Percentage of the population that is part of the labor force.
  30. Tax Revenue (%): Tax revenue as a percentage of GDP.
  31. Total Tax Rate: Overall tax burden as a percentage of commercial profits.
  32. Unemployment Rate: Percentage of the labor force that is unemployed.
  33. Urban Population: Percentage of the population living in urban areas.
  34. Latitude: Latitude coordinate of the country’s location.
  35. Longitude: Longitude coordinate of the country’s location.

Import Data

country_data <- read_csv("C:/Users/user/Downloads/archive(1)/world-data-2023.csv",show_col_types = FALSE)

GROUP ASSIGNMENT

1. In R, missing data is usually marked with NA. Count how many missing values the dataset contains!

missing_counts <- country_data %>%
  summarise(across(everything(), ~ sum(is.na(.))))

print(missing_counts)
## # A tibble: 1 × 35
##   Country `Density\n(P/Km2)` Abbreviation `Agricultural Land( %)`
##     <int>              <int>        <int>                   <int>
## 1       0                  0            7                       7
## # ℹ 31 more variables: `Land Area(Km2)` <int>, `Armed Forces size` <int>,
## #   `Birth Rate` <int>, `Calling Code` <int>, `Capital/Major City` <int>,
## #   `Co2-Emissions` <int>, CPI <int>, `CPI Change (%)` <int>,
## #   `Currency-Code` <int>, `Fertility Rate` <int>, `Forested Area (%)` <int>,
## #   `Gasoline Price` <int>, GDP <int>,
## #   `Gross primary education enrollment (%)` <int>,
## #   `Gross tertiary education enrollment (%)` <int>, …
total_missing <- sum(is.na(country_data))

cat("Total missing values in the dataset:", total_missing, "\n")
## Total missing values in the dataset: 337
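
The one-row tibble printed above truncates most of the 35 columns, so a long layout can be easier to scan. The following is a minimal sketch on a small made-up tibble; `toy` and its values are hypothetical and only stand in for `country_data`:

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data standing in for country_data
toy <- tibble(
  Country = c("A", "B", "C"),
  GDP     = c("$1,000", NA, "$3,000"),
  CPI     = c(NA, NA, "110.5")
)

# One row per column, sorted by the number of missing values
missing_long <- toy %>%
  summarise(across(everything(), ~ sum(is.na(.x)))) %>%
  pivot_longer(everything(), names_to = "column", values_to = "n_missing") %>%
  arrange(desc(n_missing))

print(missing_long)
```

Applying the same pipeline to `country_data` should list every column on its own row instead of hiding them behind "more variables".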

2. Remove all missing data and show that none remains!

dataset_clean <- country_data %>%
  drop_na()

# Check whether all missing data has been removed
if (sum(is.na(dataset_clean)) == 0) {
  cat("All missing data has been removed.\n")
} else {
  cat("The dataset still contains missing data.\n")
}
## All missing data has been removed.
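
Since `drop_na()` discards every row that has at least one NA, it is worth checking how many rows survive. A minimal sketch, assuming a made-up tibble `toy` (hypothetical values, not from the dataset):

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: rows B and C each contain one NA
toy <- tibble(
  Country = c("A", "B", "C", "D"),
  GDP     = c(10, NA, 30, 40),
  CPI     = c(1.5, 2.5, NA, 4.5)
)

toy_clean <- toy %>% drop_na()

# Compare row counts before and after dropping incomplete rows
rows_before <- nrow(toy)
rows_after  <- nrow(toy_clean)
cat("Rows before:", rows_before, "| rows after:", rows_after, "\n")

# anyNA() is a compact alternative to sum(is.na(...)) == 0
cat("Any NA left:", anyNA(toy_clean), "\n")
```

The same comparison with `nrow(country_data)` and `nrow(dataset_clean)` would show how much of the original data the later steps are working with.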

3. Several columns in the dataset hold percentages and US dollar amounts but are stored as chr. Use the mutate and across functions to convert these columns to dbl. Show the result with the glimpse function!

# Strip "$" and "," before converting; only the three dollar-valued columns are handled here
character_set <- dataset_clean %>%
  mutate(
    Gasoline.Price = as.numeric(str_replace_all(`Gasoline Price`, "[$,]", "")),
    GDP = as.numeric(str_replace_all(GDP, "[$,]", "")),
    Minimum.wage = as.numeric(str_replace_all(`Minimum wage`, "[$,]", ""))
  ) %>%
  select(Gasoline.Price, GDP, Minimum.wage) %>%
  na.omit()
glimpse(character_set)
## Rows: 110
## Columns: 3
## $ Gasoline.Price <dbl> 0.70, 1.36, 0.28, 0.97, 1.10, 0.77, 0.93, 0.56, 1.12, 1…
## $ GDP            <dbl> 1.910135e+10, 1.527808e+10, 1.699882e+11, 9.463542e+10,…
## $ Minimum.wage   <dbl> 0.43, 1.12, 0.95, 0.71, 3.35, 0.66, 13.59, 0.47, 0.51, …
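
The task asks for `mutate` together with `across`, which also makes it easy to cover the percentage columns. The sketch below shows one way this might look; the helper `to_number` and the tibble `toy` are hypothetical, and the values are made up for illustration:

```r
library(dplyr)
library(stringr)

# Hypothetical helper: strip "$", "%" and "," and convert to double
to_number <- function(x) as.numeric(str_remove_all(x, "[$%,]"))

# Hypothetical toy data using column names from the dataset
toy <- tibble(
  Country             = c("A", "B"),
  GDP                 = c("$1,234,567", "$89,000"),
  `Gasoline Price`    = c("$0.70", "$1.36"),
  `Unemployment rate` = c("4.5%", "12.0%")
)

# Apply the same conversion to every listed column in one step
toy_numeric <- toy %>%
  mutate(across(c(GDP, `Gasoline Price`, `Unemployment rate`), to_number))

glimpse(toy_numeric)
```

On the real data the column list inside `across()` would need to name every chr column that holds a dollar amount or a percentage.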

4. Convert the US dollar amounts to Indonesian Rupiah, stored as dbl!

# Exchange rate used here: rupiah per 1 US dollar
nilai_tukar <- 15324

dataset <- character_set %>%
  mutate(
    Gasoline.Price_Rupiah = Gasoline.Price * nilai_tukar,
    GDP_Rupiah = GDP * nilai_tukar
  )
dataset
## # A tibble: 110 × 5
##    Gasoline.Price           GDP Minimum.wage Gasoline.Price_Rupiah GDP_Rupiah
##             <dbl>         <dbl>        <dbl>                 <dbl>      <dbl>
##  1           0.7    19101353833         0.43                10727.    2.93e14
##  2           1.36   15278077447         1.12                20841.    2.34e14
##  3           0.28  169988236398         0.95                 4291.    2.60e15
##  4           0.97   94635415870         0.71                14864.    1.45e15
##  5           1.1   449663446954         3.35                16856.    6.89e15
##  6           0.77   13672802158         0.66                11799.    2.10e14
##  7           0.93 1392680589329        13.6                 14251.    2.13e16
##  8           0.56   39207000000         0.47                 8581.    6.01e14
##  9           1.12  302571254131         0.51                17163.    4.64e15
## 10           1.81    5209000000         3.13                27736.    7.98e13
## # ℹ 100 more rows
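
The result above converts the gasoline price and GDP but leaves `Minimum.wage` in dollars. One possible way to convert all three dollar columns at once is `across()` with a `.names` pattern. This is a sketch only: `toy` is a hypothetical tibble whose values are copied from the first two rows of the output above.

```r
library(dplyr)

nilai_tukar <- 15324  # same rate as in the text: rupiah per 1 US dollar

# Hypothetical toy data mirroring the first two rows shown above
toy <- tibble(
  Gasoline.Price = c(0.70, 1.36),
  GDP            = c(19101353833, 15278077447),
  Minimum.wage   = c(0.43, 1.12)
)

# Add a *_Rupiah column for each dollar column, keeping the originals
toy_rupiah <- toy %>%
  mutate(across(
    c(Gasoline.Price, GDP, Minimum.wage),
    ~ .x * nilai_tukar,
    .names = "{.col}_Rupiah"
  ))

print(toy_rupiah)
```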

5. Create a column country_status containing the categories "rich", "developing", and "poor". The categories are assigned as follows: if the Unemployment rate is below 1%, the country is "rich"; if the Unemployment rate is greater than 0.999999% and below 5%, the country is "developing"; and if the Unemployment rate is greater than 4.999999%, the country is "poor".

library(tidyverse)

data <- dataset_clean %>%
  mutate(
    dc_umprate = as.numeric(str_replace_all(`Unemployment rate`, "\\%", ""))
  ) %>%
  select(Country, dc_umprate) %>%
  na.omit()

# Create the "country_status" column from the unemployment rate
new_rate_dt <- data %>%
  mutate(
    country_status = case_when(
      dc_umprate < 1.0 ~ "rich",
      dc_umprate >= 1.0 & dc_umprate < 5.0 ~ "developing",
      dc_umprate >= 5.0 ~ "poor"
    )
  )

# Display the result
new_rate_dt
## # A tibble: 110 × 3
##    Country     dc_umprate country_status
##    <chr>            <dbl> <chr>         
##  1 Afghanistan      11.1  poor          
##  2 Albania          12.3  poor          
##  3 Algeria          11.7  poor          
##  4 Angola            6.89 poor          
##  5 Argentina         9.79 poor          
##  6 Armenia          17.0  poor          
##  7 Australia         5.27 poor          
##  8 Azerbaijan        5.51 poor          
##  9 Bangladesh        4.19 developing    
## 10 Barbados         10.3  poor          
## # ℹ 100 more rows
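
Because `case_when()` evaluates its conditions in order and uses the first one that is true, the lower bounds can be left implicit. The sketch below checks the boundary values on a hypothetical tibble `rates` (made-up numbers chosen to sit on the thresholds):

```r
library(dplyr)

# Hypothetical rates placed on and around the category boundaries
rates <- tibble(dc_umprate = c(0.5, 1.0, 4.99, 5.0, NA))

rates_status <- rates %>%
  mutate(country_status = case_when(
    dc_umprate < 1  ~ "rich",
    dc_umprate < 5  ~ "developing",  # reached only when the rate is >= 1
    dc_umprate >= 5 ~ "poor"         # an NA rate matches nothing and stays NA
  ))

print(rates_status)
```

A rate of exactly 1 falls into "developing" and a rate of exactly 5 into "poor", which matches the thresholds stated in the task.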

6. Compute the mean and maximum of GDP and CPI by country status! Also count how many countries fall into each of the categories "rich", "developing", and "poor".

# dc_umprate and country_status come from new_rate_dt, so only the CPI change and GDP are parsed here.
# Note: the "CPI" statistics below are computed from `CPI Change (%)`, not from the CPI column.
combined_data <- dataset_clean %>%
  mutate(
    cpi.c = as.numeric(str_replace_all(`CPI Change (%)`, "%", "")),
    GDP = as.numeric(str_replace_all(GDP, "[$,]", ""))
  ) %>%
  select(Country, cpi.c, GDP) %>%
  left_join(new_rate_dt, by = "Country")

# Compute the mean and maximum of GDP and CPI by country_status
summary_data <- combined_data %>%
  group_by(country_status) %>%
  summarise(
    Avg_GDP = mean(GDP, na.rm = TRUE),
    Max_GDP = max(GDP, na.rm = TRUE),
    Avg_CPI = mean(cpi.c, na.rm = TRUE),
    Max_CPI = max(cpi.c, na.rm = TRUE),
    Count = n()
  )

# Display the computed summary
print(summary_data)
## # A tibble: 3 × 6
##   country_status       Avg_GDP Max_GDP Avg_CPI Max_CPI Count
##   <chr>                  <dbl>   <dbl>   <dbl>   <dbl> <int>
## 1 developing     820523892191. 1.99e13    3.22    14.8    46
## 2 poor           647889907874. 2.14e13    6.24    53.5    61
## 3 rich           191583986805. 5.44e11    0.5      3.3     3
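
If only the number of countries per category is needed, `count()` gives it without a full `summarise()`. A minimal sketch on a hypothetical tibble `toy` (made-up countries and statuses):

```r
library(dplyr)

# Hypothetical toy data with one status per country
toy <- tibble(
  Country        = c("A", "B", "C", "D"),
  country_status = c("poor", "poor", "rich", "developing")
)

# count() groups by country_status and tallies the rows in each group
status_counts <- toy %>%
  count(country_status, name = "Count")

print(status_counts)
```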

7. Display the result of number 6 in long format!

summary_data_long <- summary_data %>%
  pivot_longer(
    cols = c(Avg_GDP, Max_GDP, Avg_CPI, Max_CPI),
    names_to = "Variable",
    values_to = "Value"
  )

# Display the result in long format
print(summary_data_long)
## # A tibble: 12 × 4
##    country_status Count Variable    Value
##    <chr>          <int> <chr>       <dbl>
##  1 developing        46 Avg_GDP  8.21e+11
##  2 developing        46 Max_GDP  1.99e+13
##  3 developing        46 Avg_CPI  3.22e+ 0
##  4 developing        46 Max_CPI  1.48e+ 1
##  5 poor              61 Avg_GDP  6.48e+11
##  6 poor              61 Max_GDP  2.14e+13
##  7 poor              61 Avg_CPI  6.24e+ 0
##  8 poor              61 Max_CPI  5.35e+ 1
##  9 rich               3 Avg_GDP  1.92e+11
## 10 rich               3 Max_GDP  5.44e+11
## 11 rich               3 Avg_CPI  5   e- 1
## 12 rich               3 Max_CPI  3.3 e+ 0
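
Names such as `Avg_GDP` combine a statistic and a variable, so `pivot_longer()` can optionally split them into two columns with `names_sep`. This is a sketch under stated assumptions: `toy_summary` is a hypothetical tibble with made-up values that only mimics the shape of `summary_data`.

```r
library(dplyr)
library(tidyr)

# Hypothetical summary table with the same column naming scheme
toy_summary <- tibble(
  country_status = c("developing", "poor"),
  Avg_GDP = c(100, 80),
  Max_GDP = c(500, 900),
  Avg_CPI = c(3, 6),
  Max_CPI = c(15, 50)
)

# Split each column name at "_" into a statistic and a variable
toy_long <- toy_summary %>%
  pivot_longer(
    cols = -country_status,
    names_to = c("Statistic", "Variable"),
    names_sep = "_",
    values_to = "Value"
  )

print(toy_long)
```

With the statistic and variable in separate columns, filtering for only the maxima or only the GDP rows becomes a simple `filter()` call.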