Case I: covid_data.csv
Intro: The status of different cases of COVID-19
Read the file
covid <- read.csv("https://raw.githubusercontent.com/Sugarcane-svg/R/main/R607/Projects/p2/p2_covid_data.csv")
head(covid, 8)## X sl_no country total_cases new_cases total_deaths new_deaths
## 1 0 NA North America 5359473 +6,261 214740 531
## 2 1 NA South America 3952480 +1,207 139781 88
## 3 2 NA Asia 4130442 +3,348 94073 49
## 4 3 NA Europe 2833461 +673 202526 3
## 5 4 NA Africa 895157 18908 NA
## 6 5 NA Oceania 18032 +717 213 13
## 7 6 NA 721 15 NA
## 8 7 NA World 17189766 +12,206 670256 684
## total_recovered active_cases serious_critical_cases total_cases_per_1m_pop
## 1 2747302 5870 2397431 26093
## 2 2692664 535 1120035 14274
## 3 3002028 4192 1034341 19740
## 4 1702083 15 928852 5119
## 5 544168 NA 332081 1120
## 6 12246 1000 5573 39
## 7 651 NA 55 4
## 8 10701142 11612 5818368 66389
## total_deaths_per_1m_pop total_tests total_tests_per_1m_pop population
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA NA NA
## 6 NA NA NA NA
## 7 NA NA NA NA
## 8 2205 86 NA NA
Tidy covid data
1. remove some rows
the top 8 rows are showing data grouped by continents, however, here, we care more about the cases in different countries. Therefore, I am going to remove the first eight rows.
covid <- covid[-c(0:8),]
head(covid)## X sl_no country total_cases new_cases total_deaths new_deaths
## 9 8 1 USA 4568037 153840 NA
## 10 9 2 Brazil 2555518 90188 NA
## 11 10 3 India 1584384 35003 NA
## 12 11 4 Russia 828990 13673 NA
## 13 12 5 South Africa 471123 7497 NA
## 14 13 6 Mexico 408449 +5,752 45361 485
## total_recovered active_cases serious_critical_cases total_cases_per_1m_pop
## 9 2245044 NA 2169153 18852
## 10 1787419 NA 677911 8318
## 11 1021611 NA 527770 8944
## 12 620333 NA 194984 2300
## 13 297967 NA 165659 539
## 14 267147 5690 95941 3922
## total_deaths_per_1m_pop total_tests total_tests_per_1m_pop population
## 9 13794 465 56684379 171171
## 10 12016 424 12601096 59250
## 11 1147 25 18190382 13171
## 12 5680 94 27569646 188911
## 13 7936 126 2873163 48398
## 14 3165 352 953514 7389
2. remove some columns
from the table, there are many columns that contains most of NA, therefore, we aren’t going to use these variables to do analyze. Otherwise, it will cause lost of skewness.
covid <- covid %>%
select("sl_no","country","total_cases","total_deaths", "total_recovered", "serious_critical_cases", "total_tests")
covid <- covid %>% drop_na()Perform analysis
lets take a overlook of the data, there are 7 variables and total 187 observations of cases in different countries.
glimpse(covid)## Rows: 187
## Columns: 7
## $ sl_no <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,…
## $ country <chr> "USA", "Brazil", "India", "Russia", "South Afr…
## $ total_cases <int> 4568037, 2555518, 1584384, 828990, 471123, 408…
## $ total_deaths <dbl> 153840, 90188, 35003, 13673, 7497, 45361, 1881…
## $ total_recovered <dbl> 2245044, 1787419, 1021611, 620333, 297967, 267…
## $ serious_critical_cases <int> 2169153, 677911, 527770, 194984, 165659, 95941…
## $ total_tests <dbl> 465, 424, 25, 94, 126, 352, 570, 485, 608, 677…
1. how is the death cases in the top 10 listed in sl_no?
from the ratio of deaths, UK seems have high death rate followed by Mexico
top_10 <- covid %>%
filter(sl_no <= 10)%>%
select(country, total_cases, total_deaths)%>%
mutate(ratio_of_deaths = total_deaths/total_cases)
ggplot(top_10, aes(x = country, y = ratio_of_deaths)) + geom_bar(stat = "identity") + labs(title = "the ratio of death in 10 countries")2. how does test cases contribute to total cases?
as we can see that the average test cases all over countries is roughly 94 cases, the average total cases is 91908 each country. If we assume that all test cases were positive and every country have average total cases, it means that there were only in average 0.1% of population contribute to total cases, in other words, no people would like to do the test. And there are only 47 out of 187 countries have the test cases above average.
# the average of test cases all over the countries
avg_test<-sum(covid$total_tests)/nrow(covid)
avg_test## [1] 93.6631
# the average of total cases all over the countries
avg_total<-sum(covid$total_cases)/nrow(covid)
avg_total## [1] 91907.93
# the average ratio of test case
avg_ratio<- avg_test / avg_total
avg_ratio## [1] 0.001019097
# number of countries whose test case is above average ratio
num_country <- covid %>%
filter(total_tests>avg_test) %>%
count(country) %>%
summarise(sum(n))
num_country## sum(n)
## 1 47
3. what is the ratio of test cases contribute to their own country of total cases?
here, each country remains its own ratio of total test cases, where Montserrat has the highest population who is willing to do a covid test. The test rate here is 18%. It is much higher than the one we calculate before.
ratio_test<- covid %>%
select(country) %>%
mutate(test_rate = covid$total_tests/covid$total_cases)
ggplot(ratio_test, aes(x=country, y = test_rate)) + geom_bar(stat = "identity")# top test rate
ratio_test %>% filter(test_rate == max(test_rate))## country test_rate
## 1 Montserrat 16.66667
# avg test rate contributed by each country
sum(ratio_test$test_rate)/187## [1] 0.1831692
Conclusion
to do a test is encouraged, in that way we can see better the pandemic situation, we can have a better prediction in the positive test population. Even though there are many deaths cases, the ratio of deaths is still relatively low. Don’t be scary.
Case II: vaccin_data.csv
Intro: covid vaccine injection situation
Read the file
vaccine <- read.csv("https://raw.githubusercontent.com/Sugarcane-svg/R/main/R607/Projects/p2/p2_vaccin_data.csv")Tidy and transfrom data
Overview of data: there are 15 variables and nearly 6k observations. the data is big.
glimpse(vaccine)## Rows: 5,972
## Columns: 15
## $ country <chr> "Albania", "Albania", "Albania", …
## $ iso_code <chr> "ALB", "ALB", "ALB", "ALB", "ALB"…
## $ date <chr> "2021-01-10", "2021-01-11", "2021…
## $ total_vaccinations <dbl> 0, NA, 128, 188, 266, 308, 369, 4…
## $ people_vaccinated <dbl> 0, NA, 128, 188, 266, 308, 369, 4…
## $ people_fully_vaccinated <dbl> NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ daily_vaccinations_raw <dbl> NA, NA, NA, 60, 78, 42, 61, 36, 4…
## $ daily_vaccinations <dbl> NA, 64, 64, 63, 66, 62, 62, 58, 5…
## $ total_vaccinations_per_hundred <dbl> 0.00, NA, 0.00, 0.01, 0.01, 0.01,…
## $ people_vaccinated_per_hundred <dbl> 0.00, NA, 0.00, 0.01, 0.01, 0.01,…
## $ people_fully_vaccinated_per_hundred <dbl> NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ daily_vaccinations_per_million <dbl> NA, 22, 22, 22, 23, 22, 22, 20, 1…
## $ vaccines <chr> "Pfizer/BioNTech", "Pfizer/BioNTe…
## $ source_name <chr> "Ministry of Health", "Ministry o…
## $ source_website <chr> "https://shendetesia.gov.al/covid…
take a look at the first 10 rows, we observe that country and iso code remain the same, but the date is increment by one. Therefore, we can compress data
head(vaccine, 10)## country iso_code date total_vaccinations people_vaccinated
## 1 Albania ALB 2021-01-10 0 0
## 2 Albania ALB 2021-01-11 NA NA
## 3 Albania ALB 2021-01-12 128 128
## 4 Albania ALB 2021-01-13 188 188
## 5 Albania ALB 2021-01-14 266 266
## 6 Albania ALB 2021-01-15 308 308
## 7 Albania ALB 2021-01-16 369 369
## 8 Albania ALB 2021-01-17 405 405
## 9 Albania ALB 2021-01-18 447 447
## 10 Albania ALB 2021-01-19 483 483
## people_fully_vaccinated daily_vaccinations_raw daily_vaccinations
## 1 NA NA NA
## 2 NA NA 64
## 3 NA NA 64
## 4 NA 60 63
## 5 NA 78 66
## 6 NA 42 62
## 7 NA 61 62
## 8 NA 36 58
## 9 NA 42 55
## 10 NA 36 51
## total_vaccinations_per_hundred people_vaccinated_per_hundred
## 1 0.00 0.00
## 2 NA NA
## 3 0.00 0.00
## 4 0.01 0.01
## 5 0.01 0.01
## 6 0.01 0.01
## 7 0.01 0.01
## 8 0.01 0.01
## 9 0.02 0.02
## 10 0.02 0.02
## people_fully_vaccinated_per_hundred daily_vaccinations_per_million
## 1 NA NA
## 2 NA 22
## 3 NA 22
## 4 NA 22
## 5 NA 23
## 6 NA 22
## 7 NA 22
## 8 NA 20
## 9 NA 19
## 10 NA 18
## vaccines source_name
## 1 Pfizer/BioNTech Ministry of Health
## 2 Pfizer/BioNTech Ministry of Health
## 3 Pfizer/BioNTech Ministry of Health
## 4 Pfizer/BioNTech Ministry of Health
## 5 Pfizer/BioNTech Ministry of Health
## 6 Pfizer/BioNTech Ministry of Health
## 7 Pfizer/BioNTech Ministry of Health
## 8 Pfizer/BioNTech Ministry of Health
## 9 Pfizer/BioNTech Ministry of Health
## 10 Pfizer/BioNTech Ministry of Health
## source_website
## 1 https://shendetesia.gov.al/covid19-ministria-e-shendetesise-702-te-vaksinuar-3035-testime-714-te-sheruar-631-raste-te-reja-dhe-17-humbje-jete-ne-24-oret-e-fundit/
## 2 https://shendetesia.gov.al/covid19-ministria-e-shendetesise-702-te-vaksinuar-3035-testime-714-te-sheruar-631-raste-te-reja-dhe-17-humbje-jete-ne-24-oret-e-fundit/
## 3 https://shendetesia.gov.al/covid19-ministria-e-shendetesise-702-te-vaksinuar-3035-testime-714-te-sheruar-631-raste-te-reja-dhe-17-humbje-jete-ne-24-oret-e-fundit/
## 4 https://shendetesia.gov.al/covid19-ministria-e-shendetesise-702-te-vaksinuar-3035-testime-714-te-sheruar-631-raste-te-reja-dhe-17-humbje-jete-ne-24-oret-e-fundit/
## 5 https://shendetesia.gov.al/covid19-ministria-e-shendetesise-702-te-vaksinuar-3035-testime-714-te-sheruar-631-raste-te-reja-dhe-17-humbje-jete-ne-24-oret-e-fundit/
## 6 https://shendetesia.gov.al/covid19-ministria-e-shendetesise-702-te-vaksinuar-3035-testime-714-te-sheruar-631-raste-te-reja-dhe-17-humbje-jete-ne-24-oret-e-fundit/
## 7 https://shendetesia.gov.al/covid19-ministria-e-shendetesise-702-te-vaksinuar-3035-testime-714-te-sheruar-631-raste-te-reja-dhe-17-humbje-jete-ne-24-oret-e-fundit/
## 8 https://shendetesia.gov.al/covid19-ministria-e-shendetesise-702-te-vaksinuar-3035-testime-714-te-sheruar-631-raste-te-reja-dhe-17-humbje-jete-ne-24-oret-e-fundit/
## 9 https://shendetesia.gov.al/covid19-ministria-e-shendetesise-702-te-vaksinuar-3035-testime-714-te-sheruar-631-raste-te-reja-dhe-17-humbje-jete-ne-24-oret-e-fundit/
## 10 https://shendetesia.gov.al/covid19-ministria-e-shendetesise-702-te-vaksinuar-3035-testime-714-te-sheruar-631-raste-te-reja-dhe-17-humbje-jete-ne-24-oret-e-fundit/
1. drop some columns
There are many variables which cannot perform analysis later or are not so useful. For these variables, we are going to drop them to make data smaller. For example, [total_vaccinations_per_hundred], [people_vaccinated_per_hundred], [people_fully_vaccinated_per_hundred], [daily_vaccinations_per_million], [source_name], [source_website], [iso_code], [daily_vaccinations_raw], [vaccines]. After dropping columns, there are still 6k observations
vaccine <- vaccine %>%
select(-c(total_vaccinations_per_hundred,
people_vaccinated_per_hundred,
people_fully_vaccinated_per_hundred,
daily_vaccinations_per_million,
source_name,
source_website,
iso_code,daily_vaccinations_raw,
vaccines))
#columns left
colnames(vaccine)## [1] "country" "date"
## [3] "total_vaccinations" "people_vaccinated"
## [5] "people_fully_vaccinated" "daily_vaccinations"
2. group data by its country
take a close look at variable date, all observations for each country starts from different date and ends differently. In case we don’t miss any observation which would affect the number of vaccines, instead of dropping NA value, we group by them by its country.
approaches
- split data into tables with primary key(country) and foreign key
# num of days
days <- vaccine %>%
group_by(country)%>%
count(date) %>%
summarize(num_of_days = sum(n))
# num of total vaccination
total_vaccine<-vaccine%>%
group_by(country)%>%
summarise(total_vaccine=max(total_vaccinations, na.rm = TRUE))
# num of people who vaccinated
people_vaccinated <- vaccine%>%
group_by(country)%>%
summarise(people_vaccinated=
if(max(people_vaccinated, na.rm = TRUE)==-Inf) 0 else max(people_vaccinated, na.rm = TRUE))
# num of people who got fully vaccinated
people_full_vacc<-vaccine%>%
group_by(country)%>%
summarise(people_full_vacc=
if(max(people_fully_vaccinated, na.rm = TRUE)==-Inf) 0 else max(people_fully_vaccinated, na.rm = TRUE))- join these tables by primary key(country)
df <- days
df<-left_join(df, total_vaccine, by = "country")
df<-left_join(df, people_vaccinated, by = "country")
df<-left_join(df, people_full_vacc, by = "country")
head(df)## # A tibble: 6 x 5
## country num_of_days total_vaccine people_vaccinated people_full_vacc
## <chr> <int> <dbl> <dbl> <dbl>
## 1 Albania 60 21613 6073 655
## 2 Algeria 22 75000 0 0
## 3 Andorra 45 4914 3650 1264
## 4 Angola 4 6169 6169 0
## 5 Anguilla 23 3929 3929 0
## 6 Argentina 73 1919074 1564555 354519
Perform analysis
1. How many vaccines do each country get each day? and which country gets the most and the least vaccinations each day?
from avg_vacc table, we can see the average vaccination situation of each country. Meanwhile, the most vaccinated country is the United States, and the least vaccinated country is Montserrat
avg_vacc <- df %>%
select(country, num_of_days, total_vaccine)%>%
mutate(avg_vacc = round(total_vaccine/num_of_days,0))
avg_vacc## # A tibble: 133 x 4
## country num_of_days total_vaccine avg_vacc
## <chr> <int> <dbl> <dbl>
## 1 Albania 60 21613 360
## 2 Algeria 22 75000 3409
## 3 Andorra 45 4914 109
## 4 Angola 4 6169 1542
## 5 Anguilla 23 3929 171
## 6 Argentina 73 1919074 26289
## 7 Australia 25 125000 5000
## 8 Austria 75 929497 12393
## 9 Azerbaijan 54 436849 8090
## 10 Bahrain 79 520429 6588
## # … with 123 more rows
# the most vaccinated country
avg_vacc %>% filter(avg_vacc == max(avg_vacc))## # A tibble: 1 x 4
## country num_of_days total_vaccine avg_vacc
## <chr> <int> <dbl> <dbl>
## 1 United States 82 98203893 1197608
# the least vaccinated country
avg_vacc %>% filter(avg_vacc == min(avg_vacc))## # A tibble: 1 x 4
## country num_of_days total_vaccine avg_vacc
## <chr> <int> <dbl> <dbl>
## 1 Montserrat 16 652 41
2. What is the future demand for vaccines in the top five countries in terms of total vaccine doses?
approches
- the top 5 countries are United States, China, India, United Kingdom, England.
df %>%
select(country, total_vaccine)%>%
arrange(desc(total_vaccine))## # A tibble: 133 x 2
## country total_vaccine
## <chr> <dbl>
## 1 United States 98203893
## 2 China 52520000
## 3 India 26164920
## 4 United Kingdom 24405231
## 5 England 20526312
## 6 Brazil 10740859
## 7 Turkey 10502494
## 8 Israel 9179072
## 9 Germany 8716654
## 10 Russia 7273933
## # … with 123 more rows
- select these 5 countries in vaccine table and create a new data set
the graph of China, England and United Kingdom show that the demand towards stability, however, the graph of India and United States seem still upward trend. So I assume that the demand for vaccines will continue to increase in these two countries
top_5 <- vaccine %>%
select(country,date,daily_vaccinations) %>%
filter(country == "United States" |
country =="China" |
country == "India" |
country == "United Kingdom" |
country =="England")
ggplot(top_5, aes(x = date, y = daily_vaccinations)) + geom_histogram(stat = "identity")+ facet_wrap(~ country)## Warning: Ignoring unknown parameters: binwidth, bins, pad
## Warning: Removed 5 rows containing missing values (position_stack).
Case III: gdp_data.csv
Intro: The GDP of top 10 countries in the past ten years
Read file
gdp <- read.csv("https://raw.githubusercontent.com/Sugarcane-svg/R/main/R607/Projects/p2/p2_gdp_data.csv")Tidy and transform data
1. summary of data
compare to previous data set, this data has much more variables and much less observations.
glimpse(gdp)## Rows: 264
## Columns: 64
## $ Country.Name <chr> "Aruba", "Afghanistan", "Angola", "Albania", "Andorra"…
## $ Country.Code <chr> "ABW", "AFG", "AGO", "ALB", "AND", "ARB", "ARE", "ARG"…
## $ Indicator.Name <chr> "GDP growth (annual %)", "GDP growth (annual %)", "GDP…
## $ Indicator.Code <chr> "NY.GDP.MKTP.KD.ZG", "NY.GDP.MKTP.KD.ZG", "NY.GDP.MKTP…
## $ X1960 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ X1961 <dbl> NA, NA, NA, NA, NA, NA, NA, 5.4, NA, NA, NA, 2.5, 5.5,…
## $ X1962 <dbl> NA, NA, NA, NA, NA, NA, NA, -0.9, NA, NA, NA, 1.3, 2.6…
## $ X1963 <dbl> NA, NA, NA, NA, NA, NA, NA, -5.3, NA, NA, NA, 6.2, 4.1…
## $ X1964 <dbl> NA, NA, NA, NA, NA, NA, NA, 10.1, NA, NA, NA, 7.0, 6.1…
## $ X1965 <dbl> NA, NA, NA, NA, NA, NA, NA, 10.6, NA, NA, NA, 6.0, 3.5…
## $ X1966 <dbl> NA, NA, NA, NA, NA, NA, NA, -0.7, NA, NA, NA, 2.4, 5.6…
## $ X1967 <dbl> NA, NA, NA, NA, NA, NA, NA, 3.2, NA, NA, NA, 6.3, 3.0,…
## $ X1968 <dbl> NA, NA, NA, NA, NA, NA, NA, 4.8, NA, NA, NA, 5.1, 4.5,…
## $ X1969 <dbl> NA, NA, NA, NA, NA, NA, NA, 9.7, NA, NA, NA, 7.0, 6.3,…
## $ X1970 <dbl> NA, NA, NA, NA, NA, NA, NA, 3.0, NA, NA, NA, 7.2, 6.3,…
## $ X1971 <dbl> NA, NA, NA, NA, 4.6, NA, NA, 5.7, NA, NA, NA, 4.0, 5.1…
## $ X1972 <dbl> NA, NA, NA, NA, 8.1, NA, NA, 1.6, NA, NA, NA, 3.9, 6.2…
## $ X1973 <dbl> NA, NA, NA, NA, 7.8, NA, NA, 2.8, NA, NA, NA, 2.6, 4.9…
## $ X1974 <dbl> NA, NA, NA, NA, 5.6, NA, NA, 5.5, NA, NA, NA, 4.1, 3.9…
## $ X1975 <dbl> NA, NA, NA, NA, 0.5, NA, NA, 0.0, NA, NA, NA, 1.3, -0.…
## $ X1976 <dbl> NA, NA, NA, NA, 3.3, 15.7, 16.5, -2.0, NA, NA, NA, 2.6…
## $ X1977 <dbl> NA, NA, NA, NA, 2.8, 8.2, 21.4, 6.9, NA, NA, NA, 3.6, …
## $ X1978 <dbl> NA, NA, NA, NA, 1.5, -0.7, -1.6, -4.5, NA, NA, 4.4, 0.…
## $ X1979 <dbl> NA, NA, NA, NA, 0.0, 11.7, 20.9, 10.2, NA, NA, 8.1, 4.…
## $ X1980 <dbl> NA, NA, NA, NA, 2.2, 9.1, 23.9, 1.5, NA, NA, 8.2, 3.0,…
## $ X1981 <dbl> NA, NA, -4.4, 5.7, -0.1, 3.0, 4.7, -5.2, NA, NA, 3.8, …
## $ X1982 <dbl> NA, NA, 0.0, 2.9, 1.2, -9.1, -6.7, -0.7, NA, NA, -0.1,…
## $ X1983 <dbl> NA, NA, 4.2, 1.1, 1.8, -6.9, -4.7, 4.3, NA, NA, 5.4, -…
## $ X1984 <dbl> NA, NA, 6.0, -1.3, 1.8, 1.3, 4.0, 1.6, NA, NA, 10.2, 4…
## $ X1985 <dbl> NA, NA, 3.5, 1.8, 2.3, -2.4, -3.6, -5.2, NA, NA, 7.6, …
## $ X1986 <dbl> NA, NA, 2.9, 5.6, 3.3, 4.7, -15.0, 6.2, NA, NA, 11.5, …
## $ X1987 <dbl> 16.1, NA, 4.1, -0.8, 5.5, -0.5, 3.4, 2.7, NA, NA, 6.6,…
## $ X1988 <dbl> 18.6, NA, 6.1, -1.4, 5.1, 5.7, -2.6, -1.1, NA, NA, 5.2…
## $ X1989 <dbl> 12.1, NA, 0.0, 9.8, 4.8, 2.4, 12.3, -7.2, NA, NA, 5.3,…
## $ X1990 <dbl> 4.0, NA, -3.5, -9.6, 3.8, 13.1, 18.3, -2.5, NA, NA, 3.…
## $ X1991 <dbl> 8.0, NA, 1.0, -28.0, 2.5, 1.5, 0.9, 9.1, -11.7, NA, 2.…
## $ X1992 <dbl> 5.9, NA, -5.8, -7.2, 0.9, 4.9, 3.3, 7.9, -41.8, NA, 1.…
## $ X1993 <dbl> 7.3, NA, -24.0, 9.6, -1.0, 3.3, 1.3, 8.2, -8.8, NA, 5.…
## $ X1994 <dbl> 8.2, NA, 1.3, 8.3, 2.4, 3.2, 6.9, 5.8, 5.4, NA, 6.7, 4…
## $ X1995 <dbl> 2.5, NA, 15.0, 13.3, 2.8, 2.7, 6.7, -2.8, 6.9, NA, -4.…
## $ X1996 <dbl> 1.2, NA, 13.5, 9.1, 4.6, 4.6, 5.8, 5.5, 5.9, NA, 6.6, …
## $ X1997 <dbl> 7.0, NA, 7.3, -10.9, 9.1, 4.2, 8.2, 8.1, 3.3, NA, 5.5,…
## $ X1998 <dbl> 2.0, NA, 4.7, 8.8, 3.2, 5.3, 0.3, 3.9, 7.3, NA, 4.7, 4…
## $ X1999 <dbl> 1.2, NA, 2.2, 12.9, 4.1, 1.8, 2.9, -3.4, 3.3, NA, 3.7,…
## $ X2000 <dbl> 7.6, NA, 3.1, 7.0, 3.5, 5.5, 10.9, -0.8, 5.9, NA, 6.2,…
## $ X2001 <dbl> -3.0, NA, 4.2, 8.3, 8.1, 1.6, 1.4, -4.4, 9.6, NA, -4.5…
## $ X2002 <dbl> -3.3, NA, 13.7, 4.5, 4.5, 0.6, 2.4, -10.9, 13.2, NA, 1…
## $ X2003 <dbl> 2.0, 8.8, 3.0, 5.5, 8.7, 5.3, 8.8, 8.8, 14.0, 0.8, 6.1…
## $ X2004 <dbl> 7.9, 1.4, 11.0, 5.5, 8.1, 9.4, 9.6, 9.0, 10.5, 0.5, 5.…
## $ X2005 <dbl> 1.2, 11.2, 15.0, 5.5, 5.4, 5.7, 4.9, 8.9, 13.9, -0.4, …
## $ X2006 <dbl> 1.1, 5.4, 11.5, 5.9, 4.8, 6.5, 9.8, 8.0, 13.2, -4.2, 1…
## $ X2007 <dbl> 1.8, 13.8, 14.0, 6.0, 1.6, 4.5, 3.2, 9.0, 13.7, 2.0, 9…
## $ X2008 <dbl> -0.1, 3.9, 11.2, 7.5, -5.6, 5.8, 3.2, 4.1, 6.9, -2.6, …
## $ X2009 <dbl> -10.5, 21.4, 0.9, 3.4, -5.3, 0.4, -5.2, -5.9, -14.1, -…
## $ X2010 <dbl> -3.7, 14.4, 4.4, 3.7, -2.0, 4.7, 1.6, 10.1, 2.2, 0.4, …
## $ X2011 <dbl> 3.4, 0.4, 3.5, 2.5, 0.0, 3.7, 6.9, 6.0, 4.7, 0.3, -2.0…
## $ X2012 <dbl> -1.4, 12.8, 8.5, 1.4, -5.0, 6.7, 4.5, -1.0, 7.2, -4.4,…
## $ X2013 <dbl> 4.2, 5.6, 5.0, 1.0, -3.5, 3.3, 5.1, 2.4, 3.3, -2.8, -0…
## $ X2014 <dbl> 0.3, 2.7, 4.8, 1.8, 2.5, 2.5, 4.3, -2.5, 3.6, 0.9, 3.8…
## $ X2015 <dbl> 5.7, 1.5, 0.9, 2.2, 1.4, 3.2, 5.1, 2.7, 3.2, 1.4, 3.8,…
## $ X2016 <dbl> 2.1, 2.3, -2.6, 3.3, 3.7, 3.5, 3.1, -2.1, 0.2, -2.8, 5…
## $ X2017 <dbl> 2.0, 2.6, -0.1, 3.8, 0.3, 0.8, 2.4, 2.8, 7.5, -5.9, 3.…
## $ X2018 <dbl> NA, 1.2, -2.0, 4.1, 1.6, 2.1, 1.2, -2.6, 5.2, 2.2, 7.4…
## $ X2019 <dbl> NA, 3.9, -0.6, 2.2, 1.8, 1.5, 1.7, -2.1, 7.6, NA, 4.7,…
2. drop columns
we are going to see the status of GDP in the past ten years. So, we need to drop some unnecessary columns. For example, [Country.Code], I am not familiar with country code :(. [Indicator.Name] & [Indicator.Code], no calculated values. [X1960 : X2009], beyond the analysis.
gdp <- gdp %>% select(-c(2:54))
colnames(gdp)## [1] "Country.Name" "X2010" "X2011" "X2012" "X2013"
## [6] "X2014" "X2015" "X2016" "X2017" "X2018"
## [11] "X2019"
3. drop rows
there are some countries that don’t have data of gdp in the past ten years. We only care about the countries with gdp growth, so it no need to keep these countries in side the data set.
gdp <- gdp %>% drop_na()
head(gdp)## Country.Name X2010 X2011 X2012 X2013 X2014 X2015 X2016 X2017 X2018
## 1 Afghanistan 14.4 0.4 12.8 5.6 2.7 1.5 2.3 2.6 1.2
## 2 Angola 4.4 3.5 8.5 5.0 4.8 0.9 -2.6 -0.1 -2.0
## 3 Albania 3.7 2.5 1.4 1.0 1.8 2.2 3.3 3.8 4.1
## 4 Andorra -2.0 0.0 -5.0 -3.5 2.5 1.4 3.7 0.3 1.6
## 5 Arab World 4.7 3.7 6.7 3.3 2.5 3.2 3.5 0.8 2.1
## 6 United Arab Emirates 1.6 6.9 4.5 5.1 4.3 5.1 3.1 2.4 1.2
## X2019
## 1 3.9
## 2 -0.6
## 3 2.2
## 4 1.8
## 5 1.5
## 6 1.7
4. transfrom data
Even through the data set looks cleaner than before, it is still hard for us to plot or analyze it, as a result, I am going to transform data into [country name]: character [years]: numeric [gdp]: numeric
df.long <- pivot_longer(gdp, cols=2:11, names_to = "years", values_to = "gdp")
head(df.long)## # A tibble: 6 x 3
## Country.Name years gdp
## <chr> <chr> <dbl>
## 1 Afghanistan X2010 14.4
## 2 Afghanistan X2011 0.4
## 3 Afghanistan X2012 12.8
## 4 Afghanistan X2013 5.6
## 5 Afghanistan X2014 2.7
## 6 Afghanistan X2015 1.5
Perform analysis
1. what is the mean and median of each countries’ GDP?
summary_df<-df.long %>%
group_by(Country.Name)%>%
summarize(mean = mean(gdp), median = median(gdp))
head(summary_df)## # A tibble: 6 x 3
## Country.Name mean median
## <chr> <dbl> <dbl>
## 1 Afghanistan 4.74 2.65
## 2 Albania 2.6 2.35
## 3 Algeria 2.67 3.05
## 4 Andorra 0.08 0.85
## 5 Angola 2.18 2.2
## 6 Antigua and Barbuda 2.13 3.6
2. how many countries have negative gdp growth?
summary_df %>%
filter(mean<0)%>%
count(Country.Name)%>%
summarize(number_of_country_have_negative_GDP = sum(n))## # A tibble: 1 x 1
## number_of_country_have_negative_GDP
## <int>
## 1 10
3. which country has highest GDP growth in 2019?
that surprise me, the country which has the highest GDP growth in 2019 is Timor-Leste. the gdp of this country in 2018 is -1.1, and gdp of 2019 is 18.7.
growth18 <- df.long%>%
filter(years == "X2018")%>%
select(Country.Name, gdp)
growth19 <-df.long%>%
filter(years == "X2019")%>%
select(Country.Name, gdp)
highest19<-left_join(growth18, growth19, by = "Country.Name")
highest19 <- highest19 %>%
mutate(growth = (gdp.y - gdp.x))
highest19 %>% filter(growth == max(growth))## # A tibble: 1 x 4
## Country.Name gdp.x gdp.y growth
## <chr> <dbl> <dbl> <dbl>
## 1 Timor-Leste -1.1 18.7 19.8
let’s take la close look at the graph, Timor-Leste’s GDP had been going down from 2010 to 2018. However, before 2016, the growth is showing positive which means that it still make some profits. from 2016 to 2018, the GDP broke the freezing point. That must be something serious happpen.
tl<-df.long%>%filter(Country.Name == "Timor-Leste")
ggplot(tl, aes(x = years, y = gdp)) + geom_bar(stat = "identity")4. what about countries with GDP growth like Timor-Leste?
Countries that are not major economies have increased their GDP by at least 1 percentage point between 2018 and 2019, surprisingly, basically no countries in the global economy have appeared.
greater1<-highest19 %>% filter(growth >1)
ggplot(greater1, aes(x = Country.Name, y = growth, fill = Country.Name)) + geom_bar(stat = "identity") + coord_flip()5. What about countries like United States, China, Japan, Germany and France? (just list some countries’ names on the top of my head now)
The graph of China has similar trends to those mentioned above, the percentage of GDP has declined slightly since 2018. The plot of the United States is very close to uniform distribution, that the economy of this country is very stable.
listed_countries <- df.long %>%
filter(Country.Name == "United States" |
Country.Name == "China" |
Country.Name == "Japan" |
Country.Name == "Germany" |
Country.Name == "France")
ggplot(listed_countries, aes(x = years, y = gdp)) + geom_bar(stat = "identity") + facet_wrap(~Country.Name)