library(readxl)
library(tidyr)
library(dplyr)
Attaching package: ‘dplyr’
The following objects are masked from ‘package:stats’:
filter, lag
The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union
library(knitr)
library(stringr)
library(outliers)
This assignment focused on the two datasets comes from World Bank Organization. In these data sets,it provides key health, nutrition and population statistics gathered from a variety of international and national sources. Themes include global surgery, health financing, HIV/AIDS, immunization, infectious diseases, medical resources and usage, noncommunicable diseases, nutrition, population dynamics, reproductive health, universal health coverage, and water and sanitation. It covers all the countries and regions in the earth and the data starts from year 1960 to 2018. So there are lots of information and data in these data base.
In terms of this assignment requirement, I only investigate the population cross the world in different age group and gender. At first, I load the excel files into dataframe by the related functions, print out the dataframe structures, dimensions and other information. Then, I quickly go through all the variables, observations cross the datasets, to grab some ideas to understand the relationships between datasets and have a look at it if the data is tidy or not. After that, I use the tools and libraries I learned in class to tidy up these data. After that, I joined the two datasets into one by a common variable, subset a small dataset from the large one as the original dataset are huge and covered lots of areas related to human.
In the last part of this assignment, after tidying up the data, I scanned the data to find some missing values, outlier, then I deal with these exceptions in the dataset. As required by the assignment, I did an appropriate transformation on a variable, print out the graphic as well.
The dataset links in this assignment is: https://datacatalog.worldbank.org/dataset/health-nutrition-and-population-statistics
As there are several excel files in the dataset, I picked up two of them to finish my assignment. First, loading the excels into dataframe and print out some structure information.
#Load Data.xlsx into Data
Data <- read_excel("HNP/Data.xlsx")
-
\
|
/
-
\
|
/
#Print out the first five rows
head(Data)
#Load Series.xlsx into Series
Series <- read_excel("HNP/Series.xlsx")
-
/
#Print out the first five rows
head(Series)
Country Name: the country name Country Code: the code of a country, usually it’s the abbrevation of the country name. Indicator Name: a brief inforamtion about the statistic data. Indicator Code: the code of the indicator name. Year: stands for when the data comes from Series Code: a code which is the same as Indicator Code, stands for the category of the statistic data. Topic: a group name for each data Periodicity: The frequency of the data updated Aggregation method: how the data aggregation.
By using the function “str” to inspect the data variables.
From the output, I can see that there are 82 variables and 20 variables in Data and Series. The most of types of these variables are character and number, but apparently, some of them shoud be factor,
#Print out the structures
str(Data)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 104377 obs. of 63 variables:
$ Country Name : chr "Arab World" "Arab World" "Arab World" "Arab World" ...
$ Country Code : chr "ARB" "ARB" "ARB" "ARB" ...
$ Indicator Name: chr "Adolescent fertility rate (births per 1,000 women ages 15-19)" "Adults (ages 15+) and children (0-14 years) living with HIV" "Adults (ages 15+) and children (ages 0-14) newly infected with HIV" "Adults (ages 15+) living with HIV" ...
$ Indicator Code: chr "SP.ADO.TFRT" "SH.HIV.TOTL" "SH.HIV.INCD.TL" "SH.DYN.AIDS" ...
$ 1960 : num 135 NA NA NA NA ...
$ 1961 : num 135 NA NA NA NA ...
$ 1962 : num 136 NA NA NA NA ...
$ 1963 : num 136 NA NA NA NA ...
$ 1964 : num 136 NA NA NA NA ...
$ 1965 : num 135 NA NA NA NA ...
$ 1966 : num 135 NA NA NA NA ...
$ 1967 : num 134 NA NA NA NA ...
$ 1968 : num 131 NA NA NA NA ...
$ 1969 : num 129 NA NA NA NA ...
$ 1970 : num 127 NA NA NA NA ...
$ 1971 : num 124 NA NA NA NA ...
$ 1972 : num 122 NA NA NA NA ...
$ 1973 : num 120 NA NA NA NA ...
$ 1974 : num 118 NA NA NA NA ...
$ 1975 : num 115 NA NA NA NA ...
$ 1976 : num 113 NA NA NA NA ...
$ 1977 : num 110 NA NA NA NA ...
$ 1978 : num 107 NA NA NA NA ...
$ 1979 : num 103 NA NA NA NA ...
$ 1980 : num 99 NA NA NA NA ...
$ 1981 : num 95.4 NA NA NA NA ...
$ 1982 : num 91.8 NA NA NA NA ...
$ 1983 : num 88.9 NA NA NA NA ...
$ 1984 : num 85.9 NA NA NA NA ...
$ 1985 : num 83 NA NA NA NA ...
$ 1986 : num 79.9 NA NA NA NA ...
$ 1987 : num 76.9 NA NA NA NA ...
$ 1988 : num 74.8 NA NA NA NA ...
$ 1989 : num 72.8 NA NA NA NA ...
$ 1990 : num 71.1 NA NA NA NA ...
$ 1991 : num 69.1 NA NA NA NA ...
$ 1992 : num 67.3 NA NA NA NA ...
$ 1993 : num 65.3 NA NA NA NA ...
$ 1994 : num 63.4 NA NA NA NA ...
$ 1995 : num 61.2 NA NA NA NA ...
$ 1996 : num 59.3 NA NA NA NA ...
$ 1997 : num 57.3 NA NA NA NA ...
$ 1998 : num 56.3 NA NA NA NA ...
$ 1999 : num 55.3 NA NA NA NA ...
$ 2000 : num 54.3 NA NA NA NA ...
$ 2001 : num 53.3 NA NA NA NA ...
$ 2002 : num 52.3 NA NA NA NA ...
$ 2003 : num 51.9 NA NA NA NA ...
$ 2004 : num 51.5 NA NA NA NA ...
$ 2005 : num 51.2 NA NA NA NA ...
$ 2006 : num 50.8 NA NA NA NA ...
$ 2007 : num 50.5 NA NA NA NA ...
$ 2008 : num 50.3 NA NA NA NA ...
$ 2009 : num 50.1 NA NA NA NA ...
$ 2010 : num 50 NA NA NA NA ...
$ 2011 : num 49.9 NA NA NA NA ...
$ 2012 : num 49.8 NA NA NA NA ...
$ 2013 : num 49.3 NA NA NA NA ...
$ 2014 : num 48.9 NA NA NA NA ...
$ 2015 : num 48.3 NA NA NA NA ...
$ 2016 : num 47.5 NA NA NA NA ...
$ 2017 : num 46.7 NA NA NA NA ...
$ 2018 : num NA NA NA NA NA ...
#Print out structure inforamtion
str(Series)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 403 obs. of 20 variables:
$ Series Code : chr "HD.HCI.OVRL" "HD.HCI.OVRL.FE" "HD.HCI.OVRL.LB" "HD.HCI.OVRL.LB.FE" ...
$ Topic : chr "Public Sector: Policy & institutions" "Public Sector: Policy & institutions" "Public Sector: Policy & institutions" "Public Sector: Policy & institutions" ...
$ Indicator Name : chr "Human capital index (HCI) (scale 0-1)" "Human capital index (HCI), female (scale 0-1)" "Human capital index (HCI), lower bound (scale 0-1)" "Human capital index (HCI), female, lower bound (scale 0-1)" ...
$ Short definition : chr NA NA NA NA ...
$ Long definition : chr "The HCI calculates the contributions of health and education to worker productivity. The final index score rang"| __truncated__ "The HCI calculates the contributions of health and education to worker productivity. The final index score rang"| __truncated__ "The HCI lower bound reflects uncertainty in the measurement of the components and the overall index. It is obta"| __truncated__ "The HCI lower bound reflects uncertainty in the measurement of the components and the overall index. It is obta"| __truncated__ ...
$ Unit of measure : logi NA NA NA NA NA NA ...
$ Periodicity : chr NA NA NA NA ...
$ Base Period : logi NA NA NA NA NA NA ...
$ Other notes : logi NA NA NA NA NA NA ...
$ Aggregation method : chr NA NA NA NA ...
$ Limitations and exceptions : chr NA NA NA NA ...
$ Notes from original source : chr NA NA NA NA ...
$ General comments : chr NA NA NA NA ...
$ Source : chr "World Bank staff calculations based on the methodology described in World Bank (2018). https://openknowledge.wo"| __truncated__ "World Bank staff calculations based on the methodology described in World Bank (2018). https://openknowledge.wo"| __truncated__ "World Bank staff calculations based on the methodology described in World Bank (2018). https://openknowledge.wo"| __truncated__ "World Bank staff calculations based on the methodology described in World Bank (2018). https://openknowledge.wo"| __truncated__ ...
$ Statistical concept and methodology: chr NA NA NA NA ...
$ Development relevance : chr NA NA NA NA ...
$ Related source links : logi NA NA NA NA NA NA ...
$ Other web links : logi NA NA NA NA NA NA ...
$ Related indicators : logi NA NA NA NA NA NA ...
$ License Type : chr "CC BY-4.0" "CC BY-4.0" "CC BY-4.0" "CC BY-4.0" ...
#print out the column names
names(Data)
[1] "Country Name" "Country Code" "Indicator Name" "Indicator Code" "1960"
[6] "1961" "1962" "1963" "1964" "1965"
[11] "1966" "1967" "1968" "1969" "1970"
[16] "1971" "1972" "1973" "1974" "1975"
[21] "1976" "1977" "1978" "1979" "1980"
[26] "1981" "1982" "1983" "1984" "1985"
[31] "1986" "1987" "1988" "1989" "1990"
[36] "1991" "1992" "1993" "1994" "1995"
[41] "1996" "1997" "1998" "1999" "2000"
[46] "2001" "2002" "2003" "2004" "2005"
[51] "2006" "2007" "2008" "2009" "2010"
[56] "2011" "2012" "2013" "2014" "2015"
[61] "2016" "2017" "2018"
#print out the column names.
names(Series)
[1] "Series Code" "Topic"
[3] "Indicator Name" "Short definition"
[5] "Long definition" "Unit of measure"
[7] "Periodicity" "Base Period"
[9] "Other notes" "Aggregation method"
[11] "Limitations and exceptions" "Notes from original source"
[13] "General comments" "Source"
[15] "Statistical concept and methodology" "Development relevance"
[17] "Related source links" "Other web links"
[19] "Related indicators" "License Type"
Appearently, the Data dataset is not tidy up, column names are values instead of variables. Therefore I use gather function to transform data from wide to long format. After tidy up the Data dataframe, I selected a fews columns in Series dataframe, and join it with Data to generant a new dataframe.
# Gather some columns.
Data <- Data %>% gather(key="Year", value = "Value", -(1:4))
head(Data)
# Selected a fews variables from Series.
Series <- Series %>% select("Series Code", "Topic", "Periodicity", "Aggregation method")
head(Series)
# Rename the columns
names(Data)[names(Data) == "Country Code"] <- "CountryCode"
names(Data)[names(Data) == "Indicator Code"] <- "IndicatorCode"
names(Series)[names(Series) == "Series Code"] <- "IndicatorCode"
# Left join Data and Series by IndicatorCode
Data <- Data %>% left_join(Series)
Joining, by = "IndicatorCode"
head(Data)
# Filter the Data frame to pick up all human population data to generate a new dataframe Data_Pop
Data_Pop <- Data %>% filter(str_detect(IndicatorCode, "^SP.POP.(([0-9]{4})|(80UP)).(FE|MA)$"))
head(Data)
Continuly tidy and manipulate the dataframe.
First convert some character variables to factor, lable the IndicatorCode variable and order it. Then create two datasets targets to Female and Male ppopulation.
# Print the data structure
str(Data_Pop)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 519554 obs. of 9 variables:
$ Country Name : chr "Arab World" "Arab World" "Arab World" "Arab World" ...
$ CountryCode : chr "ARB" "ARB" "ARB" "ARB" ...
$ Indicator Name : chr "Population ages 00-04, female" "Population ages 00-04, male" "Population ages 05-09, female" "Population ages 05-09, male" ...
$ IndicatorCode : chr "SP.POP.0004.FE" "SP.POP.0004.MA" "SP.POP.0509.FE" "SP.POP.0509.MA" ...
$ Year : chr "1960" "1960" "1960" "1960" ...
$ Value : num 8051199 8366535 6406768 6685409 5043026 ...
$ Topic : chr "Health: Population: Structure" "Health: Population: Structure" "Health: Population: Structure" "Health: Population: Structure" ...
$ Periodicity : chr "Annual" "Annual" "Annual" "Annual" ...
$ Aggregation method: chr "Sum" "Sum" "Sum" "Sum" ...
# Print the unique value for ContryCode and Year column.
unique(Data_Pop$CountryCode)
[1] "ARB" "CSS" "CEB" "EAR" "EAS" "EAP" "TEA" "EMU" "ECS" "ECA" "TEC" "EUU" "FCS" "HPC"
[15] "HIC" "LTE" "LCN" "LAC" "TLA" "LDC" "LMY" "LIC" "LMC" "MEA" "MNA" "TMN" "MIC" "NAC"
[29] "INX" "OED" "OSS" "PSS" "PST" "PRE" "SST" "SAS" "TSA" "SSF" "SSA" "TSS" "UMC" "WLD"
[43] "AFG" "ALB" "DZA" "ASM" "AND" "AGO" "ATG" "ARG" "ARM" "ABW" "AUS" "AUT" "AZE" "BHS"
[57] "BHR" "BGD" "BRB" "BLR" "BEL" "BLZ" "BEN" "BMU" "BTN" "BOL" "BIH" "BWA" "BRA" "VGB"
[71] "BRN" "BGR" "BFA" "BDI" "CPV" "KHM" "CMR" "CAN" "CYM" "CAF" "TCD" "CHI" "CHL" "CHN"
[85] "COL" "COM" "COD" "COG" "CRI" "CIV" "HRV" "CUB" "CUW" "CYP" "CZE" "DNK" "DJI" "DMA"
[99] "DOM" "ECU" "EGY" "SLV" "GNQ" "ERI" "EST" "SWZ" "ETH" "FRO" "FJI" "FIN" "FRA" "PYF"
[113] "GAB" "GMB" "GEO" "DEU" "GHA" "GIB" "GRC" "GRL" "GRD" "GUM" "GTM" "GIN" "GNB" "GUY"
[127] "HTI" "HND" "HKG" "HUN" "ISL" "IND" "IDN" "IRN" "IRQ" "IRL" "IMN" "ISR" "ITA" "JAM"
[141] "JPN" "JOR" "KAZ" "KEN" "KIR" "PRK" "KOR" "XKX" "KWT" "KGZ" "LAO" "LVA" "LBN" "LSO"
[155] "LBR" "LBY" "LIE" "LTU" "LUX" "MAC" "MDG" "MWI" "MYS" "MDV" "MLI" "MLT" "MHL" "MRT"
[169] "MUS" "MEX" "FSM" "MDA" "MCO" "MNG" "MNE" "MAR" "MOZ" "MMR" "NAM" "NRU" "NPL" "NLD"
[183] "NCL" "NZL" "NIC" "NER" "NGA" "MKD" "MNP" "NOR" "OMN" "PAK" "PLW" "PAN" "PNG" "PRY"
[197] "PER" "PHL" "POL" "PRT" "PRI" "QAT" "ROU" "RUS" "RWA" "WSM" "SMR" "STP" "SAU" "SEN"
[211] "SRB" "SYC" "SLE" "SGP" "SXM" "SVK" "SVN" "SLB" "SOM" "ZAF" "SSD" "ESP" "LKA" "KNA"
[225] "LCA" "MAF" "VCT" "SDN" "SUR" "SWE" "CHE" "SYR" "TJK" "TZA" "THA" "TLS" "TGO" "TON"
[239] "TTO" "TUN" "TUR" "TKM" "TCA" "TUV" "UGA" "UKR" "ARE" "GBR" "USA" "URY" "UZB" "VUT"
[253] "VEN" "VNM" "VIR" "PSE" "YEM" "ZMB" "ZWE"
unique(Data_Pop$Year)
[1] "1960" "1961" "1962" "1963" "1964" "1965" "1966" "1967" "1968" "1969" "1970" "1971"
[13] "1972" "1973" "1974" "1975" "1976" "1977" "1978" "1979" "1980" "1981" "1982" "1983"
[25] "1984" "1985" "1986" "1987" "1988" "1989" "1990" "1991" "1992" "1993" "1994" "1995"
[37] "1996" "1997" "1998" "1999" "2000" "2001" "2002" "2003" "2004" "2005" "2006" "2007"
[49] "2008" "2009" "2010" "2011" "2012" "2013" "2014" "2015" "2016" "2017" "2018"
# Convert CountryCode to factor
Data_Pop <- Data_Pop %>% mutate(CountryCode = factor(CountryCode))
# Conver Year to factor
Data_Pop <- Data_Pop %>% mutate(Year = factor(Year))
# Print structure
str(Data_Pop)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 519554 obs. of 9 variables:
$ Country Name : chr "Arab World" "Arab World" "Arab World" "Arab World" ...
$ CountryCode : Factor w/ 259 levels "ABW","AFG","AGO",..: 6 6 6 6 6 6 6 6 6 6 ...
$ Indicator Name : chr "Population ages 00-04, female" "Population ages 00-04, male" "Population ages 05-09, female" "Population ages 05-09, male" ...
$ IndicatorCode : chr "SP.POP.0004.FE" "SP.POP.0004.MA" "SP.POP.0509.FE" "SP.POP.0509.MA" ...
$ Year : Factor w/ 59 levels "1960","1961",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Value : num 8051199 8366535 6406768 6685409 5043026 ...
$ Topic : chr "Health: Population: Structure" "Health: Population: Structure" "Health: Population: Structure" "Health: Population: Structure" ...
$ Periodicity : chr "Annual" "Annual" "Annual" "Annual" ...
$ Aggregation method: chr "Sum" "Sum" "Sum" "Sum" ...
# Pickup all the female population data into Data_Pop_FE
Data_Pop_FE <- Data_Pop %>% filter(endsWith(IndicatorCode, "FE"))
head(Data_Pop_FE)
# Pickup all the male population data into Data_Pop_MA
Data_Pop_MA <- Data_Pop %>% filter(endsWith(IndicatorCode, "MA"))
head(Data_Pop_MA)
# Print out the structor
str(Data_Pop_FE)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 259777 obs. of 9 variables:
$ Country Name : chr "Arab World" "Arab World" "Arab World" "Arab World" ...
$ CountryCode : Factor w/ 259 levels "ABW","AFG","AGO",..: 6 6 6 6 6 6 6 6 6 6 ...
$ Indicator Name : chr "Population ages 00-04, female" "Population ages 05-09, female" "Population ages 10-14, female" "Population ages 15-19, female" ...
$ IndicatorCode : chr "SP.POP.0004.FE" "SP.POP.0509.FE" "SP.POP.1014.FE" "SP.POP.1519.FE" ...
$ Year : Factor w/ 59 levels "1960","1961",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Value : num 8051199 6406768 5043026 4091892 3677343 ...
$ Topic : chr "Health: Population: Structure" "Health: Population: Structure" "Health: Population: Structure" "Health: Population: Structure" ...
$ Periodicity : chr "Annual" "Annual" "Annual" "Annual" ...
$ Aggregation method: chr "Sum" "Sum" "Sum" "Sum" ...
str(Data_Pop_MA)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 259777 obs. of 9 variables:
$ Country Name : chr "Arab World" "Arab World" "Arab World" "Arab World" ...
$ CountryCode : Factor w/ 259 levels "ABW","AFG","AGO",..: 6 6 6 6 6 6 6 6 6 6 ...
$ Indicator Name : chr "Population ages 00-04, male" "Population ages 05-09, male" "Population ages 10-14, male" "Population ages 15-19, male" ...
$ IndicatorCode : chr "SP.POP.0004.MA" "SP.POP.0509.MA" "SP.POP.1014.MA" "SP.POP.1519.MA" ...
$ Year : Factor w/ 59 levels "1960","1961",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Value : num 8366535 6685409 5347352 4223619 3776630 ...
$ Topic : chr "Health: Population: Structure" "Health: Population: Structure" "Health: Population: Structure" "Health: Population: Structure" ...
$ Periodicity : chr "Annual" "Annual" "Annual" "Annual" ...
$ Aggregation method: chr "Sum" "Sum" "Sum" "Sum" ...
# Convert IndicatorCode to factor
Data_Pop_FE <- Data_Pop_FE %>% mutate(IndicatorCode = factor(IndicatorCode))
str(Data_Pop_FE)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 259777 obs. of 9 variables:
$ Country Name : chr "Arab World" "Arab World" "Arab World" "Arab World" ...
$ CountryCode : Factor w/ 259 levels "ABW","AFG","AGO",..: 6 6 6 6 6 6 6 6 6 6 ...
$ Indicator Name : chr "Population ages 00-04, female" "Population ages 05-09, female" "Population ages 10-14, female" "Population ages 15-19, female" ...
$ IndicatorCode : Factor w/ 17 levels "SP.POP.0004.FE",..: 1 2 3 4 5 6 7 8 9 10 ...
$ Year : Factor w/ 59 levels "1960","1961",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Value : num 8051199 6406768 5043026 4091892 3677343 ...
$ Topic : chr "Health: Population: Structure" "Health: Population: Structure" "Health: Population: Structure" "Health: Population: Structure" ...
$ Periodicity : chr "Annual" "Annual" "Annual" "Annual" ...
$ Aggregation method: chr "Sum" "Sum" "Sum" "Sum" ...
Data_Pop_MA <- Data_Pop_MA %>% mutate(IndicatorCode = factor(IndicatorCode))
str(Data_Pop_MA)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 259777 obs. of 9 variables:
$ Country Name : chr "Arab World" "Arab World" "Arab World" "Arab World" ...
$ CountryCode : Factor w/ 259 levels "ABW","AFG","AGO",..: 6 6 6 6 6 6 6 6 6 6 ...
$ Indicator Name : chr "Population ages 00-04, male" "Population ages 05-09, male" "Population ages 10-14, male" "Population ages 15-19, male" ...
$ IndicatorCode : Factor w/ 17 levels "SP.POP.0004.MA",..: 1 2 3 4 5 6 7 8 9 10 ...
$ Year : Factor w/ 59 levels "1960","1961",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Value : num 8366535 6685409 5347352 4223619 3776630 ...
$ Topic : chr "Health: Population: Structure" "Health: Population: Structure" "Health: Population: Structure" "Health: Population: Structure" ...
$ Periodicity : chr "Annual" "Annual" "Annual" "Annual" ...
$ Aggregation method: chr "Sum" "Sum" "Sum" "Sum" ...
# Label the IndocatorCode and order it
Data_Pop_FE <- Data_Pop_FE %>% mutate(IndicatorCode = factor(IndicatorCode, levels = c("SP.POP.0004.FE","SP.POP.0509.FE","SP.POP.1014.FE","SP.POP.1519.FE","SP.POP.2024.FE","SP.POP.2529.FE","SP.POP.3034.FE","SP.POP.3539.FE","SP.POP.4044.FE","SP.POP.4549.FE","SP.POP.5054.FE","SP.POP.5559.FE","SP.POP.6064.FE","SP.POP.6569.FE","SP.POP.7074.FE","SP.POP.7579.FE","SP.POP.80UP.FE"), labels = c("<4","5-9","10-14","15-19","20-24","25-29","30-34","35-39","40-44","45-49","50-54","55-59","60-64","65-69","70-74","75-79","80>="), ordered = TRUE))
head(Data_Pop_FE, 16)
Data_Pop_MA <- Data_Pop_MA %>% mutate(IndicatorCode = factor(IndicatorCode, levels = c("SP.POP.0004.MA","SP.POP.0509.MA","SP.POP.1014.MA","SP.POP.1519.MA","SP.POP.2024.MA","SP.POP.2529.MA","SP.POP.3034.MA","SP.POP.3539.MA","SP.POP.4044.MA","SP.POP.4549.MA","SP.POP.5054.MA","SP.POP.5559.MA","SP.POP.6064.MA","SP.POP.6569.MA","SP.POP.7074.MA","SP.POP.7579.MA","SP.POP.80UP.MA"), labels = c("<4","5-9","10-14","15-19","20-24","25-29","30-34","35-39","40-44","45-49","50-54","55-59","60-64","65-69","70-74","75-79","80>="), ordered = TRUE))
head(Data_Pop_MA, 16)
Find missning values and replace them with mean value.
length(which(is.na(Data_Pop_FE$Value)))
[1] 25262
length(which(is.na(Data_Pop_MA$Value)))
[1] 25262
# Get all mean value for famale group
Pop_FE_Mean <- Data_Pop_FE %>% group_by(IndicatorCode) %>% summarise(Value = mean(Value, na.rm = TRUE))
kable(Pop_FE_Mean)
| IndicatorCode | Value |
|---|---|
| <4 | 10683790 |
| 5-9 | 9989082 |
| 10-14 | 9392177 |
| 15-19 | 8764482 |
| 20-24 | 8114239 |
| 25-29 | 7459173 |
| 30-34 | 6763063 |
| 35-39 | 6101743 |
| 40-44 | 5477162 |
| 45-49 | 4852887 |
| 50-54 | 4232561 |
| 55-59 | 3631057 |
| 60-64 | 3044799 |
| 65-69 | 2443921 |
| 70-74 | 1860036 |
| 75-79 | 1302138 |
| 80>= | 1313987 |
# Get all mean value for male group
Pop_MA_Mean <- Data_Pop_MA %>% group_by(IndicatorCode) %>% summarise(Value = mean(Value, na.rm = TRUE))
kable(Pop_MA_Mean)
| IndicatorCode | Value |
|---|---|
| <4 | 11282938.2 |
| 5-9 | 10538623.7 |
| 10-14 | 9886519.6 |
| 15-19 | 9187551.1 |
| 20-24 | 8452430.8 |
| 25-29 | 7722641.5 |
| 30-34 | 6969390.3 |
| 35-39 | 6252901.2 |
| 40-44 | 5571914.3 |
| 45-49 | 4879638.8 |
| 50-54 | 4179905.9 |
| 55-59 | 3490302.2 |
| 60-64 | 2812459.0 |
| 65-69 | 2130532.6 |
| 70-74 | 1496447.4 |
| 75-79 | 938941.1 |
| 80>= | 748580.0 |
# Replace mssing values by mean in female group.
Data_Pop_FE <- Data_Pop_FE %>% left_join(Pop_FE_Mean, by="IndicatorCode") %>% mutate(Value = ifelse(is.na(Value.x), Value.y,Value.x)) %>% select(-Value.x, -Value.y)
# Print how many missing value after replacing.
length(which(is.na(Data_Pop_FE$Value)))
[1] 0
# Replace mssing values by mean in male group.
Data_Pop_MA <- Data_Pop_MA %>% left_join(Pop_MA_Mean, by="IndicatorCode") %>% mutate(Value = ifelse(is.na(Value.x), Value.y,Value.x)) %>% select(-Value.x, -Value.y)
# Print how many missing value after replacing.
length(which(is.na(Data_Pop_FE$Value)))
[1] 0
Deal with outliers for female group and male group. Pick up those population ages between 30 to 34 and replace the outliers values by mean
#Pick up all the female population aged between 30 to 34
Data_Pop_FE_30_34 <- Data_Pop_FE %>% filter(IndicatorCode == "30-34")
head(Data_Pop_FE_30_34, 16)
# Get z scroes
z.scores <- Data_Pop_FE_30_34$Value %>% scores(type="z")
z.scores %>% summary()
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.31186 -0.30828 -0.29648 0.00000 -0.07227 12.94023
# Replace outliers by mean
Data_Pop_FE_30_34$Value[which(abs(z.scores) > 3)] <- mean(Data_Pop_FE_30_34$Value, na.rm = TRUE)
#Pick up all the female population aged between 30 to 34
Data_Pop_MA_30_34 <- Data_Pop_MA %>% filter(IndicatorCode == "30-34")
head(Data_Pop_MA_30_34, 16)
# Get z scroes
z.scores <- Data_Pop_MA_30_34$Value %>% scores(type="z")
z.scores %>% summary()
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.31060 -0.30699 -0.29581 0.00000 -0.07884 12.99371
# Replace outliers by mean
Data_Pop_MA_30_34$Value[which(abs(z.scores) > 3)] <- mean(Data_Pop_MA_30_34$Value, na.rm = TRUE)
Apply an appropriate transformation for at least one of the variables. In addition to the R codes and outputs, explain everything that you do in this step. In this step, you should fulfil the minimum requirement #9.
# Print historgrams before transform
hist(Data_Pop_FE$Value, main = "Histogram Data in Female Group", xlab = "Population")
# Apply transform to value
Data_Pop_FE$Value <- log10(Data_Pop_FE$Value)
# Print histograms after transform.
hist(Data_Pop_FE$Value, main = "Histogram Data in Female Group",xlab = "Population")
# Print historgrams after transform
hist(Data_Pop_MA$Value, main = "Histogram Data in Male Group")
# Apply transform to value
Data_Pop_MA$Value <- log10(Data_Pop_MA$Value)
# Print historgrams after transform
hist(Data_Pop_MA$Value, main = "Histogram Data in Male Group",xlab = "Population")