Data Storytelling Statistics

Recreating Hans Rosling’s stories with Data

Introduction

A lot of temporal data is being collected through modern computing technologies. These huge collections of data, if studied, can provide valuable insights that help people make informed decisions. However, this also presents a challenge: how should these insights be presented in order to get the desired response from the audience? One avenue that is gaining popularity is using data-driven stories to share the insights contained in the data.

In this work, I analyse the packages in R that can be used to identify the facts that are presented in Hans Rosling’s data stories. Examples of Hans Rosling’s data stories can be found on the Gapminder website (URL: https://www.gapminder.org/videos/200-years-that-changed-the-world/). In his data stories, Hans Rosling explains and demonstrates trends in public health data. He identifies points in time that he uses as reference points to explain the state of the world at that point, or to explain how the world has changed after that point. In addition, he compares the trends in specific countries or regions.

The human-in-the-loop concept can be introduced by specifying individual events and identifying the effects these have on the time series.

R Packages for decomposing the data stories

Changepoints in the time series

Detecting points of change (called changepoints or breakpoints) is important in time series analysis. Changepoints are abrupt variations in time series data. Changepoint analysis seeks to identify the point or points in a time series at which a change, or break in the trend, occurs. It is widely applied to questions in the social and natural sciences where researchers want to examine the points in time at which a statistically significant change occurs in the quantity being studied. Studying these changepoints may help to explain the effects of some activities or the results of human decisions at some point in time.
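To make the idea of a changepoint concrete, the following is a minimal base R sketch, not the method used in the rest of this work. It simulates a series whose mean drops partway through and then searches for the single split that minimises the residual sum of squares when each segment is modelled by its own mean. The simulated values, the helper name `split_rss`, and the minimum segment length of 5 are all assumptions made for illustration.

```r
set.seed(42)

# Simulated series (assumed values): mean 5 for 60 points, then mean 3 for 40 points
y <- c(rnorm(60, mean = 5, sd = 0.5), rnorm(40, mean = 3, sd = 0.5))

# Residual sum of squares when the series is split after position k
# and each segment is described by its own mean (hypothetical helper)
split_rss <- function(y, k) {
  left <- y[1:k]
  right <- y[(k + 1):length(y)]
  sum((left - mean(left))^2) + sum((right - mean(right))^2)
}

# Try every split that leaves at least 5 points on each side
candidates <- 5:(length(y) - 5)
rss <- sapply(candidates, function(k) split_rss(y, k))

# The candidate with the smallest RSS is the estimated changepoint
estimated_break <- candidates[which.min(rss)]
estimated_break
```

The estimate should land close to position 60, where the simulated mean changes. Packages such as strucchange generalise this search to several breaks and to regression models, and add formal tests.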

strucchange::breakpoints

When the standards for defining and observing a variable change over time, we call this a structural breakpoint. When there is an unexpected change in a time series, we call this a structural break. Structural breaks are the result of exogenous (external) factors, for example a pandemic, a change in government policy, or some other unexpected event. These external factors may change the structure of the time series, thus introducing a structural breakpoint. It is important to identify structural change because, if a structural change exists in a time series and is not detected, it will lead to forecasting errors or an unreliable model. The strucchange package can be used to detect structural changes. It has been applied in several research activities. For example, Davis et al. (https://www.pnas.org/doi/epdf/10.1073/pnas.1815107116) used it to detect changes in wildfires due to climate change. Using strucchange::breakpoints, they detected that there was a change in
I am going to use the breakpoints function of the strucchange package to identify the points in a time series where change takes place. I will then explain the trends between successive breakpoints. Human intervention comes in when specifying any external factor that could have led to the change at a specific changepoint.

Steps for using strucchange::breakpoints:
Step 1: library(strucchange) #loading the strucchange library
Step 2: plot.ts(dataset) #plotting the time series
Step 3: create the model: variableName = Fstats(dataset ~ 1, from = start date, to = end date)
Step 4: find the structural breakpoints: strucchange::breakpoints(dataset ~ 1). After determining the breakpoints, check the literature for external factors that occurred at the breakpoints (this is where the human input comes in).
Step 5: run the structural breakpoint tests
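The steps above can be sketched end to end as follows. This is an illustration under stated assumptions rather than the analysis itself: it uses the `Nile` annual flow series that ships with R as a stand-in for the Gapminder series, it assumes the strucchange package is installed, and it uses a constant-mean model (`Nile ~ 1`) with the default trimming.

```r
library(strucchange)  # Step 1: load the package

# Stand-in data (assumption): annual Nile flow, included with base R
data("Nile")

# Step 2: plot the time series
plot.ts(Nile)

# Step 3: F statistics for a shift in the mean at each candidate date
fs <- Fstats(Nile ~ 1)
plot(fs)

# Step 4: estimate the breakpoints and convert them to calendar dates
bp <- breakpoints(Nile ~ 1)
summary(bp)
break_years <- breakdates(bp)
break_years

# Step 5: test whether the structural change is statistically significant
sctest(fs, type = "supF")
```

For the UK fertility series, `Nile` would be replaced by the corresponding `ts` object. The dates returned by `breakdates()` are the points to check against the literature for external events.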

Dataset

In this research, I use data from gapminder.org. The data are time series of income per capita, life expectancy, population, and average number of children per woman of child-bearing age, among other attributes. Each attribute is stored as a separate dataset. After downloading the data from gapminder.org, I merge them into one dataset. I use the merged dataset to recreate the data stories as told by Hans Rosling, and to demonstrate how packages in R can be used to identify significant statistics to include in a data story.

#Loading the libraries

#Loading the data

Let's load the data:

#Reshaping the data

#Merging the dataset

The data from Gapminder are merged into one dataset. The merged dataset will be used when I incorporate the animated visualization in the analysis.
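As an illustration of the reshaping and merging described above, here is a minimal base R sketch. It assumes that each downloaded file has one row per country and one column per year; the two small data frames, their values, and the helper `to_long` are hypothetical placeholders, not Gapminder figures or the code used in this work.

```r
# Toy wide tables (hypothetical placeholder values, not real Gapminder data)
life_wide <- data.frame(
  Country = c("Country A", "Country B"),
  `1800` = c(30, 40),
  `1801` = c(31, 41),
  check.names = FALSE
)
fert_wide <- data.frame(
  Country = c("Country A", "Country B"),
  `1800` = c(6, 5),
  `1801` = c(6.1, 4.9),
  check.names = FALSE
)

# Reshape one wide table (a column per year) into long form:
# one row per Country and Year (hypothetical helper)
to_long <- function(wide, value_name) {
  year_cols <- setdiff(names(wide), "Country")
  long <- data.frame(
    Country = rep(wide$Country, times = length(year_cols)),
    Year = rep(as.integer(year_cols), each = nrow(wide)),
    value = unlist(wide[year_cols], use.names = FALSE),
    stringsAsFactors = FALSE
  )
  names(long)[3] <- value_name
  long
}

life_long <- to_long(life_wide, "Life_Expectancy")
fert_long <- to_long(fert_wide, "Fertility_Rate")

# Merge the long tables on the shared keys
merged <- merge(life_long, fert_long, by = c("Country", "Year"))
merged
```

Further attributes can be added by repeating the `to_long()` and `merge()` steps for each table.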

#Filtering the content

library(dplyr)  #provides %>% and filter()

#One subset per economic group; filter() looks up Economic_Group inside the data it is given
HighIncomeCountries.consolidatedData = consolidatedData %>% dplyr::filter(Economic_Group == "High income")
LowerMiddleIncomeCountries.consolidatedData = consolidatedData %>% dplyr::filter(Economic_Group == "Lower middle income")
LowIncomeCountries.consolidatedData = consolidatedData %>% dplyr::filter(Economic_Group == "Low income")
UpperMiddleIncomeCountries.consolidatedData = consolidatedData %>% dplyr::filter(Economic_Group == "Upper middle income")

#Extracting data for UK to use in the preliminary analysis

UK.Data = HighIncomeCountries.consolidatedData %>% dplyr::filter(Country == "United Kingdom")

#Columns are selected by their position in the merged dataset
UKData.totalPopulation <- UK.Data[,4]   #total population
UKData.fertilityRate <- UK.Data[,9]     #fertility rate
UKData.GDPPerCapita <- UK.Data[,5]      #GDP per capita
UKData.lifeExpectancy <- UK.Data[,6]    #life expectancy

#Presenting UK Data as a Time Series
UKFertilityRatesTimeSeries <- ts(UKData.fertilityRate,  start=c(1800), frequency = 1)

head(UKFertilityRatesTimeSeries)

## Time Series:
## Start = 1800 
## End = 1805 
## Frequency = 1 
## [1] 4.60 5.30 5.61 5.65 5.55 5.49

plot.ts(UKFertilityRatesTimeSeries)

#In the following code I was trying to visualise a smoothened curve
#UKFertilityRatesTimeSeriesSMA3 <- TTR::SMA(UKFertilityRatesTimeSeries, n=3)
#plot.ts(UKFertilityRatesTimeSeriesSMA3)

#UKFertilityRatesTimeSeriesSMA5 <- TTR::SMA(UKFertilityRatesTimeSeries, n=5)
#plot.ts(UKFertilityRatesTimeSeriesSMA5)

#UKFertilityRatesTimeSeriesSMA8 <- TTR::SMA(UKFertilityRatesTimeSeries, n=8)
#plot.ts(UKFertilityRatesTimeSeriesSMA8)
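The commented-out smoothing above relies on `TTR::SMA`. As a hedged alternative that needs no extra package, a trailing moving average can be sketched with `stats::filter`, which should give the same kind of trailing mean. The toy series and the helper name `trailing_mean` are assumptions for illustration; the values are not Gapminder data.

```r
# Toy annual series (hypothetical values)
toy_series <- ts(c(4, 6, 5, 7, 6, 8, 7, 9), start = 1800, frequency = 1)

# Trailing moving average over the last n observations (hypothetical helper);
# sides = 1 uses only current and past values, so the first n - 1 results are NA
trailing_mean <- function(x, n) {
  stats::filter(x, rep(1 / n, n), sides = 1)
}

smoothed3 <- trailing_mean(toy_series, 3)

# Plot the raw series with the smoothed curve drawn on top
plot.ts(toy_series)
lines(smoothed3, lty = 2)
```

A larger window `n` gives a smoother curve but loses more observations at the start of the series.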

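#decompose() needs a seasonal series (frequency > 1), so it does not apply to this annual series
#UKFertilityRatesTimeSeriesComponents <- decompose(UKFertilityRatesTimeSeries)

#plot(UKFertilityRatesTimeSeriesComponents)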

# <- population.estimate[,c(2,25:60)] 
#View(UK.Data)

#visualising the time series and trying to fit in the trend line
#plot_ly(x = UK.Data$Year, y = UK.Data$Life_Expectancy) %>% add_lines(y = UK.Data$Life_Expectancy) %>% layout(xaxis = list(title = "Year"), yaxis = list(title = "Life Expectancy"))


2022-05-09