1 Final Project Proposal Requirement

Your final project is to create a public visualization using data relevant to a current policy, business, or justice issue. You may use any dataset you can find for this assignment, as long as it is either public or you have permission from the data’s owner/administrator to work with it and share it.

Recommended data sources are: governmental data, data provided by non-profit or nongovernmental organizations, and data available from large, semi-structured data sets (e.g., social networks, company financials, etc.).

You must document each step of your data analysis process (excluding data acquisition) in code: this will include changing the format of the data and the creation of any images or interactive displays that are made.

You must also include a short (2-3 paragraph) write-up on the visualization. This write-up must include the following: the data source, what the parameters of the data set are (geography, timeframe, what the data points are, etc.), what the data shows, and why it is important.

Grading:

This assignment will account for 40% of your final grade. Points will be awarded for the following components:

25% - finding your dataset(s) and getting approval for your project on time, recognition of strengths/weaknesses of data, analysis to find insights in the data
25% - data handling: cleaning, outlier/null handling, and transfer/loading data to the web
40% - data presentation: compliance with best data visualization practices, clarity, information-to-ink ratio, how memorable the visualization is
10% - contextual write-up: why the data is important, why the insights are important

Due Dates:

Note - The type of deliverable you provide will depend on the strategy you use for this project. If you put together an interactive visualization, you should be able to provide code that I will be able to run and host locally. If you are choosing static visualizations, your write-up will be more important to your overall grade, and it may be useful to think about how you’re presenting these visualizations (in a formatted R Markdown document, for example).

Proposal

You must submit a proposal for your project by 03/26. This proposal must include: a link to the data source, an explanation of what you want to show, why this is relevant to a current policy, business, or justice issue, and which technologies you plan to use.

Your instructor must approve this proposal: you may have to refine it somewhat. You will present your final project during our last meetup. If you are not able to attend the lecture on that day, you must write up a status report with screenshots of current progress and issues you are experiencing.

2 Prerequisites: Available Libraries

knitr
plyr
dplyr
sqldf
data.table
DT
kableExtra
ggplot2
reshape2
plotly
graphics
ggthemes
googleVis
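
Before running the analysis, it can help to confirm that the packages listed above are installed. The following is a minimal sketch using only base R; the variable names (pkgs, installed, missing_pkgs) are illustrative and not part of the original project. The graphics package is left out because it ships with base R.

```r
# Packages from the list above; "graphics" is part of base R and needs no install.
pkgs <- c("knitr", "plyr", "dplyr", "sqldf", "data.table", "DT",
          "kableExtra", "ggplot2", "reshape2", "plotly", "ggthemes", "googleVis")

# requireNamespace() returns FALSE (rather than raising an error) when a package is absent.
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]

if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("All listed packages are available.")
}
```

Any package reported as missing can then be installed with install.packages() before knitting the document.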

3 Final Project Proposal

3.1 Aim:

The goal of this project is to:

Examine whether demographics such as age, sex, and location have an impact on types of death and economic status.
Show the merits and demerits of visual analytics in data analysis, and how to improve it.
Make the visualizations tell the story in an unambiguous way that is understandable even to a layperson.

3.2 Methodology:

The project places more emphasis on explanatory techniques, which are used to present the data to viewers more succinctly. I therefore plan to use the R programming language to explore and analyze the dataset.

The dataset to be used is the World Health Nutrition and Population Statistics for the years 2000 to 2019. It can be obtained from DataBank (Health Nutrition and Population Statistics), last updated on 12/20/2019.

Load the source dataset

df <- read.csv("World Health Nutrition and Population Statistics_2020-2020.csv", header = TRUE, sep = ",", stringsAsFactors = FALSE)
knitr::kable(head(df[200:206, ]))

	Year	Country_Name	Country_Code	Adults_15_living_HIV	Adults_Children_0_14_15_living_HIV	AIDS_estimated_deaths_UNAIDS	Adults_children_0_14_15_newly_infected_HIV	Adults_15_newly_infected_HIV	Children_0_14_living_with_HIV	Children_orphaned_by_HIV_AIDS	Children_0_14_newly_infected_HIV	Incidence_tuberculosis_per_100000	Labor_force_total	Mortality_traffic_injury_100K	Population_female	Population_male	Population_total	Malaria_cases_reported	Suicide_mortality_per_100K	Tuberculosis_death_per_100K	Tuberculosis_case_detection	Tuberculosis_treatment_success_NewCases
200	2000	Ukraine	UKR	170000	170000	4500	29000	29000	760	5600	500	114.0	23221521	NA	26284954	22890894	49175848	NA	36.90000	23.00	59	NA
201	2000	Upper middle income	UMC	NA	NA	NA	NA	NA	NA	NA	NA	103.0	1179100250	NA	1149821735	1165590459	2317310149	NA	14.01896	9.80	48	81
202	2000	Uruguay	URY	5900	6000	500	590	570	100	1100	100	22.0	1567214	NA	1713077	1606659	3319736	NA	17.40000	2.30	87	85
203	2000	United States	USA	NA	NA	NA	NA	NA	NA	NA	NA	6.7	146767130	NA	143178430	138983981	282162411	NA	11.30000	0.32	87	83
204	2000	Uzbekistan	UZB	14000	14000	840	2100	1900	730	7600	200	99.0	9733490	NA	12392841	12257559	24650400	126	7.60000	16.00	64	80
205	2000	St. Vincent and the Grenadines	VCT	NA	NA	NA	NA	NA	NA	NA	NA	17.0	48284	NA	53486	54298	107784	NA	6.30000	3.20	87	100
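
The preview above contains many NA entries, so a per-column count of missing values is a useful first check on the strengths and weaknesses of the data. The sketch below uses a toy data frame built from three rows and four columns of the preview (Ukraine, Uruguay, United States); the names toy and na_share are illustrative. The same two lines would apply to the full df.

```r
# Toy rows copied from the preview above (a subset of columns only).
toy <- data.frame(
  Country_Code               = c("UKR", "URY", "USA"),
  Adults_15_living_HIV       = c(170000, 5900, NA),
  Malaria_cases_reported     = c(NA, NA, NA),
  Suicide_mortality_per_100K = c(36.9, 17.4, 11.3),
  stringsAsFactors = FALSE
)

# Percentage of missing values in each column.
na_share <- round(colMeans(is.na(toy)) * 100, 1)
print(na_share)
```

Columns that are almost entirely NA (Malaria_cases_reported in this toy sample) are weak candidates for visualization unless the analysis is restricted to the countries that report them.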

World longitudes and latitudes

lat_long <- read.csv("Countries-longitude_latitude.csv", header = TRUE, sep = ",")
colnames(lat_long) <- c("Country", "Country_Code", "Latitude", "Longtitude")
knitr::kable(head(lat_long))

Country	Country_Code	Latitude	Longtitude
Afghanistan	AFG	33.0000	65.0
Albania	ALB	41.0000	20.0
Algeria	DZA	28.0000	3.0
American Samoa	ASM	-14.3333	-170.0
Andorra	AND	42.5000	1.6
Angola	AGO	-12.5000	18.5

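Merge the health data with the country coordinates on Country_Code, then convert columns 5 to 11 (HIV/AIDS counts) to numeric.

df2 <- merge(df, lat_long, by = "Country_Code", all = FALSE)
# Entries that cannot be parsed as numbers become NA; coercion warnings are silenced for this step only
df2[, 5:11] <- suppressWarnings(lapply(df2[, 5:11], as.numeric))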
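
Because the merge uses all = FALSE (an inner join), any row whose Country_Code has no match in the coordinates file is dropped. That probably includes aggregate rows such as "Upper middle income" (UMC), assuming the coordinates file lists only individual countries. The sketch below illustrates this on toy data; health, coords, merged, and dropped are hypothetical names, and the coordinates are the ones shown in the preview above.

```r
# Toy stand-ins for df and lat_long.
health <- data.frame(
  Country_Code = c("AFG", "ALB", "UMC"),
  Country_Name = c("Afghanistan", "Albania", "Upper middle income"),
  stringsAsFactors = FALSE
)
coords <- data.frame(
  Country_Code = c("AFG", "ALB"),   # assumption: no row for the UMC aggregate
  Latitude     = c(33, 41),
  Longtitude   = c(65, 20),
  stringsAsFactors = FALSE
)

# Inner join: only codes present in both data frames survive.
merged <- merge(health, coords, by = "Country_Code", all = FALSE)

# Codes that were silently dropped by the join.
dropped <- setdiff(health$Country_Code, merged$Country_Code)
print(dropped)
```

Running the same setdiff() on the real df and df2 would show which entities are excluded from the maps.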

Combine the Latitude and Longitude columns into a single "latitude:longitude" string, the format googleVis accepts for marker locations on maps.

df2$Lat_Long <- paste(df2$Latitude, df2$Longtitude, sep = ":")
knitr::kable(head(df2))

Country_Code	Year	Country_Name	Adults_15_living_HIV	Adults_Children_0_14_15_living_HIV	AIDS_estimated_deaths_UNAIDS	Adults_children_0_14_15_newly_infected_HIV	Adults_15_newly_infected_HIV	Children_0_14_living_with_HIV	Children_orphaned_by_HIV_AIDS	Children_0_14_newly_infected_HIV	Incidence_tuberculosis_per_100000	Labor_force_total	Mortality_traffic_injury_100K	Population_female	Population_male	Population_total	Malaria_cases_reported	Suicide_mortality_per_100K	Tuberculosis_death_per_100K	Tuberculosis_case_detection	Tuberculosis_treatment_success_NewCases	Country	Latitude	Longtitude	Lat_Long
ABW	2006	Aruba	NA	NA	NA	NA	NA	NA	NA	NA	8.9	NA	NA	52897	47937	100834	NA	NA	0.73	NA	NA	Aruba	12.52111	-69.9667	12.52111:-69.9667
ABW	2015	Aruba	NA	NA	NA	NA	NA	NA	NA	NA	11.0	NA	NA	54743	49598	104341	NA	NA	0.92	NA	NA	Aruba	12.52111	-69.9667	12.52111:-69.9667
ABW	2017	Aruba	NA	NA	NA	NA	NA	NA	NA	NA	8.7	NA	NA	55331	50035	105366	NA	NA	0.72	87	NA	Aruba	12.52111	-69.9667	12.52111:-69.9667
ABW	2005	Aruba	NA	NA	NA	NA	NA	NA	NA	NA	8.6	NA	NA	52456	47575	100031	NA	NA	0.71	NA	NA	Aruba	12.52111	-69.9667	12.52111:-69.9667
ABW	2010	Aruba	NA	NA	NA	NA	NA	NA	NA	NA	6.8	NA	NA	53202	48467	101669	NA	NA	0.56	87	NA	Aruba	12.52111	-69.9667	12.52111:-69.9667
ABW	2003	Aruba	NA	NA	NA	NA	NA	NA	NA	NA	8.1	NA	NA	50707	46310	97017	NA	NA	0.67	NA	NA	Aruba	12.52111	-69.9667	12.52111:-69.9667

We now use SQL (via sqldf) to query the merged data for 2000 to 2019, selecting the country-year records where children orphaned by HIV/AIDS make up at least 2% of the total population and keeping the 50 highest percentages.

Twentyfirst_Cen <- sqldf("
  SELECT Country_Name, Year, Lat_Long, Population_total,
         (Children_orphaned_by_HIV_AIDS / Population_total) * 100 AS 'Percentage_Orphaned_byHIV'
  FROM df2
  WHERE Percentage_Orphaned_byHIV >= 2
  ORDER BY Percentage_Orphaned_byHIV DESC
  LIMIT 50
")

head(Twentyfirst_Cen)

##   Country_Name Year Lat_Long Population_total Percentage_Orphaned_byHIV
## 1     Zimbabwe 2004   -20:30         12019912                  8.319528
## 2     Zimbabwe 2005   -20:30         12076699                  8.280408
## 3     Zimbabwe 2006   -20:30         12155491                  8.226735
## 4     Zimbabwe 2003   -20:30         11982224                  8.178782
## 5     Zimbabwe 2007   -20:30         12255922                  8.159321
## 6     Zimbabwe 2008   -20:30         12379549                  8.077839
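
For readers less familiar with SQL, the same filter can be sketched in base R. This is an illustrative equivalent on toy data, not the code used in the project. The Uzbekistan and Ukraine figures come from the first preview table; the Zimbabwe orphan count is an assumption, back-calculated from the percentage and population printed above.

```r
toy <- data.frame(
  Country_Name     = c("Zimbabwe", "Uzbekistan", "Ukraine"),
  Year             = c(2004, 2000, 2000),
  Population_total = c(12019912, 24650400, 49175848),
  # Zimbabwe value is assumed (about 8.32% of the population); the others are from the preview.
  Children_orphaned_by_HIV_AIDS = c(1000000, 7600, 5600),
  stringsAsFactors = FALSE
)

# Orphans as a percentage of total population.
toy$Percentage_Orphaned_byHIV <-
  toy$Children_orphaned_by_HIV_AIDS / toy$Population_total * 100

# Keep rows at or above 2%; which() drops rows where the percentage is NA.
keep <- toy[which(toy$Percentage_Orphaned_byHIV >= 2), ]

# Sort descending and keep at most 50 rows, mirroring ORDER BY ... DESC LIMIT 50.
top50 <- head(keep[order(-keep$Percentage_Orphaned_byHIV), ], 50)
print(top50[, c("Country_Name", "Year", "Percentage_Orphaned_byHIV")])
```

Note that the threshold is a share of the population, so a populous country with many orphans in absolute terms can still fall below the 2% cut-off.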

The world map showing the countries where children are orphaned by HIV/AIDS (2000-2019)

Show_map <- googleVis::gvisGeoChart(Twentyfirst_Cen, locationvar ="Lat_Long", hovervar ="Country_Name", sizevar = "Percentage_Orphaned_byHIV", colorvar = "Population_total",
                   options=list(displayMode="Markers", 
                                colorAxis="{colors:['purple', 'red', 'orange', 'grey', 'pink']}",
                                backgroundColor="lightblue"), 
                   chartid="Lost_Their_Parents_To_HIV_AIDS")
plot(Show_map)

From the map above, we can see that the majority of countries where children orphaned by HIV/AIDS make up at least 2% of the population are in the southern part of Africa.

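library(graphics)
# Dot plot of the orphaned percentage per country; Year is wrapped in factor() because scale_fill_brewer() expects a discrete variable
plotly::ggplotly(ggplot2::ggplot(Twentyfirst_Cen, ggplot2::aes(x=Percentage_Orphaned_byHIV, y=reorder(Country_Name, +Percentage_Orphaned_byHIV), fill=factor(Year))) +
    ggplot2::geom_point(colour="purple", size=2, alpha=.8) +
    ggplot2::scale_fill_brewer(palette="Blues", breaks=rev(levels(factor(Twentyfirst_Cen$Year)))) +
    ggplot2::labs(title="Chart of Children Orphaned by HIV/AIDS by Country (%)", fill="Year"))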

The chart below shows the countries where, between 2000 and 2019, children orphaned by HIV/AIDS make up the largest share of the population.

Combo <- googleVis::gvisComboChart(Twentyfirst_Cen, xvar="Country_Name", yvar="Percentage_Orphaned_byHIV",
                                   options=list(seriesType="bars",
                                                bar="{groupWidth:'100%'}",
                                                title="Interactive Chart Of Countries And Related HIV/AIDS Orphaned ",
                                                series='{0: {type:"line"}}'),chartid = "ER")
plot(Combo)

# Same map as above, shown again next to the combo chart; it gets its own chartid so the two charts do not clash in one HTML page
Orphan_Map_2 <- googleVis::gvisGeoChart(Twentyfirst_Cen, locationvar ="Lat_Long",  hovervar ="Country_Name",sizevar = "Percentage_Orphaned_byHIV", colorvar = "Population_total",
                   options=list(displayMode="Markers", 
                               colorAxis="{colors:['purple', 'red', 'orange', 'grey', 'pink']}",
                               backgroundColor="lightblue"), 
                   chartid="Lost_Their_Parents_To_HIV_AIDS_2")
plot(Orphan_Map_2)

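The chart, together with the query output above (where the top rows are Zimbabwe in 2003 to 2008), suggests that the share of children orphaned by HIV/AIDS was highest in the early to mid 2000s rather than in the later years of the 2000-2019 period.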

DATA 608 Final Project Proposal

Debabrata Kabiraj

March 21, 2020
