The metropolitan and micropolitan statistical area (MMSA) data were selected from the Selected Metropolitan/Micropolitan Area Risk Trends (SMART) BRFSS MMSA data. SMART comprises 486,238 or more participants who responded to the 2016 BRFSS combined landline telephone and cellular telephone survey. The aggregate Behavioral Risk Factor Surveillance System (BRFSS) landline and cell phone dataset was built from the landline and cell phone data submitted for 2016 and includes data for the 50 states, the District of Columbia, Guam, and Puerto Rico. The dataset is available at https://www.cdc.gov/brfss/annual_data/annual_2016.html. The combined landline and cell phone data set was exported from SAS V9.3 in the XPT transport format. The file contains 275 variables and the dataset is very untidy. I was able to import the XPT transport file into R, although some of the variable labels are truncated in the process of converting the XPT format. Tidying and transforming this data is required for proper analysis.
getwd()
## [1] "C:/Users/Emahayz_Pro/Desktop/Data_Science/Data 607/Week6"
setwd("C:/Users/Emahayz_Pro/Desktop/Data_Science/Data 607/Week6")
library(foreign)
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
This data is too large for GitHub, which rejected it due to its size limitation. The dataset has 486,303 observations of 275 variables.
CDC <- read.xport("LLCP2016.XPT")
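Since some SAS variable labels get truncated by this import route, a possible alternative (assuming the haven package is installed; CDC_labelled is a name introduced here purely for illustration) is haven::read_xpt(), which keeps SAS variable labels as column attributes:
library(haven)
# returns a tibble; SAS variable labels survive as column attributes
CDC_labelled <- read_xpt("LLCP2016.XPT")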
str(CDC)
## 'data.frame': 486303 obs. of 275 variables:
## $ X_STATE : num 1 1 1 1 1 1 1 1 1 1 ...
## $ FMONTH : num 1 1 1 1 1 1 1 1 1 1 ...
## $ IDATE : Factor w/ 419 levels "01012017","01022017",..: 10 18 8 12 6 10 12 10 10 52 ...
## $ IMONTH : Factor w/ 12 levels "01","02","03",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ IDAY : Factor w/ 31 levels "01","02","03",..: 7 11 6 8 5 7 8 7 7 28 ...
## $ IYEAR : Factor w/ 2 levels "2016","2017": 1 1 1 1 1 1 1 1 1 1 ...
## $ DISPCODE : num 1100 1100 1100 1100 1100 1100 1100 1100 1100 1100 ...
## $ SEQNO : Factor w/ 36955 levels "2016000001","2016000002",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ X_PSU : num 2.02e+09 2.02e+09 2.02e+09 2.02e+09 2.02e+09 ...
## $ CTELENM1 : num 1 1 1 1 1 1 1 1 1 1 ...
## $ PVTRESD1 : num 1 1 1 1 1 1 1 1 1 1 ...
## $ COLGHOUS : num NA NA NA NA NA NA NA NA NA NA ...
## $ STATERES : num 1 1 1 1 1 1 1 1 1 1 ...
## $ CELLFON4 : num 2 2 2 2 2 2 2 2 2 2 ...
## $ LADULT : num NA NA NA NA NA NA NA NA NA NA ...
## $ NUMADULT : num 2 2 1 2 3 3 2 2 3 1 ...
## $ NUMMEN : num 1 1 0 1 2 1 1 1 1 1 ...
## $ NUMWOMEN : num 1 1 1 1 1 2 1 1 2 0 ...
## $ CTELNUM1 : num NA NA NA NA NA NA NA NA NA NA ...
## $ CELLFON5 : num NA NA NA NA NA NA NA NA NA NA ...
## $ CADULT : num NA NA NA NA NA NA NA NA NA NA ...
## $ PVTRESD3 : num NA NA NA NA NA NA NA NA NA NA ...
## $ CCLGHOUS : num NA NA NA NA NA NA NA NA NA NA ...
## $ CSTATE1 : num NA NA NA NA NA NA NA NA NA NA ...
## $ LANDLINE : num NA NA NA NA NA NA NA NA NA NA ...
## $ HHADULT : num NA NA NA NA NA NA NA NA NA NA ...
## $ GENHLTH : num 1 2 3 3 5 3 3 3 2 2 ...
## $ PHYSHLTH : num 88 88 88 88 10 88 88 88 88 88 ...
## $ MENTHLTH : num 88 88 1 88 3 88 88 88 88 88 ...
## $ POORHLTH : num NA NA 88 NA 3 NA NA NA NA NA ...
## $ HLTHPLN1 : num 1 1 1 1 1 1 1 1 1 1 ...
## $ PERSDOC2 : num 2 1 2 1 1 1 2 1 1 1 ...
## $ MEDCOST : num 2 2 2 2 2 2 2 2 2 2 ...
## $ CHECKUP1 : num 1 1 1 1 1 1 1 1 1 1 ...
## $ EXERANY2 : num 1 1 1 1 1 2 1 1 1 1 ...
## $ SLEPTIM1 : num 5 6 9 7 6 8 7 8 8 7 ...
## $ CVDINFR4 : num 2 2 1 2 2 2 2 2 2 2 ...
## $ CVDCRHD4 : num 2 2 2 2 2 2 2 2 2 2 ...
## $ CVDSTRK3 : num 2 2 2 2 2 2 2 2 2 2 ...
## $ ASTHMA3 : num 2 2 2 2 1 2 2 1 2 2 ...
## $ ASTHNOW : num NA NA NA NA 2 NA NA 1 NA NA ...
## $ CHCSCNCR : num 2 2 1 1 2 2 1 1 2 2 ...
## $ CHCOCNCR : num 2 2 2 2 2 2 2 2 2 2 ...
## $ CHCCOPD1 : num 2 2 2 2 2 2 2 2 2 2 ...
## $ HAVARTH3 : num 2 2 1 1 2 2 1 1 2 2 ...
## $ ADDEPEV2 : num 2 1 2 2 2 2 2 1 2 2 ...
## $ CHCKIDNY : num 2 2 2 2 2 2 2 2 2 2 ...
## $ DIABETE3 : num 3 3 3 1 3 3 3 1 3 3 ...
## $ DIABAGE2 : num NA NA NA 62 NA NA NA 75 NA NA ...
## $ LASTDEN3 : num 1 1 2 1 1 3 1 2 1 1 ...
## $ RMVTETH3 : num 1 2 8 2 8 1 8 8 8 1 ...
## $ SEX : num 1 2 2 1 1 2 2 2 2 1 ...
## $ MARITAL : num 1 1 3 1 5 3 1 1 1 5 ...
## $ EDUCA : num 4 4 5 6 4 4 4 5 5 4 ...
## $ RENTHOM1 : num 1 1 1 1 3 1 1 1 1 1 ...
## $ NUMHHOL2 : num 2 2 2 2 2 2 2 2 2 2 ...
## $ NUMPHON2 : num NA NA NA NA NA NA NA NA NA NA ...
## $ CPDEMO1 : num 1 1 1 1 1 1 1 1 1 1 ...
## $ VETERAN3 : num 1 2 2 1 2 2 2 2 2 2 ...
## $ EMPLOY1 : num 1 5 7 7 1 1 7 7 1 2 ...
## $ CHILDREN : num 3 88 88 88 1 88 88 88 88 88 ...
## $ INCOME2 : num 5 7 7 7 77 99 77 7 8 6 ...
## $ INTERNET : num 1 1 1 1 1 1 1 1 1 2 ...
## $ WEIGHT2 : num 135 153 120 220 125 ...
## $ HEIGHT3 : num 508 503 505 601 506 507 501 503 506 600 ...
## $ PREGNANT : num NA NA NA NA NA NA NA NA 2 NA ...
## $ DEAF : num 2 2 2 2 2 2 2 2 2 2 ...
## $ BLIND : num 2 2 2 2 2 2 2 2 2 2 ...
## $ DECIDE : num 2 2 2 2 2 2 2 2 2 2 ...
## $ DIFFWALK : num 2 2 2 2 2 2 2 1 2 2 ...
## $ DIFFDRES : num 2 2 2 2 2 2 2 2 2 2 ...
## $ DIFFALON : num 2 2 2 2 2 2 2 2 2 2 ...
## $ SMOKE100 : num 2 2 1 1 2 2 1 2 2 2 ...
## $ SMOKDAY2 : num NA NA 1 3 NA NA 3 NA NA NA ...
## $ STOPSMK2 : num NA NA 2 NA NA NA NA NA NA NA ...
## $ LASTSMK2 : num NA NA NA 7 NA NA 7 NA NA NA ...
## $ USENOW3 : num 3 3 3 3 3 3 3 3 3 3 ...
## $ ECIGARET : num 2 2 2 2 2 2 2 2 2 2 ...
## $ ECIGNOW : num NA NA NA NA NA NA NA NA NA NA ...
## $ ALCDAY5 : num 888 888 101 888 888 888 205 888 888 201 ...
## $ AVEDRNK2 : num NA NA 1 NA NA NA 1 NA NA 1 ...
## $ DRNK3GE5 : num NA NA 88 NA NA NA 88 NA NA 88 ...
## $ MAXDRNKS : num NA NA 2 NA NA NA 2 NA NA 1 ...
## $ FLUSHOT6 : num 1 2 1 1 2 2 1 1 1 1 ...
## $ FLSHTMY2 : num 12015 NA 102015 102015 NA ...
## $ PNEUVAC3 : num 7 7 1 1 2 1 1 1 2 2 ...
## $ TETANUS : num 3 4 4 4 4 4 7 4 1 3 ...
## $ FALL12MN : num NA 1 88 88 NA 1 88 1 NA 3 ...
## $ FALLINJ2 : num NA 88 NA NA NA 88 NA 1 NA 88 ...
## $ SEATBELT : num 5 1 1 5 1 1 1 1 1 1 ...
## $ DRNKDRI2 : num NA NA 88 NA NA NA 88 NA NA 88 ...
## $ HADMAM : num NA 1 1 NA NA 1 1 1 1 NA ...
## $ HOWLONG : num NA 1 4 NA NA 1 2 2 1 NA ...
## $ HADPAP2 : num NA 1 1 NA NA 1 1 1 1 NA ...
## $ LASTPAP2 : num NA 1 5 NA NA 5 7 5 1 NA ...
## $ HPVTEST : num NA 2 2 NA NA 2 2 2 1 NA ...
## $ HPLSTTST : num NA NA NA NA NA NA NA NA 1 NA ...
## $ HADHYST2 : num NA 1 2 NA NA 1 1 1 2 NA ...
## $ PCPSAAD2 : num 2 NA NA 1 NA NA NA NA NA 1 ...
## [list output truncated]
CDC$STATE[CDC$X_STATE=="1"] <- "Alabama"
CDC$STATE[CDC$X_STATE=="2"] <- "Alaska"
CDC$STATE[CDC$X_STATE=="4"] <- "Arizona"
CDC$STATE[CDC$X_STATE=="5"] <- "Arkansas"
CDC$STATE[CDC$X_STATE=="6"] <- "California"
CDC$STATE[CDC$X_STATE=="8"] <- "Colorado"
CDC$STATE[CDC$X_STATE=="9"] <- "Connecticut"
CDC$STATE[CDC$X_STATE=="10"] <- "Delaware"
CDC$STATE[CDC$X_STATE=="11"] <- "District of Columbia"
CDC$STATE[CDC$X_STATE=="12"] <- "Florida"
CDC$STATE[CDC$X_STATE=="13"] <- "Georgia"
CDC$STATE[CDC$X_STATE=="15"] <- "Hawaii"
CDC$STATE[CDC$X_STATE=="16"] <- "Idaho"
CDC$STATE[CDC$X_STATE=="17"] <- "Illinois"
CDC$STATE[CDC$X_STATE=="18"] <- "Indiana"
CDC$STATE[CDC$X_STATE=="19"] <- "Iowa"
CDC$STATE[CDC$X_STATE=="20"] <- "Kansas"
CDC$STATE[CDC$X_STATE=="21"] <- "Kentucky"
CDC$STATE[CDC$X_STATE=="22"] <- "Louisiana"
CDC$STATE[CDC$X_STATE=="23"] <- "Maine"
CDC$STATE[CDC$X_STATE=="24"] <- "Maryland"
CDC$STATE[CDC$X_STATE=="25"] <- "Massachusetts"
CDC$STATE[CDC$X_STATE=="26"] <- "Michigan"
CDC$STATE[CDC$X_STATE=="27"] <- "Minnesota"
CDC$STATE[CDC$X_STATE=="28"] <- "Mississippi"
CDC$STATE[CDC$X_STATE=="29"] <- "Missouri"
CDC$STATE[CDC$X_STATE=="30"] <- "Montana"
CDC$STATE[CDC$X_STATE=="31"] <- "Nebraska"
CDC$STATE[CDC$X_STATE=="32"] <- "Nevada"
CDC$STATE[CDC$X_STATE=="33"] <- "New Hampshire"
CDC$STATE[CDC$X_STATE=="34"] <- "New Jersey"
CDC$STATE[CDC$X_STATE=="35"] <- "New Mexico"
CDC$STATE[CDC$X_STATE=="36"] <- "New York"
CDC$STATE[CDC$X_STATE=="37"] <- "North Carolina"
CDC$STATE[CDC$X_STATE=="38"] <- "North Dakota"
CDC$STATE[CDC$X_STATE=="39"] <- "Ohio"
CDC$STATE[CDC$X_STATE=="40"] <- "Oklahoma"
CDC$STATE[CDC$X_STATE=="41"] <- "Oregon"
CDC$STATE[CDC$X_STATE=="42"] <- "Pennsylvania"
CDC$STATE[CDC$X_STATE=="44"] <- "Rhode Island"
CDC$STATE[CDC$X_STATE=="45"] <- "South Carolina"
CDC$STATE[CDC$X_STATE=="46"] <- "South Dakota"
CDC$STATE[CDC$X_STATE=="47"] <- "Tennessee"
CDC$STATE[CDC$X_STATE=="48"] <- "Texas"
CDC$STATE[CDC$X_STATE=="49"] <- "Utah"
CDC$STATE[CDC$X_STATE=="50"] <- "Vermont"
CDC$STATE[CDC$X_STATE=="51"] <- "Virginia"
CDC$STATE[CDC$X_STATE=="53"] <- "Washington"
CDC$STATE[CDC$X_STATE=="54"] <- "West Virginia"
CDC$STATE[CDC$X_STATE=="55"] <- "Wisconsin"
CDC$STATE[CDC$X_STATE=="56"] <- "Wyoming"
CDC$STATE[CDC$X_STATE=="66"] <- "Guam"
CDC$STATE[CDC$X_STATE=="72"] <- "Puerto Rico"
CDC$STATE[CDC$X_STATE=="78"] <- "Virgin Islands"
Creating a subset of the data: I’m using dplyr to select thirteen (13) variables
CDC_New <- CDC %>%
  select(STATE, VETERAN3, GENHLTH, CHECKUP1, CVDINFR4, CVDCRHD4, CVDSTRK3, SEX, MARITAL, EDUCA, INCOME2, WEIGHT2, SMOKDAY2)
Data Cleaning and Preprocessing
Cleaning the data by removing rows that contain NA values
CDC_New1 <- na.omit(CDC_New)
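Note that na.omit() drops every row with at least one NA across the thirteen selected columns. Before dropping, a quick base-R check of how much each column contributes can be worthwhile:
# NAs contributed by each selected column
colSums(is.na(CDC_New))
# number of complete rows that will survive na.omit()
sum(complete.cases(CDC_New))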
Let’s view the new structure
str(CDC_New1)
## 'data.frame': 204093 obs. of 13 variables:
## $ STATE : chr "Alabama" "Alabama" "Alabama" "Alabama" ...
## $ VETERAN3: num 2 1 2 2 2 2 1 2 2 2 ...
## $ GENHLTH : num 3 3 3 4 3 2 3 3 3 4 ...
## $ CHECKUP1: num 1 1 1 1 3 3 1 2 1 2 ...
## $ CVDINFR4: num 1 2 2 2 2 2 2 2 2 2 ...
## $ CVDCRHD4: num 2 2 2 2 2 2 2 2 2 2 ...
## $ CVDSTRK3: num 2 2 2 2 2 2 2 2 1 2 ...
## $ SEX : num 2 1 2 2 2 2 1 1 1 2 ...
## $ MARITAL : num 3 1 1 2 1 1 3 1 1 2 ...
## $ EDUCA : num 5 6 4 3 6 6 6 3 5 6 ...
## $ INCOME2 : num 7 7 77 1 3 8 99 5 8 77 ...
## $ WEIGHT2 : num 120 220 130 145 215 141 190 140 165 200 ...
## $ SMOKDAY2: num 1 3 3 1 3 3 3 2 3 3 ...
## - attr(*, "na.action")= 'omit' Named int 1 2 5 6 8 9 10 11 12 13 ...
## ..- attr(*, "names")= chr "1" "2" "5" "6" ...
Tidy the data using the gather() function of the tidyr package to transform the data frame from wide to long format
long_CDC_New1 <- CDC_New1 %>% gather(CVD, PRESENT, CVDINFR4,CVDCRHD4,CVDSTRK3)
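gather() works here, but it has been superseded in tidyr 1.0.0 and later by pivot_longer(); assuming such a version is installed, an equivalent call (up to column ordering) would be:
# newer tidyr equivalent of the gather() call above
long_CDC_New1 <- CDC_New1 %>%
  pivot_longer(c(CVDINFR4, CVDCRHD4, CVDSTRK3),
               names_to = "CVD", values_to = "PRESENT")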
head(long_CDC_New1, 24)
## STATE VETERAN3 GENHLTH CHECKUP1 SEX MARITAL EDUCA INCOME2 WEIGHT2
## 1 Alabama 2 3 1 2 3 5 7 120
## 2 Alabama 1 3 1 1 1 6 7 220
## 3 Alabama 2 3 1 2 1 4 77 130
## 4 Alabama 2 4 1 2 2 3 1 145
## 5 Alabama 2 3 3 2 1 6 3 215
## 6 Alabama 2 2 3 2 1 6 8 141
## 7 Alabama 1 3 1 1 3 6 99 190
## 8 Alabama 2 3 2 1 1 3 5 140
## 9 Alabama 2 3 1 1 1 5 8 165
## 10 Alabama 2 4 2 2 2 6 77 200
## 11 Alabama 2 2 1 2 3 5 6 120
## 12 Alabama 2 3 1 1 6 4 4 280
## 13 Alabama 2 3 1 1 1 4 5 255
## 14 Alabama 2 4 1 2 1 3 3 130
## 15 Alabama 2 3 1 1 2 5 2 170
## 16 Alabama 2 2 2 2 1 6 8 225
## 17 Alabama 2 1 3 2 1 4 8 140
## 18 Alabama 2 3 1 1 2 5 6 238
## 19 Alabama 2 1 1 2 3 5 77 148
## 20 Alabama 2 3 3 1 2 4 5 190
## 21 Alabama 2 5 1 1 2 5 5 240
## 22 Alabama 1 1 1 1 1 6 77 190
## 23 Alabama 2 3 1 1 9 3 99 155
## 24 Alabama 2 2 1 1 3 3 3 230
## SMOKDAY2 CVD PRESENT
## 1 1 CVDINFR4 1
## 2 3 CVDINFR4 2
## 3 3 CVDINFR4 2
## 4 1 CVDINFR4 2
## 5 3 CVDINFR4 2
## 6 3 CVDINFR4 2
## 7 3 CVDINFR4 2
## 8 2 CVDINFR4 2
## 9 3 CVDINFR4 2
## 10 3 CVDINFR4 2
## 11 3 CVDINFR4 2
## 12 2 CVDINFR4 7
## 13 3 CVDINFR4 2
## 14 2 CVDINFR4 2
## 15 2 CVDINFR4 2
## 16 3 CVDINFR4 2
## 17 3 CVDINFR4 2
## 18 1 CVDINFR4 2
## 19 3 CVDINFR4 2
## 20 3 CVDINFR4 2
## 21 1 CVDINFR4 2
## 22 3 CVDINFR4 2
## 23 2 CVDINFR4 2
## 24 1 CVDINFR4 2
Visualizing CVD by state
ggplot(long_CDC_New1, aes(x = STATE, fill = CVD)) + geom_bar() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  labs(x = "States", y = "Count of Cardiovascular Disease Responses")
Write CSV in R
write.csv(long_CDC_New1, file = "CDC_BRFSS_Survey.csv")
I used a long format for my final data frame, where I compressed the three columns of cardiovascular disease (CVDINFR4, CVDCRHD4, CVDSTRK3) into one cardiovascular disease (CVD) column: CVDINFR4 is a heart attack, also called a myocardial infarction; CVDCRHD4 is angina or coronary heart disease; and CVDSTRK3 is a stroke. According to the survey and the chart above, Florida has the highest count of reported cardiovascular disease responses, followed by New York, while the Virgin Islands has the fewest. This was very messy data with several levels of codes to clean and transform; for example, I had to map the states from the provided numeric codes. I chose thirteen (13) of the 275 variables because they make the most sense to me out of the lot.
Data for the Sustainable Development Goals (SDGs) The SDGs comprise 17 goals, of which the first is to “end poverty in all its forms everywhere”. The authors of the article that is the source of this data combined country-specific historical estimates of the distribution of income, using Beta–Lorenz curves, with projections of population changes by age and educational attainment level, as well as GDP projections, to provide the first set of internally consistent poverty projections for all countries of the world. Making use of demographic and economic projections developed in the context of the Intergovernmental Panel on Climate Change’s Shared Socioeconomic Pathways (SSPs), the authors created poverty paths by country up to the year 2030. Extreme poverty was defined as living on less than $1.90 a day, measured in 2011 Purchasing Power Parity (PPP) prices. I intend to analyze and perform data munging, data transformation, and visualization with the results of their projections.
The data consist of the following scenarios:
SSP1 = low challenges for both climate change adaptation and mitigation, resulting from income growth that does not rely heavily on natural resources, and technological change, coupled with low fertility rates and high educational attainment.
SSP2 = the benchmark scenario, which assumes the continuation of current socioeconomic trends at the global level.
SSP3 = low economic growth coupled with low educational attainment levels and high population growth at the global level are the main elements of the narrative, which is characterized by high mitigation and adaptation challenges.
SSP4 = a narrative of worldwide polarization, with high-income countries exhibiting relatively high rates of income growth, while developing economies present low levels of education, high fertility, and economic stagnation.
SSP5 = high economic growth coupled with high demand for fossil energy from developing economies, thus increasing global carbon dioxide emissions.
Importing the data from my GitHub
SDG <- read.csv("https://raw.githubusercontent.com/Emahayz/Data-607-Class/master/data_poverty_gdppc.csv", header = T, sep = ",")
Let’s view the structure
str(SDG)
## 'data.frame': 15980 obs. of 6 variables:
## $ ccode : Factor w/ 188 levels "AFG","AGO","ALB",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ year : int 2015 2015 2015 2015 2015 2016 2016 2016 2016 2016 ...
## $ ssp : Factor w/ 5 levels "SSP1","SSP2",..: 1 2 3 4 5 1 2 3 4 5 ...
## $ hc.source : Factor w/ 2 levels "regression","survey": 1 1 1 1 1 1 1 1 1 1 ...
## $ extreme.Poor: int 12698235 12698235 12698235 12698235 12698235 12888301 12888301 12888301 12888301 12888301 ...
## $ gdp.capita : num 1791 1791 1791 1791 1791 ...
head(SDG)
## ccode year ssp hc.source extreme.Poor gdp.capita
## 1 AFG 2015 SSP1 regression 12698235 1790.51
## 2 AFG 2015 SSP2 regression 12698235 1790.51
## 3 AFG 2015 SSP3 regression 12698235 1790.51
## 4 AFG 2015 SSP4 regression 12698235 1790.51
## 5 AFG 2015 SSP5 regression 12698235 1790.51
## 6 AFG 2016 SSP1 regression 12888301 1780.16
Some Visualization: Scatter Plot
ggplot(SDG, aes(x = gdp.capita, y = extreme.Poor, shape = ssp, color = ssp)) + geom_point() +
  labs(
    title = "Global Poverty Projection",
    x = "GDP per Capita", y = "Extreme Poverty"
  )
The scatter plot shows that as GDP per capita decreases, extreme poverty increases, most visibly in scenarios SSP3 through SSP5. This data appears to be in long format; I will transform it to wide format.
SDGWide <- SDG %>% spread(ssp, extreme.Poor)
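As with gather(), spread() has a newer counterpart in tidyr 1.0.0+; assuming such a version, pivot_wider() performs the same long-to-wide reshape:
# newer tidyr equivalent of the spread() call above
SDGWide <- SDG %>%
  pivot_wider(names_from = ssp, values_from = extreme.Poor)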
Cleaning the data by removing the missing values
SDGWide_New <- na.omit(SDGWide)
View the data again
head(SDGWide_New)
## ccode year hc.source gdp.capita SSP1 SSP2 SSP3 SSP4
## 1 AFG 2015 regression 1790.51 12698235 12698235 12698235 12698235
## 2 AFG 2016 regression 1780.16 12888301 12888301 12888301 12888301
## 3 AFG 2017 regression 1790.68 12789616 12789616 12789616 12789616
## 4 AFG 2018 regression 1812.69 12532599 12532599 12532599 12532599
## 5 AFG 2019 regression 1845.41 12149180 12149180 12149180 12149180
## 6 AFG 2020 regression 1888.56 11666924 11666924 11666924 11666924
## SSP5
## 1 12698235
## 2 12888301
## 3 12789616
## 4 12532599
## 5 12149180
## 6 11666924
The average GDP per capita is just under $20,000 ($19,106)
summary(SDGWide_New$gdp.capita)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 429.8 3841.4 11977.2 19106.0 27007.5 138404.0
Histogram showing GDP per Capita
hist(SDGWide_New$gdp.capita)
Question: What percentage of these projections were made using a regression model?
table(SDGWide_New$hc.source)
##
## regression survey
## 185 1364
round(prop.table(table(SDGWide_New$hc.source)) * 100, digits = 1)
##
## regression survey
## 11.9 88.1
About 12% of the extreme poverty projections were made using a regression model, while 88% of the projections were survey-based.
Show the Annual GDP per Capita of Nigeria (NGA) and Uganda (UGA)
First I will need to subset the data and create a data frame for Nigeria and Uganda
SDGdata1 <- subset(SDGWide_New, ccode %in% c("NGA", "UGA"))
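For consistency with the dplyr pipelines used elsewhere, an equivalent filter() version (identical apart from row names) would be:
# dplyr equivalent of the subset() call above
SDGdata1 <- SDGWide_New %>%
  filter(ccode %in% c("NGA", "UGA"))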
SDGdata1
## ccode year hc.source gdp.capita SSP1 SSP2 SSP3 SSP4
## 6412 NGA 2015 survey 5638.93 70203439 70203439 70203439 70203439
## 6413 NGA 2016 survey 5409.98 76901167 76901167 76901167 76901167
## 6414 NGA 2017 survey 5317.09 80989339 80989339 80989339 80989339
## 6415 NGA 2018 survey 5282.29 83881602 83881602 83881602 83881602
## 6416 NGA 2019 survey 5247.83 86845070 86845070 86845070 86845070
## 6417 NGA 2020 survey 5211.18 89949154 89949154 89949154 89949154
## 6418 NGA 2021 survey 5176.03 93102799 93102799 93102799 93102799
## 6419 NGA 2022 survey 5142.19 96313552 96313552 96313552 96313552
## 9043 UGA 2015 survey 1928.27 12490438 12490438 12490438 12490438
## 9044 UGA 2016 survey 1953.63 12638075 12638075 12638075 12638075
## 9045 UGA 2017 survey 1985.99 12712471 12712471 12712471 12712471
## 9046 UGA 2018 survey 2033.55 12624153 12624153 12624153 12624153
## 9047 UGA 2019 survey 2090.75 12432557 12432557 12432557 12432557
## 9048 UGA 2020 survey 2155.42 12162143 12162143 12162143 12162143
## 9049 UGA 2021 survey 2234.61 11748949 11748949 11748949 11748949
## 9050 UGA 2022 survey 2341.04 11084169 11084169 11084169 11084169
## SSP5
## 6412 70203439
## 6413 76901167
## 6414 80989339
## 6415 83881602
## 6416 86845070
## 6417 89949154
## 6418 93102799
## 6419 96313552
## 9043 12490438
## 9044 12638075
## 9045 12712471
## 9046 12624153
## 9047 12432557
## 9048 12162143
## 9049 11748949
## 9050 11084169
Some Visualization: Bar Plot of Annual GDP per Capita of Nigeria and Uganda
ggplot(SDGdata1, aes(x = ccode, y = gdp.capita, color = ccode)) +
  geom_bar(stat = "identity", fill = "white") + facet_wrap(~year) +
  labs(
    title = "Annual GDP per Capita of Nigeria and Uganda",
    x = "Country Code", y = "GDP per Capita"
  )
Write CSV in R
write.csv(SDGdata1, file = "SDG of Nigeria_Uganda.csv")
Banking Promotion for Term Deposit
Importing the data from my GitHub
Bank <- read.csv("https://raw.githubusercontent.com/Emahayz/Data-607-Class/master/BankData.csv", header = T, sep = ",")
Let’s view the structure and the first few rows
str(Bank)
## 'data.frame': 4521 obs. of 17 variables:
## $ age : int 30 33 35 30 59 35 36 39 41 43 ...
## $ job : Factor w/ 12 levels "admin.","blue-collar",..: 11 8 5 5 2 5 7 10 3 8 ...
## $ marital : Factor w/ 3 levels "divorced","married",..: 2 2 3 2 2 3 2 2 2 2 ...
## $ education : Factor w/ 4 levels "primary","secondary",..: 1 2 3 3 2 3 3 2 3 1 ...
## $ default : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ balance : int 1787 4789 1350 1476 0 747 307 147 221 -88 ...
## $ housing : Factor w/ 2 levels "no","yes": 1 2 2 2 2 1 2 2 2 2 ...
## $ loan : Factor w/ 2 levels "no","yes": 1 2 1 2 1 1 1 1 1 2 ...
## $ contact : Factor w/ 3 levels "cellular","telephone",..: 1 1 1 3 3 1 1 1 3 1 ...
## $ day : int 19 11 16 3 5 23 14 6 14 17 ...
## $ month : Factor w/ 12 levels "apr","aug","dec",..: 11 9 1 7 9 4 9 9 9 1 ...
## $ duration : int 79 220 185 199 226 141 341 151 57 313 ...
## $ campaign : int 1 1 1 4 1 2 1 2 2 1 ...
## $ pdays : int -1 339 330 -1 -1 176 330 -1 -1 147 ...
## $ previous : int 0 4 1 0 0 3 2 0 0 2 ...
## $ poutcome : Factor w/ 4 levels "failure","other",..: 4 1 1 4 4 1 2 4 4 1 ...
## $ term_deposit: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
head(Bank)
## age job marital education default balance housing loan contact
## 1 30 unemployed married primary no 1787 no no cellular
## 2 33 services married secondary no 4789 yes yes cellular
## 3 35 management single tertiary no 1350 yes no cellular
## 4 30 management married tertiary no 1476 yes yes unknown
## 5 59 blue-collar married secondary no 0 yes no unknown
## 6 35 management single tertiary no 747 no no cellular
## day month duration campaign pdays previous poutcome term_deposit
## 1 19 oct 79 1 -1 0 unknown no
## 2 11 may 220 1 339 4 failure no
## 3 16 apr 185 1 330 1 failure no
## 4 3 jun 199 4 -1 0 unknown no
## 5 5 may 226 1 -1 0 unknown no
## 6 23 feb 141 2 176 3 failure no
table(Bank$term_deposit)
##
## no yes
## 4000 521
round(prop.table(table(Bank$term_deposit)) * 100, digits = 1)
##
## no yes
## 88.5 11.5
4,000 customers do not currently have a term deposit, which corresponds to about 88.5% of the customers
Recoding the term deposit variable as numeric (1 = yes, 0 = no)
Bank$term_deposit <- ifelse(Bank$term_deposit == "yes", 1, 0)
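The same recode can also be written in dplyr style; note this is an alternative to, not a follow-up of, the ifelse() above, since term_deposit is already numeric once either version has run:
# run in place of the ifelse() recode above, not after it
Bank <- Bank %>%
  mutate(term_deposit = if_else(term_deposit == "yes", 1, 0))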
names(Bank)
## [1] "age" "job" "marital" "education"
## [5] "default" "balance" "housing" "loan"
## [9] "contact" "day" "month" "duration"
## [13] "campaign" "pdays" "previous" "poutcome"
## [17] "term_deposit"
str(Bank)
## 'data.frame': 4521 obs. of 17 variables:
## $ age : int 30 33 35 30 59 35 36 39 41 43 ...
## $ job : Factor w/ 12 levels "admin.","blue-collar",..: 11 8 5 5 2 5 7 10 3 8 ...
## $ marital : Factor w/ 3 levels "divorced","married",..: 2 2 3 2 2 3 2 2 2 2 ...
## $ education : Factor w/ 4 levels "primary","secondary",..: 1 2 3 3 2 3 3 2 3 1 ...
## $ default : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ balance : int 1787 4789 1350 1476 0 747 307 147 221 -88 ...
## $ housing : Factor w/ 2 levels "no","yes": 1 2 2 2 2 1 2 2 2 2 ...
## $ loan : Factor w/ 2 levels "no","yes": 1 2 1 2 1 1 1 1 1 2 ...
## $ contact : Factor w/ 3 levels "cellular","telephone",..: 1 1 1 3 3 1 1 1 3 1 ...
## $ day : int 19 11 16 3 5 23 14 6 14 17 ...
## $ month : Factor w/ 12 levels "apr","aug","dec",..: 11 9 1 7 9 4 9 9 9 1 ...
## $ duration : int 79 220 185 199 226 141 341 151 57 313 ...
## $ campaign : int 1 1 1 4 1 2 1 2 2 1 ...
## $ pdays : int -1 339 330 -1 -1 176 330 -1 -1 147 ...
## $ previous : int 0 4 1 0 0 3 2 0 0 2 ...
## $ poutcome : Factor w/ 4 levels "failure","other",..: 4 1 1 4 4 1 2 4 4 1 ...
## $ term_deposit: num 0 0 0 0 0 0 0 0 0 0 ...
Visualization
ggplot(data = Bank, aes(x = age)) + geom_histogram(position = "identity", alpha = 0.5) +
  labs(title = "Age of Customers", x = "Age", y = "Frequency")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Transform the data to wide format
Bank_Wide <- Bank %>% spread(marital, balance)
View the data again
head(Bank_Wide)
## age job education default housing loan contact day month
## 1 30 unemployed primary no no no cellular 19 oct
## 2 33 services secondary no yes yes cellular 11 may
## 3 35 management tertiary no yes no cellular 16 apr
## 4 30 management tertiary no yes yes unknown 3 jun
## 5 59 blue-collar secondary no yes no unknown 5 may
## 6 35 management tertiary no no no cellular 23 feb
## duration campaign pdays previous poutcome term_deposit divorced married
## 1 79 1 -1 0 unknown 0 NA 1787
## 2 220 1 339 4 failure 0 NA 4789
## 3 185 1 330 1 failure 0 NA NA
## 4 199 4 -1 0 unknown 0 NA 1476
## 5 226 1 -1 0 unknown 0 NA 0
## 6 141 2 176 3 failure 0 NA NA
## single
## 1 NA
## 2 NA
## 3 1350
## 4 NA
## 5 NA
## 6 747
Write CSV in R
write.csv(Bank_Wide, file = "Banking_Promotion.csv")