Data607_Project2

Data 1

The metropolitan or micropolitan statistics areas (MMSAs) data based were selected among the Survey of Metropolitan/Micropolitan Area Risk Trends (SMART) BRFSS MMSA data. The SMART comprised of 486238 or more participants who responded in the 2016 BRFSS combined landline telephone and cellular telephone survey. The aggregate Behavioral Risk Factor Surveillance Survey (BRFSS) landline and cell phone dataset was built from the landline and cell phone data submitted for 2016 and includes data for 50 states, the District of Columbia, Guam, and Puerto Rico. The dataset is available at https://www.cdc.gov/brfss/annual_data/annual_2016.html The file for the combined landline and cell phone data set was exported from SAS V9.3 in the XPT transport format. This file contains 275 variables, this dataset is very untidy, I was able to import the XPT transport format to R. Some of the variable labels are truncated in the process of converting the XPT format. Tidying and transforming this data is required for proper analysis.

getwd()

## [1] "C:/Users/Emahayz_Pro/Desktop/Data_Science/Data 607/Week6"

setwd("C:/Users/Emahayz_Pro/Desktop/Data_Science/Data 607/Week6")

library(foreign)
library(tidyr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)

This data is too large, GitHub rejected it due to size limitation Dataset has 486,303 observations with 275 variables

CDC <- read.xport("LLCP2016.XPT", sep = ",") 
str(CDC)

## 'data.frame':    486303 obs. of  275 variables:
##  $ X_STATE  : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ FMONTH   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ IDATE    : Factor w/ 419 levels "01012017","01022017",..: 10 18 8 12 6 10 12 10 10 52 ...
##  $ IMONTH   : Factor w/ 12 levels "01","02","03",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ IDAY     : Factor w/ 31 levels "01","02","03",..: 7 11 6 8 5 7 8 7 7 28 ...
##  $ IYEAR    : Factor w/ 2 levels "2016","2017": 1 1 1 1 1 1 1 1 1 1 ...
##  $ DISPCODE : num  1100 1100 1100 1100 1100 1100 1100 1100 1100 1100 ...
##  $ SEQNO    : Factor w/ 36955 levels "2016000001","2016000002",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ X_PSU    : num  2.02e+09 2.02e+09 2.02e+09 2.02e+09 2.02e+09 ...
##  $ CTELENM1 : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ PVTRESD1 : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ COLGHOUS : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ STATERES : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ CELLFON4 : num  2 2 2 2 2 2 2 2 2 2 ...
##  $ LADULT   : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ NUMADULT : num  2 2 1 2 3 3 2 2 3 1 ...
##  $ NUMMEN   : num  1 1 0 1 2 1 1 1 1 1 ...
##  $ NUMWOMEN : num  1 1 1 1 1 2 1 1 2 0 ...
##  $ CTELNUM1 : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ CELLFON5 : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ CADULT   : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ PVTRESD3 : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ CCLGHOUS : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ CSTATE1  : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ LANDLINE : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ HHADULT  : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ GENHLTH  : num  1 2 3 3 5 3 3 3 2 2 ...
##  $ PHYSHLTH : num  88 88 88 88 10 88 88 88 88 88 ...
##  $ MENTHLTH : num  88 88 1 88 3 88 88 88 88 88 ...
##  $ POORHLTH : num  NA NA 88 NA 3 NA NA NA NA NA ...
##  $ HLTHPLN1 : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ PERSDOC2 : num  2 1 2 1 1 1 2 1 1 1 ...
##  $ MEDCOST  : num  2 2 2 2 2 2 2 2 2 2 ...
##  $ CHECKUP1 : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ EXERANY2 : num  1 1 1 1 1 2 1 1 1 1 ...
##  $ SLEPTIM1 : num  5 6 9 7 6 8 7 8 8 7 ...
##  $ CVDINFR4 : num  2 2 1 2 2 2 2 2 2 2 ...
##  $ CVDCRHD4 : num  2 2 2 2 2 2 2 2 2 2 ...
##  $ CVDSTRK3 : num  2 2 2 2 2 2 2 2 2 2 ...
##  $ ASTHMA3  : num  2 2 2 2 1 2 2 1 2 2 ...
##  $ ASTHNOW  : num  NA NA NA NA 2 NA NA 1 NA NA ...
##  $ CHCSCNCR : num  2 2 1 1 2 2 1 1 2 2 ...
##  $ CHCOCNCR : num  2 2 2 2 2 2 2 2 2 2 ...
##  $ CHCCOPD1 : num  2 2 2 2 2 2 2 2 2 2 ...
##  $ HAVARTH3 : num  2 2 1 1 2 2 1 1 2 2 ...
##  $ ADDEPEV2 : num  2 1 2 2 2 2 2 1 2 2 ...
##  $ CHCKIDNY : num  2 2 2 2 2 2 2 2 2 2 ...
##  $ DIABETE3 : num  3 3 3 1 3 3 3 1 3 3 ...
##  $ DIABAGE2 : num  NA NA NA 62 NA NA NA 75 NA NA ...
##  $ LASTDEN3 : num  1 1 2 1 1 3 1 2 1 1 ...
##  $ RMVTETH3 : num  1 2 8 2 8 1 8 8 8 1 ...
##  $ SEX      : num  1 2 2 1 1 2 2 2 2 1 ...
##  $ MARITAL  : num  1 1 3 1 5 3 1 1 1 5 ...
##  $ EDUCA    : num  4 4 5 6 4 4 4 5 5 4 ...
##  $ RENTHOM1 : num  1 1 1 1 3 1 1 1 1 1 ...
##  $ NUMHHOL2 : num  2 2 2 2 2 2 2 2 2 2 ...
##  $ NUMPHON2 : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ CPDEMO1  : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ VETERAN3 : num  1 2 2 1 2 2 2 2 2 2 ...
##  $ EMPLOY1  : num  1 5 7 7 1 1 7 7 1 2 ...
##  $ CHILDREN : num  3 88 88 88 1 88 88 88 88 88 ...
##  $ INCOME2  : num  5 7 7 7 77 99 77 7 8 6 ...
##  $ INTERNET : num  1 1 1 1 1 1 1 1 1 2 ...
##  $ WEIGHT2  : num  135 153 120 220 125 ...
##  $ HEIGHT3  : num  508 503 505 601 506 507 501 503 506 600 ...
##  $ PREGNANT : num  NA NA NA NA NA NA NA NA 2 NA ...
##  $ DEAF     : num  2 2 2 2 2 2 2 2 2 2 ...
##  $ BLIND    : num  2 2 2 2 2 2 2 2 2 2 ...
##  $ DECIDE   : num  2 2 2 2 2 2 2 2 2 2 ...
##  $ DIFFWALK : num  2 2 2 2 2 2 2 1 2 2 ...
##  $ DIFFDRES : num  2 2 2 2 2 2 2 2 2 2 ...
##  $ DIFFALON : num  2 2 2 2 2 2 2 2 2 2 ...
##  $ SMOKE100 : num  2 2 1 1 2 2 1 2 2 2 ...
##  $ SMOKDAY2 : num  NA NA 1 3 NA NA 3 NA NA NA ...
##  $ STOPSMK2 : num  NA NA 2 NA NA NA NA NA NA NA ...
##  $ LASTSMK2 : num  NA NA NA 7 NA NA 7 NA NA NA ...
##  $ USENOW3  : num  3 3 3 3 3 3 3 3 3 3 ...
##  $ ECIGARET : num  2 2 2 2 2 2 2 2 2 2 ...
##  $ ECIGNOW  : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ ALCDAY5  : num  888 888 101 888 888 888 205 888 888 201 ...
##  $ AVEDRNK2 : num  NA NA 1 NA NA NA 1 NA NA 1 ...
##  $ DRNK3GE5 : num  NA NA 88 NA NA NA 88 NA NA 88 ...
##  $ MAXDRNKS : num  NA NA 2 NA NA NA 2 NA NA 1 ...
##  $ FLUSHOT6 : num  1 2 1 1 2 2 1 1 1 1 ...
##  $ FLSHTMY2 : num  12015 NA 102015 102015 NA ...
##  $ PNEUVAC3 : num  7 7 1 1 2 1 1 1 2 2 ...
##  $ TETANUS  : num  3 4 4 4 4 4 7 4 1 3 ...
##  $ FALL12MN : num  NA 1 88 88 NA 1 88 1 NA 3 ...
##  $ FALLINJ2 : num  NA 88 NA NA NA 88 NA 1 NA 88 ...
##  $ SEATBELT : num  5 1 1 5 1 1 1 1 1 1 ...
##  $ DRNKDRI2 : num  NA NA 88 NA NA NA 88 NA NA 88 ...
##  $ HADMAM   : num  NA 1 1 NA NA 1 1 1 1 NA ...
##  $ HOWLONG  : num  NA 1 4 NA NA 1 2 2 1 NA ...
##  $ HADPAP2  : num  NA 1 1 NA NA 1 1 1 1 NA ...
##  $ LASTPAP2 : num  NA 1 5 NA NA 5 7 5 1 NA ...
##  $ HPVTEST  : num  NA 2 2 NA NA 2 2 2 1 NA ...
##  $ HPLSTTST : num  NA NA NA NA NA NA NA NA 1 NA ...
##  $ HADHYST2 : num  NA 1 2 NA NA 1 1 1 2 NA ...
##  $ PCPSAAD2 : num  2 NA NA 1 NA NA NA NA NA 1 ...
##   [list output truncated]

Transforming the state variable to factor

CDC$STATE[CDC$X_STATE=="1"] <- "Alabama"
CDC$STATE[CDC$X_STATE=="2"] <- "Alaska"
CDC$STATE[CDC$X_STATE=="4"] <- "Arizona"
CDC$STATE[CDC$X_STATE=="5"] <- "Arkansas"
CDC$STATE[CDC$X_STATE=="6"] <- "California"
CDC$STATE[CDC$X_STATE=="8"] <- "Colorado"
CDC$STATE[CDC$X_STATE=="9"] <- "Connecticut"
CDC$STATE[CDC$X_STATE=="10"] <- "Delaware"
CDC$STATE[CDC$X_STATE=="11"] <- "District of Columbia"
CDC$STATE[CDC$X_STATE=="12"] <- "Florida"
CDC$STATE[CDC$X_STATE=="13"] <- "Georgia"
CDC$STATE[CDC$X_STATE=="15"] <- "Hawaii"
CDC$STATE[CDC$X_STATE=="16"] <- "Idaho"
CDC$STATE[CDC$X_STATE=="17"] <- "Illinois"
CDC$STATE[CDC$X_STATE=="18"] <- "Indiana"
CDC$STATE[CDC$X_STATE=="19"] <- "Iowa"
CDC$STATE[CDC$X_STATE=="20"] <- "Kansas"
CDC$STATE[CDC$X_STATE=="21"] <- "Kentucky"
CDC$STATE[CDC$X_STATE=="22"] <- "Louisiana"
CDC$STATE[CDC$X_STATE=="23"] <- "Maine"
CDC$STATE[CDC$X_STATE=="24"] <- "Maryland"
CDC$STATE[CDC$X_STATE=="25"] <- "Massachusetts"
CDC$STATE[CDC$X_STATE=="26"] <- "Michigan"
CDC$STATE[CDC$X_STATE=="27"] <- "Minnesota"
CDC$STATE[CDC$X_STATE=="28"] <- "Mississippi"
CDC$STATE[CDC$X_STATE=="29"] <- "Missouri"
CDC$STATE[CDC$X_STATE=="30"] <- "Montana"
CDC$STATE[CDC$X_STATE=="31"] <- "Nebraska"
CDC$STATE[CDC$X_STATE=="32"] <- "Nevada"
CDC$STATE[CDC$X_STATE=="33"] <- "New Hampshire"
CDC$STATE[CDC$X_STATE=="34"] <- "New Jersey"
CDC$STATE[CDC$X_STATE=="35"] <- "New Mexico"
CDC$STATE[CDC$X_STATE=="36"] <- "New York"
CDC$STATE[CDC$X_STATE=="37"] <- "North Carolina"
CDC$STATE[CDC$X_STATE=="38"] <- "North Dakota"
CDC$STATE[CDC$X_STATE=="39"] <- "Ohio"
CDC$STATE[CDC$X_STATE=="40"] <- "Oklahoma"
CDC$STATE[CDC$X_STATE=="41"] <- "Oregon"
CDC$STATE[CDC$X_STATE=="42"] <- "Pennsylvania"
CDC$STATE[CDC$X_STATE=="44"] <- "Rhode Island"
CDC$STATE[CDC$X_STATE=="45"] <- "South Carolina"
CDC$STATE[CDC$X_STATE=="46"] <- "South Dakota"
CDC$STATE[CDC$X_STATE=="47"] <- "Tennessee"
CDC$STATE[CDC$X_STATE=="48"] <- "Texas"
CDC$STATE[CDC$X_STATE=="49"] <- "Utah"
CDC$STATE[CDC$X_STATE=="50"] <- "Vermont"
CDC$STATE[CDC$X_STATE=="51"] <- "Virginia"
CDC$STATE[CDC$X_STATE=="53"] <- "Washington"
CDC$STATE[CDC$X_STATE=="54"] <- "West Virginia"
CDC$STATE[CDC$X_STATE=="55"] <- "Wisconsin"
CDC$STATE[CDC$X_STATE=="56"] <- "Wyoming"
CDC$STATE[CDC$X_STATE=="66"] <- "Guam"
CDC$STATE[CDC$X_STATE=="72"] <- "Puerto Rico"
CDC$STATE[CDC$X_STATE=="78"] <- "Virgin Islands"

Creating a subset of the data: I’m using dplyr for subseting with 11 variables

CDC_New <- CDC%>%
  select(STATE, VETERAN3,GENHLTH,CHECKUP1,CVDINFR4,CVDCRHD4,CVDSTRK3,SEX, MARITAL,EDUCA,INCOME2,WEIGHT2,SMOKDAY2)

Data Cleaning and Preprocessing Cleaning the data by removing NA

CDC_New1 <- na.omit(CDC_New)

Lets view the new structure

str(CDC_New1)

## 'data.frame':    204093 obs. of  13 variables:
##  $ STATE   : chr  "Alabama" "Alabama" "Alabama" "Alabama" ...
##  $ VETERAN3: num  2 1 2 2 2 2 1 2 2 2 ...
##  $ GENHLTH : num  3 3 3 4 3 2 3 3 3 4 ...
##  $ CHECKUP1: num  1 1 1 1 3 3 1 2 1 2 ...
##  $ CVDINFR4: num  1 2 2 2 2 2 2 2 2 2 ...
##  $ CVDCRHD4: num  2 2 2 2 2 2 2 2 2 2 ...
##  $ CVDSTRK3: num  2 2 2 2 2 2 2 2 1 2 ...
##  $ SEX     : num  2 1 2 2 2 2 1 1 1 2 ...
##  $ MARITAL : num  3 1 1 2 1 1 3 1 1 2 ...
##  $ EDUCA   : num  5 6 4 3 6 6 6 3 5 6 ...
##  $ INCOME2 : num  7 7 77 1 3 8 99 5 8 77 ...
##  $ WEIGHT2 : num  120 220 130 145 215 141 190 140 165 200 ...
##  $ SMOKDAY2: num  1 3 3 1 3 3 3 2 3 3 ...
##  - attr(*, "na.action")= 'omit' Named int  1 2 5 6 8 9 10 11 12 13 ...
##   ..- attr(*, "names")= chr  "1" "2" "5" "6" ...

Tidy the data using the gather function of tidyr package to transform the dataframe from wide to long format

long_CDC_New1 <- CDC_New1 %>% gather(CVD, PRESENT, CVDINFR4,CVDCRHD4,CVDSTRK3)
head(long_CDC_New1, 24)

##      STATE VETERAN3 GENHLTH CHECKUP1 SEX MARITAL EDUCA INCOME2 WEIGHT2
## 1  Alabama        2       3        1   2       3     5       7     120
## 2  Alabama        1       3        1   1       1     6       7     220
## 3  Alabama        2       3        1   2       1     4      77     130
## 4  Alabama        2       4        1   2       2     3       1     145
## 5  Alabama        2       3        3   2       1     6       3     215
## 6  Alabama        2       2        3   2       1     6       8     141
## 7  Alabama        1       3        1   1       3     6      99     190
## 8  Alabama        2       3        2   1       1     3       5     140
## 9  Alabama        2       3        1   1       1     5       8     165
## 10 Alabama        2       4        2   2       2     6      77     200
## 11 Alabama        2       2        1   2       3     5       6     120
## 12 Alabama        2       3        1   1       6     4       4     280
## 13 Alabama        2       3        1   1       1     4       5     255
## 14 Alabama        2       4        1   2       1     3       3     130
## 15 Alabama        2       3        1   1       2     5       2     170
## 16 Alabama        2       2        2   2       1     6       8     225
## 17 Alabama        2       1        3   2       1     4       8     140
## 18 Alabama        2       3        1   1       2     5       6     238
## 19 Alabama        2       1        1   2       3     5      77     148
## 20 Alabama        2       3        3   1       2     4       5     190
## 21 Alabama        2       5        1   1       2     5       5     240
## 22 Alabama        1       1        1   1       1     6      77     190
## 23 Alabama        2       3        1   1       9     3      99     155
## 24 Alabama        2       2        1   1       3     3       3     230
##    SMOKDAY2      CVD PRESENT
## 1         1 CVDINFR4       1
## 2         3 CVDINFR4       2
## 3         3 CVDINFR4       2
## 4         1 CVDINFR4       2
## 5         3 CVDINFR4       2
## 6         3 CVDINFR4       2
## 7         3 CVDINFR4       2
## 8         2 CVDINFR4       2
## 9         3 CVDINFR4       2
## 10        3 CVDINFR4       2
## 11        3 CVDINFR4       2
## 12        2 CVDINFR4       7
## 13        3 CVDINFR4       2
## 14        2 CVDINFR4       2
## 15        2 CVDINFR4       2
## 16        3 CVDINFR4       2
## 17        3 CVDINFR4       2
## 18        1 CVDINFR4       2
## 19        3 CVDINFR4       2
## 20        3 CVDINFR4       2
## 21        1 CVDINFR4       2
## 22        3 CVDINFR4       2
## 23        2 CVDINFR4       2
## 24        1 CVDINFR4       2

Vizualizing the CVD by state

ggplot() + geom_bar(aes(y = CVD, x = STATE,fill =CVD), data = long_CDC_New1,stat="identity")+
  theme(axis.text.x = element_text(angle = 90, hjust = 1))+
  labs( x="States", y="Cadiovascular Disease")

Write CSV in R

write.csv(long_CDC_New1, file = "CDC_BRFSS_Survey.csv")

Conclusion

I used a long format for my final dataframe where I compressed the 3 columns of cadiovascular disease (CVDINFR4,CVDCRHD4,CVDSTRK3) into one column cadiovascular diseases (CVD) where CVDINFR4 is heart attack also called a myocardial infarction, CVDCRHD4 is angina or coronary heart disease and CVDSTRK3 is a stroke. According to the survey and from the chart above, the data shows that Florida state has the highest incidence of cadiovascular disease followed by New York state while Virgin Island has the least reported incidence. This was a vey messy data with several levels of codes to clean and transform, for example, I had to transform the state from the provided codes. I chose thirteen (13) variables from the 275 variables because they makes more sense to me from the lot.

Data 2

Data for Sustainable Development Goals (SDGs) The SDGs have 17 goals of which the first is to “end poverty in all its forms everywhere”. The authors of the article with the source of this data combined country-specific historical estimates of the distribution of income, using Beta–Lorenz curves, with projections of population changes by age and education attainment level, as well as GDP projections to provide the first set of internally consistent poverty projections for all countries of the world. Making use of demographic and economic projections developed in the context of the Intergovernmental Panel on Climate Change’s Shared Socioeconomic Pathways (SSP), the authors created poverty paths by country up to the year 2030. Extreme poverty was defined as living on less than $1.90 a day, measured in 2011 Purchasing Power Parity prices (PPP). I intend to analyze and perform data mungling, data transformation and visualization with the results of their projections.

The data consist of the following scenrios:

SSP1 = low challenges for both climate change adaptation and mitigation resulting from income growth which does not rely heavily on natural resources and technological change, coupled with low fertility rate and high educational attainment.

SSP2 = benchmark scenario and assumes the continuation of current global socioeconomic trends at the global level.

SSP3 = low economic growth coupled with low educational attainment levels and high population growth at the global level are the main elements of the narrative,which is characterized by high mitigation and adaptation challenges.

SSP4 = narrative of worldwide polarization, with high income countries exhibiting relatively high growth rates of income, while developing economies present low levels of education, high fertility and economic stagnation.

SSP5 = high economic growth coupled with high demand for fossil energy from developing economies, thus increasing global Carbon dioxide emissions.

Importing the data from my GitHub

SDG <- read.csv("https://raw.githubusercontent.com/Emahayz/Data-607-Class/master/data_poverty_gdppc.csv", header = T, sep = ",")

Lets view the new structure

str(SDG)

## 'data.frame':    15980 obs. of  6 variables:
##  $ ccode       : Factor w/ 188 levels "AFG","AGO","ALB",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ year        : int  2015 2015 2015 2015 2015 2016 2016 2016 2016 2016 ...
##  $ ssp         : Factor w/ 5 levels "SSP1","SSP2",..: 1 2 3 4 5 1 2 3 4 5 ...
##  $ hc.source   : Factor w/ 2 levels "regression","survey": 1 1 1 1 1 1 1 1 1 1 ...
##  $ extreme.Poor: int  12698235 12698235 12698235 12698235 12698235 12888301 12888301 12888301 12888301 12888301 ...
##  $ gdp.capita  : num  1791 1791 1791 1791 1791 ...

head(SDG)

##   ccode year  ssp  hc.source extreme.Poor gdp.capita
## 1   AFG 2015 SSP1 regression     12698235    1790.51
## 2   AFG 2015 SSP2 regression     12698235    1790.51
## 3   AFG 2015 SSP3 regression     12698235    1790.51
## 4   AFG 2015 SSP4 regression     12698235    1790.51
## 5   AFG 2015 SSP5 regression     12698235    1790.51
## 6   AFG 2016 SSP1 regression     12888301    1780.16

Some Visualization- Scatter Plot

ggplot(SDG, aes(x=gdp.capita, y=extreme.Poor,shape=ssp, color=ssp)) +geom_point()+
      labs(
        title="Global Poverty Projection",
         x="Gross Domestic Product", y = "Extreme Poverty"
      )

The scatter plot shows that as National GDP decreases, Extreme Poverty Increase with more indication of scenario SSP3 to SSP5. This data seems to be presented in long format,I will Transform the data to wide format

SDGWide <- SDG %>% spread(ssp, extreme.Poor)

Cleaning the data by removing the missing values

SDGWide_New <- na.omit(SDGWide)

View the data again

head(SDGWide_New)

##   ccode year  hc.source gdp.capita     SSP1     SSP2     SSP3     SSP4
## 1   AFG 2015 regression    1790.51 12698235 12698235 12698235 12698235
## 2   AFG 2016 regression    1780.16 12888301 12888301 12888301 12888301
## 3   AFG 2017 regression    1790.68 12789616 12789616 12789616 12789616
## 4   AFG 2018 regression    1812.69 12532599 12532599 12532599 12532599
## 5   AFG 2019 regression    1845.41 12149180 12149180 12149180 12149180
## 6   AFG 2020 regression    1888.56 11666924 11666924 11666924 11666924
##       SSP5
## 1 12698235
## 2 12888301
## 3 12789616
## 4 12532599
## 5 12149180
## 6 11666924

The Average GDP per Capita is less than $20 Million ($19,106M)

summary(SDGWide_New$gdp.capita)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    429.8   3841.4  11977.2  19106.0  27007.5 138404.0

Histogram showing GDP per Capita

hist(SDGWide_New$gdp.capita)

Question: What percentage of these projections were made using Regression Model?

table(SDGWide_New$hc.source)

## 
## regression     survey 
##        185       1364

round(prop.table(table(SDGWide_New$hc.source)) *100,digit = 1)

## 
## regression     survey 
##       11.9       88.1

About 12% of the Extreme Poverty Projections were made using Regression model while 88% of the projections were survey based.

Show on a Map the Annual GDP per Capita of Nigeria (NGA) and Uganda (UGA) First I will need to subset this data and create a dataframe for the Nigeria and Uganda

SDGdata1 <- subset(SDGWide_New, SDGWide_New$ccode == "NGA" | SDGWide_New$ccode == "UGA")
SDGdata1

##      ccode year hc.source gdp.capita     SSP1     SSP2     SSP3     SSP4
## 6412   NGA 2015    survey    5638.93 70203439 70203439 70203439 70203439
## 6413   NGA 2016    survey    5409.98 76901167 76901167 76901167 76901167
## 6414   NGA 2017    survey    5317.09 80989339 80989339 80989339 80989339
## 6415   NGA 2018    survey    5282.29 83881602 83881602 83881602 83881602
## 6416   NGA 2019    survey    5247.83 86845070 86845070 86845070 86845070
## 6417   NGA 2020    survey    5211.18 89949154 89949154 89949154 89949154
## 6418   NGA 2021    survey    5176.03 93102799 93102799 93102799 93102799
## 6419   NGA 2022    survey    5142.19 96313552 96313552 96313552 96313552
## 9043   UGA 2015    survey    1928.27 12490438 12490438 12490438 12490438
## 9044   UGA 2016    survey    1953.63 12638075 12638075 12638075 12638075
## 9045   UGA 2017    survey    1985.99 12712471 12712471 12712471 12712471
## 9046   UGA 2018    survey    2033.55 12624153 12624153 12624153 12624153
## 9047   UGA 2019    survey    2090.75 12432557 12432557 12432557 12432557
## 9048   UGA 2020    survey    2155.42 12162143 12162143 12162143 12162143
## 9049   UGA 2021    survey    2234.61 11748949 11748949 11748949 11748949
## 9050   UGA 2022    survey    2341.04 11084169 11084169 11084169 11084169
##          SSP5
## 6412 70203439
## 6413 76901167
## 6414 80989339
## 6415 83881602
## 6416 86845070
## 6417 89949154
## 6418 93102799
## 6419 96313552
## 9043 12490438
## 9044 12638075
## 9045 12712471
## 9046 12624153
## 9047 12432557
## 9048 12162143
## 9049 11748949
## 9050 11084169

Some Visualization- Box Plot of Annual GDP of Nigeria and Uganda

ggplot(SDGdata1, aes(x=ccode, y=gdp.capita, color=ccode)) +
  geom_bar(stat="identity", fill="white")+facet_wrap(~year)+
  labs(
    title="Annual GDP of Nigeria and Uganda",
    x="Country Code", y = "GDP per Capita"
  )

Write CSV in R

write.csv(SDGdata1, file = "SDG of Nigeria_Uganda.csv")

Dataset 3

Banking Promotion for Term Deposit Importing the data from my GitHub

Bank <- read.csv("https://raw.githubusercontent.com/Emahayz/Data-607-Class/master/BankData.csv", header = T, sep = ",")

Lets view the structure and the headers

str(Bank)

## 'data.frame':    4521 obs. of  17 variables:
##  $ age         : int  30 33 35 30 59 35 36 39 41 43 ...
##  $ job         : Factor w/ 12 levels "admin.","blue-collar",..: 11 8 5 5 2 5 7 10 3 8 ...
##  $ marital     : Factor w/ 3 levels "divorced","married",..: 2 2 3 2 2 3 2 2 2 2 ...
##  $ education   : Factor w/ 4 levels "primary","secondary",..: 1 2 3 3 2 3 3 2 3 1 ...
##  $ default     : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ balance     : int  1787 4789 1350 1476 0 747 307 147 221 -88 ...
##  $ housing     : Factor w/ 2 levels "no","yes": 1 2 2 2 2 1 2 2 2 2 ...
##  $ loan        : Factor w/ 2 levels "no","yes": 1 2 1 2 1 1 1 1 1 2 ...
##  $ contact     : Factor w/ 3 levels "cellular","telephone",..: 1 1 1 3 3 1 1 1 3 1 ...
##  $ day         : int  19 11 16 3 5 23 14 6 14 17 ...
##  $ month       : Factor w/ 12 levels "apr","aug","dec",..: 11 9 1 7 9 4 9 9 9 1 ...
##  $ duration    : int  79 220 185 199 226 141 341 151 57 313 ...
##  $ campaign    : int  1 1 1 4 1 2 1 2 2 1 ...
##  $ pdays       : int  -1 339 330 -1 -1 176 330 -1 -1 147 ...
##  $ previous    : int  0 4 1 0 0 3 2 0 0 2 ...
##  $ poutcome    : Factor w/ 4 levels "failure","other",..: 4 1 1 4 4 1 2 4 4 1 ...
##  $ term_deposit: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...

head(Bank)

##   age         job marital education default balance housing loan  contact
## 1  30  unemployed married   primary      no    1787      no   no cellular
## 2  33    services married secondary      no    4789     yes  yes cellular
## 3  35  management  single  tertiary      no    1350     yes   no cellular
## 4  30  management married  tertiary      no    1476     yes  yes  unknown
## 5  59 blue-collar married secondary      no       0     yes   no  unknown
## 6  35  management  single  tertiary      no     747      no   no cellular
##   day month duration campaign pdays previous poutcome term_deposit
## 1  19   oct       79        1    -1        0  unknown           no
## 2  11   may      220        1   339        4  failure           no
## 3  16   apr      185        1   330        1  failure           no
## 4   3   jun      199        4    -1        0  unknown           no
## 5   5   may      226        1    -1        0  unknown           no
## 6  23   feb      141        2   176        3  failure           no

table(Bank$term_deposit)

## 
##   no  yes 
## 4000  521

round(prop.table(table(Bank$term_deposit))*100,digit = 1)

## 
##   no  yes 
## 88.5 11.5

4000 customers do not currently have term deposit, the number corresponds to 89% of the customers

Recoding the Banking crisis variable

Bank$term_deposit <- ifelse(Bank$term_deposit == "yes", 1, 0)
names(Bank)

##  [1] "age"          "job"          "marital"      "education"   
##  [5] "default"      "balance"      "housing"      "loan"        
##  [9] "contact"      "day"          "month"        "duration"    
## [13] "campaign"     "pdays"        "previous"     "poutcome"    
## [17] "term_deposit"

str(Bank)

## 'data.frame':    4521 obs. of  17 variables:
##  $ age         : int  30 33 35 30 59 35 36 39 41 43 ...
##  $ job         : Factor w/ 12 levels "admin.","blue-collar",..: 11 8 5 5 2 5 7 10 3 8 ...
##  $ marital     : Factor w/ 3 levels "divorced","married",..: 2 2 3 2 2 3 2 2 2 2 ...
##  $ education   : Factor w/ 4 levels "primary","secondary",..: 1 2 3 3 2 3 3 2 3 1 ...
##  $ default     : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ balance     : int  1787 4789 1350 1476 0 747 307 147 221 -88 ...
##  $ housing     : Factor w/ 2 levels "no","yes": 1 2 2 2 2 1 2 2 2 2 ...
##  $ loan        : Factor w/ 2 levels "no","yes": 1 2 1 2 1 1 1 1 1 2 ...
##  $ contact     : Factor w/ 3 levels "cellular","telephone",..: 1 1 1 3 3 1 1 1 3 1 ...
##  $ day         : int  19 11 16 3 5 23 14 6 14 17 ...
##  $ month       : Factor w/ 12 levels "apr","aug","dec",..: 11 9 1 7 9 4 9 9 9 1 ...
##  $ duration    : int  79 220 185 199 226 141 341 151 57 313 ...
##  $ campaign    : int  1 1 1 4 1 2 1 2 2 1 ...
##  $ pdays       : int  -1 339 330 -1 -1 176 330 -1 -1 147 ...
##  $ previous    : int  0 4 1 0 0 3 2 0 0 2 ...
##  $ poutcome    : Factor w/ 4 levels "failure","other",..: 4 1 1 4 4 1 2 4 4 1 ...
##  $ term_deposit: num  0 0 0 0 0 0 0 0 0 0 ...

Visualization

ggplot(data=Bank, aes(Bank$age)) + geom_histogram(position="identity", alpha=0.5)+
  labs(title="Age of Customers",x="Age", y = "Frequency")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Transform the data to wide format

Bank_Wide <- Bank %>% spread(marital, balance)

View the data again

head(Bank_Wide)

##   age         job education default housing loan  contact day month
## 1  30  unemployed   primary      no      no   no cellular  19   oct
## 2  33    services secondary      no     yes  yes cellular  11   may
## 3  35  management  tertiary      no     yes   no cellular  16   apr
## 4  30  management  tertiary      no     yes  yes  unknown   3   jun
## 5  59 blue-collar secondary      no     yes   no  unknown   5   may
## 6  35  management  tertiary      no      no   no cellular  23   feb
##   duration campaign pdays previous poutcome term_deposit divorced married
## 1       79        1    -1        0  unknown            0       NA    1787
## 2      220        1   339        4  failure            0       NA    4789
## 3      185        1   330        1  failure            0       NA      NA
## 4      199        4    -1        0  unknown            0       NA    1476
## 5      226        1    -1        0  unknown            0       NA       0
## 6      141        2   176        3  failure            0       NA      NA
##   single
## 1     NA
## 2     NA
## 3   1350
## 4     NA
## 5     NA
## 6    747

Write CSV in R

write.csv(Bank_Wide, file = "Banking_Promotion.csv")

Data607_Project2

Emmanuel Hayble-Gomes

10/06/2019

Data 1

Transforming the state variable to factor

Conclusion

Data 2

Dataset 3