Tidyverse Assignment

Task

Create an Example. Using one or more TidyVerse packages, and any dataset from fivethirtyeight.com or Kaggle, create a programming sample “vignette” that demonstrates how to use one or more of the capabilities of the selected TidyVerse package with your selected dataset. (25 points)

Extend an Existing Example. Using one of your classmate’s examples (as created above), extend his or her example with additional annotated code. (15 points)

The dataset: The data for this project come from Kaggle, via the link below: https://www.kaggle.com/theworldbank/sustainable-development-goals

I’m familiar with this dataset. I intend to analyze it and perform data munging, data transformation, and visualization with the Tidyverse packages.

The data consist of the following scenarios:

SSP1 = low challenges for both climate change adaptation and mitigation, resulting from income growth that does not rely heavily on natural resources, and from technological change, coupled with a low fertility rate and high educational attainment.

SSP2 = the benchmark scenario; it assumes the continuation of current socioeconomic trends at the global level.

SSP3 = low economic growth coupled with low educational attainment and high population growth at the global level are the main elements of the narrative, which is characterized by high mitigation and adaptation challenges.

SSP4 = a narrative of worldwide polarization, with high-income countries exhibiting relatively high rates of income growth, while developing economies present low levels of education, high fertility, and economic stagnation.

SSP5 = high economic growth coupled with high demand for fossil energy from developing economies, thus increasing global carbon dioxide emissions.

Load the Tidyverse Package

library(tidyverse)
## -- Attaching packages ------------------------------------------------------------------------ tidyverse 1.2.1 --
## v ggplot2 3.2.1     v purrr   0.3.2
## v tibble  2.1.3     v dplyr   0.8.3
## v tidyr   0.8.3     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0
## -- Conflicts --------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Importing the data from my GitHub repository

SDG <- read.csv("https://raw.githubusercontent.com/Emahayz/Data-607-Class/master/data_poverty_gdppc.csv", header = T, sep = ",")
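
Since this vignette focuses on the Tidyverse, the same import could also be done with readr’s read_csv(), which returns a tibble and parses text columns as character rather than factor. A sketch, using the hypothetical name SDG_tbl; the rest of this analysis assumes the read.csv() version above, whose factor columns the str() output reflects:

# Alternative import using readr (part of the tidyverse).
# read_csv() yields character columns where read.csv() yields factors,
# so downstream output would differ slightly.
SDG_tbl <- readr::read_csv("https://raw.githubusercontent.com/Emahayz/Data-607-Class/master/data_poverty_gdppc.csv")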

Let’s view the structure of the data

str(SDG)
## 'data.frame':    15980 obs. of  6 variables:
##  $ ccode       : Factor w/ 188 levels "AFG","AGO","ALB",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ year        : int  2015 2015 2015 2015 2015 2016 2016 2016 2016 2016 ...
##  $ ssp         : Factor w/ 5 levels "SSP1","SSP2",..: 1 2 3 4 5 1 2 3 4 5 ...
##  $ hc.source   : Factor w/ 2 levels "regression","survey": 1 1 1 1 1 1 1 1 1 1 ...
##  $ extreme.Poor: int  12698235 12698235 12698235 12698235 12698235 12888301 12888301 12888301 12888301 12888301 ...
##  $ gdp.capita  : num  1791 1791 1791 1791 1791 ...
head(SDG)
##   ccode year  ssp  hc.source extreme.Poor gdp.capita
## 1   AFG 2015 SSP1 regression     12698235    1790.51
## 2   AFG 2015 SSP2 regression     12698235    1790.51
## 3   AFG 2015 SSP3 regression     12698235    1790.51
## 4   AFG 2015 SSP4 regression     12698235    1790.51
## 5   AFG 2015 SSP5 regression     12698235    1790.51
## 6   AFG 2016 SSP1 regression     12888301    1780.16

Some Visualization: Scatter Plot

ggplot(SDG, aes(x = gdp.capita, y = extreme.Poor, shape = ssp, color = ssp)) +
  geom_point() +
  labs(
    title = "Global Poverty Projection",
    x = "GDP per Capita", y = "Extreme Poverty"
  )

The scatter plot shows that as GDP per capita decreases, extreme poverty increases, most visibly under scenarios SSP3 through SSP5.

This data is presented in long format, with one row per country, year, and scenario. I will transform it to wide format, with one column per SSP scenario.

SDGWide <- SDG %>% spread(ssp, extreme.Poor)
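
In tidyr 1.0.0 and later, spread() is superseded by pivot_wider(). An equivalent call would be the following sketch, assuming a newer tidyr than the 0.8.3 loaded above:

# pivot_wider() is the modern replacement for spread():
# one column per SSP scenario, values taken from extreme.Poor
SDGWide <- SDG %>% pivot_wider(names_from = ssp, values_from = extreme.Poor)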

Cleaning the data by removing rows with missing values

SDGWide_New <- na.omit(SDGWide)
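
The same step can stay within the tidyverse pipeline using tidyr’s drop_na(), which removes any row containing an NA. A sketch; it gives the same result as na.omit() here:

# Tidyverse equivalent of na.omit(): drop rows with any missing value
SDGWide_New <- SDGWide %>% drop_na()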

View the data again

head(SDGWide_New)
##   ccode year  hc.source gdp.capita     SSP1     SSP2     SSP3     SSP4
## 1   AFG 2015 regression    1790.51 12698235 12698235 12698235 12698235
## 2   AFG 2016 regression    1780.16 12888301 12888301 12888301 12888301
## 3   AFG 2017 regression    1790.68 12789616 12789616 12789616 12789616
## 4   AFG 2018 regression    1812.69 12532599 12532599 12532599 12532599
## 5   AFG 2019 regression    1845.41 12149180 12149180 12149180 12149180
## 6   AFG 2020 regression    1888.56 11666924 11666924 11666924 11666924
##       SSP5
## 1 12698235
## 2 12888301
## 3 12789616
## 4 12532599
## 5 12149180
## 6 11666924

The average GDP per Capita is just under $20,000 ($19,106)

summary(SDGWide_New$gdp.capita)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    429.8   3841.4  11977.2  19106.0  27007.5 138404.0
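
A tidyverse-style version of the same summary uses dplyr’s summarise(); a sketch, with the statistic names chosen for readability:

# dplyr equivalent of summary() for a single column
SDGWide_New %>%
  summarise(
    min    = min(gdp.capita),
    median = median(gdp.capita),
    mean   = mean(gdp.capita),
    max    = max(gdp.capita)
  )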

Histogram showing GDP per Capita

hist(SDGWide_New$gdp.capita)
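
hist() is base R; a ggplot2 version of the same histogram keeps the vignette within the tidyverse. A sketch, with a binwidth of 5000 chosen by eye for this data’s range:

# ggplot2 equivalent of the base-R histogram above
ggplot(SDGWide_New, aes(x = gdp.capita)) +
  geom_histogram(binwidth = 5000) +
  labs(title = "GDP per Capita", x = "GDP per Capita", y = "Count")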

Question: What percentage of these projections were made using a regression model?

table(SDGWide_New$hc.source)
## 
## regression     survey 
##        185       1364
round(prop.table(table(SDGWide_New$hc.source)) * 100, digits = 1)
## 
## regression     survey 
##       11.9       88.1

About 12% of the extreme poverty projections were made using a regression model, while 88% of the projections were survey-based.
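
The same breakdown can be computed with dplyr’s count() and mutate(); a sketch:

# dplyr equivalent of table() + prop.table()
SDGWide_New %>%
  count(hc.source) %>%
  mutate(pct = round(100 * n / sum(n), 1))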

Plot the annual GDP per capita of Nigeria (NGA) and Uganda (UGA). First, I need to subset the data and create a data frame containing only Nigeria and Uganda.

SDGdata1 <- subset(SDGWide_New, SDGWide_New$ccode == "NGA" | SDGWide_New$ccode == "UGA")
SDGdata1
##      ccode year hc.source gdp.capita     SSP1     SSP2     SSP3     SSP4
## 6412   NGA 2015    survey    5638.93 70203439 70203439 70203439 70203439
## 6413   NGA 2016    survey    5409.98 76901167 76901167 76901167 76901167
## 6414   NGA 2017    survey    5317.09 80989339 80989339 80989339 80989339
## 6415   NGA 2018    survey    5282.29 83881602 83881602 83881602 83881602
## 6416   NGA 2019    survey    5247.83 86845070 86845070 86845070 86845070
## 6417   NGA 2020    survey    5211.18 89949154 89949154 89949154 89949154
## 6418   NGA 2021    survey    5176.03 93102799 93102799 93102799 93102799
## 6419   NGA 2022    survey    5142.19 96313552 96313552 96313552 96313552
## 9043   UGA 2015    survey    1928.27 12490438 12490438 12490438 12490438
## 9044   UGA 2016    survey    1953.63 12638075 12638075 12638075 12638075
## 9045   UGA 2017    survey    1985.99 12712471 12712471 12712471 12712471
## 9046   UGA 2018    survey    2033.55 12624153 12624153 12624153 12624153
## 9047   UGA 2019    survey    2090.75 12432557 12432557 12432557 12432557
## 9048   UGA 2020    survey    2155.42 12162143 12162143 12162143 12162143
## 9049   UGA 2021    survey    2234.61 11748949 11748949 11748949 11748949
## 9050   UGA 2022    survey    2341.04 11084169 11084169 11084169 11084169
##          SSP5
## 6412 70203439
## 6413 76901167
## 6414 80989339
## 6415 83881602
## 6416 86845070
## 6417 89949154
## 6418 93102799
## 6419 96313552
## 9043 12490438
## 9044 12638075
## 9045 12712471
## 9046 12624153
## 9047 12432557
## 9048 12162143
## 9049 11748949
## 9050 11084169
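
subset() is base R; for reference, the dplyr equivalent uses filter() with %in%. A sketch:

# dplyr equivalent of the base-R subset() call above
SDGdata1 <- SDGWide_New %>% filter(ccode %in% c("NGA", "UGA"))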

Some Visualization: Bar Plot of the Annual GDP per Capita of Nigeria and Uganda

ggplot(SDGdata1, aes(x = ccode, y = gdp.capita, color = ccode)) +
  geom_bar(stat = "identity", fill = "white") +
  facet_wrap(~year) +
  labs(
    title = "Annual GDP per Capita of Nigeria and Uganda",
    x = "Country Code", y = "GDP per Capita"
  )
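
Design note: geom_bar(stat = "identity") draws bars whose heights come directly from a column, which is exactly what geom_col() does. The more idiomatic form would be this sketch:

# geom_col() is shorthand for geom_bar(stat = "identity")
ggplot(SDGdata1, aes(x = ccode, y = gdp.capita, color = ccode)) +
  geom_col(fill = "white") +
  facet_wrap(~year)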

Conclusion

The largest differences in poverty headcounts and poverty rates across scenarios appear in Sub-Saharan Africa, where even the projections for the most optimistic scenario imply over 300 million individuals living in extreme poverty in 2030. The analysis indicates that about 647 million people live in extreme poverty. This implies that the bulk of the poverty-reduction challenge will be in Africa, where progress is expected to be slow.

Extending Amber Ferger’s Example

dat <- as_tibble(read.csv('https://raw.githubusercontent.com/amberferger/DATA607_Masculinity/master/raw-responses.csv'))
str(dat)
## Classes 'tbl_df', 'tbl' and 'data.frame':    1615 obs. of  98 variables:
##  $ X          : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ StartDate  : Factor w/ 1378 levels "5/10/18 10:04",..: 3 4 5 6 7 8 9 1 2 19 ...
##  $ EndDate    : Factor w/ 1377 levels "5/10/18 10:11",..: 3 4 5 6 7 8 9 1 2 20 ...
##  $ q0001      : Factor w/ 5 levels "No answer","Not at all masculine",..: 4 4 5 5 5 5 4 4 5 4 ...
##  $ q0002      : Factor w/ 5 levels "No answer","Not at all important",..: 4 4 3 3 5 4 3 4 2 4 ...
##  $ q0004_0001 : Factor w/ 2 levels "Father or father figure(s)",..: 2 1 1 1 2 1 1 1 1 1 ...
##  $ q0004_0002 : Factor w/ 2 levels "Mother or mother figure(s)",..: 2 2 2 1 2 2 1 2 2 2 ...
##  $ q0004_0003 : Factor w/ 2 levels "Not selected",..: 1 1 1 2 2 1 2 1 1 1 ...
##  $ q0004_0004 : Factor w/ 2 levels "Not selected",..: 2 1 1 1 1 1 1 2 1 2 ...
##  $ q0004_0005 : Factor w/ 2 levels "Friends","Not selected": 2 2 2 2 2 2 1 1 1 2 ...
##  $ q0004_0006 : Factor w/ 2 levels "Not selected",..: 1 1 2 1 1 1 1 1 1 2 ...
##  $ q0005      : Factor w/ 3 levels "No","No answer",..: 3 3 1 1 3 3 1 3 1 1 ...
##  $ q0007_0001 : Factor w/ 6 levels "Never, and not open to it",..: 4 5 6 5 6 2 6 5 6 6 ...
##  $ q0007_0002 : Factor w/ 6 levels "Never, and not open to it",..: 4 6 6 5 5 6 6 5 6 5 ...
##  $ q0007_0003 : Factor w/ 6 levels "Never, and not open to it",..: 4 2 6 6 1 6 1 1 6 1 ...
##  $ q0007_0004 : Factor w/ 6 levels "Never, and not open to it",..: 4 5 5 5 2 5 5 2 6 5 ...
##  $ q0007_0005 : Factor w/ 6 levels "Never, and not open to it",..: 1 1 2 5 2 2 5 1 2 1 ...
##  $ q0007_0006 : Factor w/ 6 levels "Never, and not open to it",..: 1 5 4 4 6 4 5 4 4 4 ...
##  $ q0007_0007 : Factor w/ 6 levels "Never, and not open to it",..: 4 1 1 1 1 1 1 1 1 1 ...
##  $ q0007_0008 : Factor w/ 6 levels "Never, and not open to it",..: 6 4 5 1 4 6 6 6 4 4 ...
##  $ q0007_0009 : Factor w/ 6 levels "Never, and not open to it",..: 6 1 6 5 5 4 6 6 6 5 ...
##  $ q0007_0010 : Factor w/ 6 levels "Never, and not open to it",..: 1 6 5 1 2 2 1 2 1 2 ...
##  $ q0007_0011 : Factor w/ 6 levels "Never, and not open to it",..: 4 3 1 1 6 1 5 5 5 6 ...
##  $ q0008_0001 : Factor w/ 2 levels "Not selected",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ q0008_0002 : Factor w/ 2 levels "Not selected",..: 1 2 1 1 2 1 2 2 2 2 ...
##  $ q0008_0003 : Factor w/ 2 levels "Not selected",..: 2 1 1 1 1 1 1 1 2 1 ...
##  $ q0008_0004 : Factor w/ 2 levels "Not selected",..: 1 1 1 1 1 1 1 2 1 2 ...
##  $ q0008_0005 : Factor w/ 2 levels "Appearance of your genitalia",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ q0008_0006 : Factor w/ 2 levels "Not selected",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ q0008_0007 : Factor w/ 2 levels "Not selected",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ q0008_0008 : Factor w/ 2 levels "Not selected",..: 1 2 1 1 1 1 1 1 1 1 ...
##  $ q0008_0009 : Factor w/ 2 levels "Not selected",..: 2 2 2 1 1 1 1 1 2 2 ...
##  $ q0008_0010 : Factor w/ 2 levels "Not selected",..: 2 1 1 1 1 1 2 1 2 1 ...
##  $ q0008_0011 : Factor w/ 2 levels "Not selected",..: 1 1 1 1 1 1 2 1 2 1 ...
##  $ q0008_0012 : Factor w/ 2 levels "None of the above",..: 2 2 2 1 2 1 2 2 2 2 ...
##  $ q0009      : Factor w/ 7 levels "Employed, working full-time",..: 6 4 1 4 1 1 1 4 1 1 ...
##  $ q0010_0001 : Factor w/ 2 levels "Men make more money",..: NA NA 2 NA 2 2 2 NA 2 2 ...
##  $ q0010_0002 : Factor w/ 2 levels "Men are taken more seriously",..: NA NA 2 NA 2 2 2 NA 2 1 ...
##  $ q0010_0003 : Factor w/ 2 levels "Men have more choice",..: NA NA 2 NA 2 2 2 NA 2 2 ...
##  $ q0010_0004 : Factor w/ 2 levels "Men have more promotion/professional development opportunities",..: NA NA 2 NA 2 2 2 NA 2 1 ...
##  $ q0010_0005 : Factor w/ 2 levels "Men are explicitly praised more often",..: NA NA 2 NA 2 2 2 NA 2 2 ...
##  $ q0010_0006 : Factor w/ 2 levels "Men generally have more support from their managers",..: NA NA 2 NA 2 2 2 NA 2 1 ...
##  $ q0010_0007 : Factor w/ 2 levels "None of the above",..: NA NA 1 NA 1 2 1 NA 1 2 ...
##  $ q0010_0008 : Factor w/ 2 levels "Not selected",..: NA NA 1 NA 1 2 1 NA 1 1 ...
##  $ q0011_0001 : Factor w/ 2 levels "Managers want to hire and promote women",..: NA NA 1 NA 2 2 2 NA 2 2 ...
##  $ q0011_0002 : Factor w/ 2 levels "Greater risk of being accused of sexual harassment",..: NA NA 2 NA 1 2 2 NA 2 1 ...
##  $ q0011_0003 : Factor w/ 2 levels "Greater risk of being accused of being sexist or racist",..: NA NA 2 NA 1 1 2 NA 2 2 ...
##  $ q0011_0004 : Factor w/ 2 levels "None of the above",..: NA NA 2 NA 2 2 1 NA 1 2 ...
##  $ q0011_0005 : Factor w/ 2 levels "Not selected",..: NA NA 1 NA 1 1 1 NA 1 1 ...
##  $ q0012_0001 : Factor w/ 2 levels "Confronted the accused person",..: NA NA 2 NA 2 1 2 NA 2 2 ...
##  $ q0012_0002 : Factor w/ 2 levels "Contacted the HR department",..: NA NA 2 NA 2 2 2 NA 2 2 ...
##  $ q0012_0003 : Factor w/ 2 levels "Contacted the manager of the accused person",..: NA NA 2 NA 2 2 2 NA 2 2 ...
##  $ q0012_0004 : Factor w/ 2 levels "Not selected",..: NA NA 1 NA 1 1 2 NA 1 1 ...
##  $ q0012_0005 : Factor w/ 2 levels "Did not respond at all",..: NA NA 2 NA 2 2 2 NA 2 2 ...
##  $ q0012_0006 : Factor w/ 2 levels "Never witnessed sexual harassment",..: NA NA 2 NA 1 2 2 NA 1 1 ...
##  $ q0012_0007 : Factor w/ 2 levels "Not selected",..: NA NA 2 NA 1 1 1 NA 1 1 ...
##  $ q0013      : Factor w/ 6 levels "No answer","Other (please specify)",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ q0014      : Factor w/ 5 levels "A lot","No answer",..: NA NA 1 NA 1 4 3 NA 5 5 ...
##  $ q0015      : Factor w/ 3 levels "No","No answer",..: NA NA 1 NA 3 1 NA NA 1 1 ...
##  $ q0017      : Factor w/ 3 levels "No","No answer",..: 3 1 3 3 1 1 3 3 1 3 ...
##  $ q0018      : Factor w/ 6 levels "Always","Never",..: 6 5 6 1 1 1 6 4 1 1 ...
##  $ q0019_0001 : Factor w/ 2 levels "It?s the right thing to do",..: NA NA NA 1 2 1 NA 2 1 1 ...
##  $ q0019_0002 : Factor w/ 2 levels "Not selected",..: NA NA NA 1 1 1 NA 1 1 1 ...
##  $ q0019_0003 : Factor w/ 2 levels "Not selected",..: NA NA NA 1 1 1 NA 1 2 2 ...
##  $ q0019_0004 : Factor w/ 2 levels "Not selected",..: NA NA NA 1 2 1 NA 2 1 2 ...
##  $ q0019_0005 : Factor w/ 2 levels "Not selected",..: NA NA NA 2 1 1 NA 2 2 2 ...
##  $ q0019_0006 : Factor w/ 2 levels "Not selected",..: NA NA NA 1 1 1 NA 2 1 1 ...
##  $ q0019_0007 : Factor w/ 2 levels "Not selected",..: NA NA NA 1 1 1 NA 1 1 1 ...
##  $ q0020_0001 : Factor w/ 2 levels "Not selected",..: 2 1 1 1 1 1 2 2 2 1 ...
##  $ q0020_0002 : Factor w/ 2 levels "Ask for a verbal confirmation of consent",..: 1 2 2 2 1 1 1 2 2 1 ...
##  $ q0020_0003 : Factor w/ 2 levels "Make a physical move to see how they react",..: 1 2 2 2 2 2 1 2 1 1 ...
##  $ q0020_0004 : Factor w/ 2 levels "Every situation is different",..: 1 2 1 2 2 2 1 1 1 1 ...
##  $ q0020_0005 : Factor w/ 2 levels "It isn?t always clear how to gauge someone?s interest",..: 1 2 2 2 2 2 2 1 2 2 ...
##  $ q0020_0006 : Factor w/ 2 levels "Not selected",..: 1 2 1 1 1 1 1 1 1 1 ...
##  $ q0021_0001 : Factor w/ 2 levels "Not selected",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ q0021_0002 : Factor w/ 2 levels "Not selected",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ q0021_0003 : Factor w/ 2 levels "Contacted a past sexual partner to ask whether you went too far in any of you sexual encounters.",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ q0021_0004 : Factor w/ 2 levels "None of the above",..: 1 1 1 2 1 1 1 1 1 1 ...
##  $ q0022      : Factor w/ 3 levels "No","No answer",..: 1 1 1 2 1 1 1 3 1 1 ...
##  $ q0024      : Factor w/ 6 levels "Divorced","Married",..: 3 6 2 2 3 2 3 2 2 2 ...
##  $ q0025_0001 : Factor w/ 2 levels "Not selected",..: 1 1 1 1 1 1 1 1 2 1 ...
##  $ q0025_0002 : Factor w/ 2 levels "Not selected",..: 1 2 2 2 1 2 2 1 1 1 ...
##  $ q0025_0003 : Factor w/ 2 levels "No children",..: 1 2 2 2 1 2 2 1 2 1 ...
##  $ q0026      : Factor w/ 5 levels "Bisexual","Gay",..: 2 5 5 3 5 5 2 5 5 5 ...
##  $ q0028      : Factor w/ 5 levels "Asian","Black",..: 3 5 5 5 5 5 4 5 3 5 ...
##  $ q0029      : Factor w/ 6 levels "Associate's degree",..: 2 6 2 6 2 5 5 2 6 5 ...
##  $ q0030      : Factor w/ 51 levels "Alabama","Alaska",..: 33 36 23 15 36 15 12 33 5 38 ...
##  $ q0034      : Factor w/ 11 levels "$0-$9,999","$10,000-$24,999",..: 1 9 9 9 9 7 8 5 3 5 ...
##  $ q0035      : Factor w/ 9 levels "East North Central",..: 3 1 1 1 1 1 8 3 6 6 ...
##  $ q0036      : Factor w/ 5 levels "Android Phone / Tablet",..: 5 2 5 5 5 5 5 5 2 2 ...
##  $ race2      : Factor w/ 2 levels "Non-white","White": 1 2 2 2 2 2 1 2 1 2 ...
##  $ racethn4   : Factor w/ 4 levels "Black","Hispanic",..: 2 4 4 4 4 4 3 4 2 4 ...
##  $ educ3      : Factor w/ 3 levels "College or more",..: 1 3 1 3 1 1 1 1 3 1 ...
##  $ educ4      : Factor w/ 4 levels "College or more",..: 1 4 1 4 1 3 3 1 4 3 ...
##  $ age3       : Factor w/ 3 levels "18 - 34","35 - 64",..: 2 3 2 3 2 3 1 3 2 2 ...
##  $ kids       : Factor w/ 2 levels "Has children",..: 2 1 1 1 2 1 1 2 1 2 ...
##  $ orientation: Factor w/ 4 levels "Gay/Bisexual",..: 1 4 4 2 4 4 1 4 4 4 ...
##  $ weight     : num  1.714 1.247 0.516 0.601 1.033 ...
head(dat)
## # A tibble: 6 x 98
##       X StartDate EndDate q0001 q0002 q0004_0001 q0004_0002 q0004_0003
##   <int> <fct>     <fct>   <fct> <fct> <fct>      <fct>      <fct>     
## 1     1 5/10/18 ~ 5/10/1~ Some~ Some~ Not selec~ Not selec~ Not selec~
## 2     2 5/10/18 ~ 5/10/1~ Some~ Some~ Father or~ Not selec~ Not selec~
## 3     3 5/10/18 ~ 5/10/1~ Very~ Not ~ Father or~ Not selec~ Not selec~
## 4     4 5/10/18 ~ 5/10/1~ Very~ Not ~ Father or~ Mother or~ Other fam~
## 5     5 5/10/18 ~ 5/10/1~ Very~ Very~ Not selec~ Not selec~ Other fam~
## 6     6 5/10/18 ~ 5/10/1~ Very~ Some~ Father or~ Not selec~ Not selec~
## # ... with 90 more variables: q0004_0004 <fct>, q0004_0005 <fct>,
## #   q0004_0006 <fct>, q0005 <fct>, q0007_0001 <fct>, q0007_0002 <fct>,
## #   q0007_0003 <fct>, q0007_0004 <fct>, q0007_0005 <fct>,
## #   q0007_0006 <fct>, q0007_0007 <fct>, q0007_0008 <fct>,
## #   q0007_0009 <fct>, q0007_0010 <fct>, q0007_0011 <fct>,
## #   q0008_0001 <fct>, q0008_0002 <fct>, q0008_0003 <fct>,
## #   q0008_0004 <fct>, q0008_0005 <fct>, q0008_0006 <fct>,
## #   q0008_0007 <fct>, q0008_0008 <fct>, q0008_0009 <fct>,
## #   q0008_0010 <fct>, q0008_0011 <fct>, q0008_0012 <fct>, q0009 <fct>,
## #   q0010_0001 <fct>, q0010_0002 <fct>, q0010_0003 <fct>,
## #   q0010_0004 <fct>, q0010_0005 <fct>, q0010_0006 <fct>,
## #   q0010_0007 <fct>, q0010_0008 <fct>, q0011_0001 <fct>,
## #   q0011_0002 <fct>, q0011_0003 <fct>, q0011_0004 <fct>,
## #   q0011_0005 <fct>, q0012_0001 <fct>, q0012_0002 <fct>,
## #   q0012_0003 <fct>, q0012_0004 <fct>, q0012_0005 <fct>,
## #   q0012_0006 <fct>, q0012_0007 <fct>, q0013 <fct>, q0014 <fct>,
## #   q0015 <fct>, q0017 <fct>, q0018 <fct>, q0019_0001 <fct>,
## #   q0019_0002 <fct>, q0019_0003 <fct>, q0019_0004 <fct>,
## #   q0019_0005 <fct>, q0019_0006 <fct>, q0019_0007 <fct>,
## #   q0020_0001 <fct>, q0020_0002 <fct>, q0020_0003 <fct>,
## #   q0020_0004 <fct>, q0020_0005 <fct>, q0020_0006 <fct>,
## #   q0021_0001 <fct>, q0021_0002 <fct>, q0021_0003 <fct>,
## #   q0021_0004 <fct>, q0022 <fct>, q0024 <fct>, q0025_0001 <fct>,
## #   q0025_0002 <fct>, q0025_0003 <fct>, q0026 <fct>, q0028 <fct>,
## #   q0029 <fct>, q0030 <fct>, q0034 <fct>, q0035 <fct>, q0036 <fct>,
## #   race2 <fct>, racethn4 <fct>, educ3 <fct>, educ4 <fct>, age3 <fct>,
## #   kids <fct>, orientation <fct>, weight <dbl>

This data has several missing values. I will use the drop_na() function from tidyr (part of the Tidyverse) to exclude rows with missing values.

dat %>% drop_na()
## # A tibble: 36 x 98
##        X StartDate EndDate q0001 q0002 q0004_0001 q0004_0002 q0004_0003
##    <int> <fct>     <fct>   <fct> <fct> <fct>      <fct>      <fct>     
##  1    11 5/11/18 ~ 5/11/1~ Very~ Some~ Father or~ Mother or~ Other fam~
##  2    22 5/11/18 ~ 5/11/1~ Some~ Some~ Not selec~ Not selec~ Other fam~
##  3    55 5/11/18 ~ 5/11/1~ Some~ Not ~ Not selec~ Not selec~ Not selec~
##  4   120 5/13/18 ~ 5/13/1~ Some~ Not ~ Not selec~ Not selec~ Not selec~
##  5   127 5/13/18 ~ 5/13/1~ Some~ Very~ Not selec~ Not selec~ Not selec~
##  6   150 5/14/18 ~ 5/14/1~ Very~ Some~ Father or~ Not selec~ Other fam~
##  7   155 5/14/18 ~ 5/14/1~ Very~ Some~ Father or~ Mother or~ Other fam~
##  8   177 5/14/18 ~ 5/14/1~ Some~ Some~ Father or~ Not selec~ Not selec~
##  9   267 5/15/18 ~ 5/15/1~ Not ~ Not ~ Father or~ Mother or~ Not selec~
## 10   450 5/16/18 ~ 5/16/1~ Some~ Some~ Father or~ Not selec~ Not selec~
## # ... with 26 more rows, and 90 more variables: q0004_0004 <fct>,
## #   q0004_0005 <fct>, q0004_0006 <fct>, q0005 <fct>, q0007_0001 <fct>,
## #   q0007_0002 <fct>, q0007_0003 <fct>, q0007_0004 <fct>,
## #   q0007_0005 <fct>, q0007_0006 <fct>, q0007_0007 <fct>,
## #   q0007_0008 <fct>, q0007_0009 <fct>, q0007_0010 <fct>,
## #   q0007_0011 <fct>, q0008_0001 <fct>, q0008_0002 <fct>,
## #   q0008_0003 <fct>, q0008_0004 <fct>, q0008_0005 <fct>,
## #   q0008_0006 <fct>, q0008_0007 <fct>, q0008_0008 <fct>,
## #   q0008_0009 <fct>, q0008_0010 <fct>, q0008_0011 <fct>,
## #   q0008_0012 <fct>, q0009 <fct>, q0010_0001 <fct>, q0010_0002 <fct>,
## #   q0010_0003 <fct>, q0010_0004 <fct>, q0010_0005 <fct>,
## #   q0010_0006 <fct>, q0010_0007 <fct>, q0010_0008 <fct>,
## #   q0011_0001 <fct>, q0011_0002 <fct>, q0011_0003 <fct>,
## #   q0011_0004 <fct>, q0011_0005 <fct>, q0012_0001 <fct>,
## #   q0012_0002 <fct>, q0012_0003 <fct>, q0012_0004 <fct>,
## #   q0012_0005 <fct>, q0012_0006 <fct>, q0012_0007 <fct>, q0013 <fct>,
## #   q0014 <fct>, q0015 <fct>, q0017 <fct>, q0018 <fct>, q0019_0001 <fct>,
## #   q0019_0002 <fct>, q0019_0003 <fct>, q0019_0004 <fct>,
## #   q0019_0005 <fct>, q0019_0006 <fct>, q0019_0007 <fct>,
## #   q0020_0001 <fct>, q0020_0002 <fct>, q0020_0003 <fct>,
## #   q0020_0004 <fct>, q0020_0005 <fct>, q0020_0006 <fct>,
## #   q0021_0001 <fct>, q0021_0002 <fct>, q0021_0003 <fct>,
## #   q0021_0004 <fct>, q0022 <fct>, q0024 <fct>, q0025_0001 <fct>,
## #   q0025_0002 <fct>, q0025_0003 <fct>, q0026 <fct>, q0028 <fct>,
## #   q0029 <fct>, q0030 <fct>, q0034 <fct>, q0035 <fct>, q0036 <fct>,
## #   race2 <fct>, racethn4 <fct>, educ3 <fct>, educ4 <fct>, age3 <fct>,
## #   kids <fct>, orientation <fct>, weight <dbl>
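
Note that the pipe above only prints the 36 complete cases; to keep working with them, the result needs to be assigned. A sketch, using the hypothetical name dat_complete:

# Assign the complete cases so they can be reused downstream
dat_complete <- dat %>% drop_na()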

Some Visualization with ggplot

ggplot(dat, aes(x = weight)) +
    # use the kernel density estimate's bandwidth as the binwidth
    # (the original $age component does not exist on a density object,
    # so binwidth was NULL and ggplot fell back to 30 bins)
    geom_histogram(aes(y = ..density..), binwidth = density(dat$weight)$bw) +
    geom_density(fill = "red", alpha = 0.2) +
  labs(title = "Histogram for Weight", x = "Weight", y = "Density")