Outcomes

Created by Nathan Garrett, updated 1/29/24

Outcomes:

- Create RMarkdown documents
- Understand syntax
  - Titles with hashes (levels 1-3)
  - Links with brackets and parens
  - Code rendering / output options
  - Code rendering defaults
- Publish your files online
- R functions
  - cor
  - table and prop.table

For help, see:

Guided Task 0: Set up markdown options

This block should run, but show neither its code nor its output.

For plots, we usually want to show the results, but not our code.

We sometimes will want to show the code and the output.
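The three cases above map onto knitr chunk options. The following is a minimal sketch, assuming the document is knitted with knitr; the chunk labels in the comments (`setup`, `my-plot`, `my-table`) are hypothetical names chosen for illustration.

```r
library(knitr)

# Document-wide defaults, usually set once in a setup chunk.
# Here: hide code everywhere unless a chunk says otherwise.
opts_chunk$set(echo = FALSE, message = FALSE, warning = FALSE)

# Individual chunks override the defaults in their header, for example:
#   {r setup, include=FALSE}   runs, but shows neither code nor output
#   {r my-plot, echo=FALSE}    shows the output only
#   {r my-table, echo=TRUE}    shows the code and the output
```

Setting a default and overriding it only where needed keeps the chunk headers short.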

# Show a table of over50k values
table(t$over50k)

## 
## <=50K  >50K 
## 24720  7841

Guided Task 1: Data cleanup

Go back and make a cleaner version of our dataset.

Add a new column called over50k01 that is numeric (1 or 0).
Use rename to change names from hyphens to underscores (e.g., from education-num to education_num).
Remove the fields capital_loss, education, fnlwgt, workclass, and capital_gain.
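One way to approach these three steps is sketched below on a small made-up tibble, so it runs without the census file. The tibble `toy` and its values are invented for illustration; only the column names follow the ones mentioned in this task.

```r
library(dplyr)

# Toy stand-in for the census data; the values are made up.
toy <- tibble(
  age = c(39, 50, 28),
  `education-num` = c(13, 9, 14),
  `hours-per-week` = c(40, 13, 45),
  fnlwgt = c(77516, 83311, 338409),
  over50k = c("<=50K", "<=50K", ">50K")
)

toy_clean <- toy %>%
  # 1 when the person earns over 50k, otherwise 0
  mutate(over50k01 = if_else(over50k == ">50K", 1, 0)) %>%
  # new_name = old_name; backticks are needed for names with hyphens
  rename(education_num = `education-num`,
         hours_per_week = `hours-per-week`) %>%
  # a minus sign drops a column
  select(-fnlwgt)

toy_clean
```

With many hyphenated names, `rename_with(~ gsub("-", "_", .x))` may be a shorter alternative to listing each column.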

Now, print the number of rows, the number over 50k, and the % of the rows earning over 50k.

Show your code and the result.

# Create a new tibble t1. Hint: summarise
t1 <- t %>% 
  summarise(n = n(),
            number_over_50k = sum(over50k01),
            # The mean of a 0/1 column is a proportion; multiply by 100 for a percentage
            percent_over_50k = 100 * mean(over50k01))

# Show the tibble
t1

Guided Task 2: Correlations

Use the cor function to show the correlation between different values.

Create a new tibble t2 with the numeric columns age, education_num, hours_per_week, and over50k01. Then, use cor(tibble_name) to find the correlation between those values.

# Create a smaller tibble containing only numeric columns.
t2 <- t %>% 
  select(age, education_num, hours_per_week, over50k01)

# Use the function cor to print the correlation between those variables
# to the terminal.
cor(t2)
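To see what cor returns, here is a small sketch on made-up numbers (the tibble `toy` is invented for illustration and is not the census data). The result is a square matrix with 1s on the diagonal, and each cell holds the correlation between the row variable and the column variable.

```r
library(dplyr)

# Made-up values, for illustration only.
toy <- tibble(
  age = c(25, 38, 52, 41, 30, 60),
  education_num = c(9, 13, 14, 10, 9, 16),
  over50k01 = c(0, 1, 1, 0, 0, 1)
)

# Correlation matrix of every column against every other column
cor_matrix <- cor(toy)
round(cor_matrix, 2)

# Pull out only the correlations with the outcome, strongest first
sort(cor_matrix[, "over50k01"], decreasing = TRUE)
```

Reading a single column of the matrix is often easier than scanning the whole grid when you only care about one outcome.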

Graph

Create a graph of the variable with the strongest correlation to over50k01 from the prior task. Pick a chart type that suits the data. Show the graphic, but not the code.
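As one possible shape for this graph, the sketch below plots the share of people earning over 50k for each value of a predictor. It assumes ggplot2 is installed and uses education_num purely as an example; check your own correlations to decide which variable to plot. The tibble `toy` is made up.

```r
library(dplyr)
library(ggplot2)

# Made-up values, for illustration only.
toy <- tibble(
  education_num = c(9, 9, 10, 13, 13, 14, 16, 16),
  over50k01 = c(0, 0, 0, 0, 1, 1, 1, 1)
)

# Share earning over 50k within each education level
share_by_edu <- toy %>%
  group_by(education_num) %>%
  summarise(share_over_50k = mean(over50k01))

# Bar chart of that share
p <- ggplot(share_by_edu, aes(x = factor(education_num), y = share_over_50k)) +
  geom_col() +
  labs(x = "education_num", y = "Share earning over 50k")
p
```

In the RMarkdown file, putting this in a chunk with `echo=FALSE` shows the plot without the code.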

Guided Task 3: Tables

Use the table function to show the % of people making over 50k by text field(s).

Start with table(text_column, over50k) to find the relationship between those values. Then, convert into proportions by wrapping the result of table with prop.table. Show the output only.

##     
##      <=50K >50K
##   1     51    0
##   2    162    6
##   3    317   16
##   4    606   40
##   5    487   27
##   6    871   62
##   7   1115   60
##   8    400   33
##   9   8826 1675
##   10  5904 1387
##   11  1021  361
##   12   802  265
##   13  3134 2221
##   14   764  959
##   15   153  423
##   16   107  306

##     
##             <=50K         >50K
##   1  0.0015662910 0.0000000000
##   2  0.0049752772 0.0001842695
##   3  0.0097355732 0.0004913854
##   4  0.0186112220 0.0012284635
##   5  0.0149565431 0.0008292129
##   6  0.0267497927 0.0019041184
##   7  0.0342434200 0.0018426952
##   8  0.0122846350 0.0010134824
##   9  0.2710604711 0.0514419090
##   10 0.1813212125 0.0425969718
##   11 0.0313565308 0.0110868831
##   12 0.0246306932 0.0081385707
##   13 0.0962501152 0.0682104358
##   14 0.0234636528 0.0294524124
##   15 0.0046988729 0.0129910015
##   16 0.0032861399 0.0093977458
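The second table above divides every cell by the grand total, so all cells together sum to 1. To get the share earning over 50k within each group, prop.table also accepts a margin argument. The sketch below uses a made-up data frame `toy`; the column name `sex` is an assumption about the dataset.

```r
# Made-up values, for illustration only.
toy <- data.frame(
  sex = c("Male", "Male", "Female", "Female", "Male", "Female"),
  over50k = c(">50K", "<=50K", "<=50K", "<=50K", ">50K", ">50K")
)

# Counts for each combination of sex and income group
counts <- table(toy$sex, toy$over50k)
counts

# Proportions of the whole table (all cells sum to 1)
prop.table(counts)

# Proportions within each row (each row sums to 1)
row_props <- prop.table(counts, margin = 1)
row_props
```

Row proportions are usually the easier ones to compare across groups, since a large group no longer dominates the table.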

Guided Task 4: Convert text into 0/1 variables

We often want to group variables, or turn text values into numbers for easier analysis.

Create a new tibble called t_numbers. Add:

is_male: 0/1 column
is_white: 0/1 column

Then, print the correlations between those two new columns and over50k01. Are any of these stronger than your other numeric columns?
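A comparison such as `sex == "Male"` returns TRUE or FALSE, and as.numeric turns that into 1 or 0. The sketch below shows the pattern on a made-up tibble; the column names `sex` and `race` and their values are assumptions about the dataset, and `toy_numbers` stands in for your own t_numbers.

```r
library(dplyr)

# Made-up values, for illustration only.
toy <- tibble(
  sex = c("Male", "Female", "Male", "Female", "Male"),
  race = c("White", "Black", "White", "White", "Asian-Pac-Islander"),
  over50k01 = c(1, 0, 1, 0, 0)
)

# TRUE/FALSE comparisons become 1/0 columns
toy_numbers <- toy %>%
  mutate(is_male = as.numeric(sex == "Male"),
         is_white = as.numeric(race == "White"))

# Correlations between the new columns and the outcome
toy_numbers %>%
  select(is_male, is_white, over50k01) %>%
  cor()
```

The same pattern works for any text value you want to test, one new column per value.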

On your own!

Your goal is to find or create a variable with the highest correlation to over50k01. Try different text values, and then turn them into 0/1 fields. Print out the correlations, as well as a visualization of each of the variables.

Start by viewing a visualization of each key variable. Then, pull out variables as needed.

US Census