Outcomes

Created by Nathan Garrett, updated 1/29/24

Outcomes:

For help, see:

Guided Task 0: Setup markdown options

This block should run, but have no output or code

# Libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Location of the census data
data_url <- "https://raw.githubusercontent.com/profgarrett/profgarrettdata/main/census_adults_making_over_50k.csv"

# Read the CSV; any text after a | is ignored as a comment
t_raw <- read_csv(data_url,
                  comment = "|")
## Rows: 32561 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): workclass, education, marital-status, occupation, relationship, rac...
## dbl (6): age, fnlwgt, education-num, capital-gain, capital-loss, hours-per-week
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Add our data clean-up here
t <- t_raw


# Use View(t) to make sure that your changes worked

For plots, we usually want to show the results, but not our code.

# Print a histogram of t$age, using the hist function.
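As a rough sketch of the histogram step, the block below uses a handful of made-up ages in place of t$age (the values and the 10-year bins are assumptions for illustration only). In R Markdown, setting the chunk option echo=FALSE keeps the plot in the knitted document while hiding the code.

```r
# Toy ages standing in for t$age (made-up values, not the census data)
age <- c(22, 25, 31, 38, 38, 41, 45, 52, 59, 67)

# hist() draws the plot and invisibly returns the bin information
h <- hist(age,
          breaks = seq(20, 70, by = 10),  # 10-year bins, an arbitrary choice
          main = "Distribution of age",
          xlab = "Age")

# Number of observations in each bin
h$counts
```

With the real data, the same call would be hist(t$age), with the breaks argument left out or tuned to taste.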

We sometimes will want to show the code and the output.

# Show a table of over50k values
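A minimal sketch of the table step, assuming the income label is stored as text with the values "<=50K" and ">50K" (the vector below is made up; the real column name and labels may differ).

```r
# Made-up labels standing in for the over50k column
over50k <- c("<=50K", ">50K", "<=50K", "<=50K", ">50K", "<=50K")

# table() counts how often each distinct value appears
counts <- table(over50k)
counts
```

Leaving echo at its default of TRUE in the chunk options shows both the code and the resulting counts.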

Guided Task 1: Data cleanup

Go back and make a cleaner version of our dataset.
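One possible shape for the clean-up, sketched on a tiny stand-in for t_raw. The column names over50k and education_num and the ">50K" label are assumptions here; check the real names with names(t_raw) before adapting it.

```r
library(dplyr)

# Toy stand-in for t_raw; names and values are assumptions, not the census data
t_raw <- tibble(
  age = c(39, 50, 38, 53, 28),
  `education-num` = c(13, 13, 9, 7, 13),
  over50k = c("<=50K", ">50K", "<=50K", "<=50K", ">50K")
)

t <- t_raw %>%
  # Replace the hyphen so the column can be used without backticks
  rename(education_num = `education-num`) %>%
  # Add a numeric 0/1 version of the text label
  mutate(over50k01 = if_else(over50k == ">50K", 1, 0))

t
```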

Now print the number of rows, the number of rows earning over 50k, and the percentage of rows earning over 50k.

Show your code and the result.

# Create a new tibble t1. Hint: summarise

# Show the tibble
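A minimal sketch of the summary, assuming the cleaned tibble already has a 0/1 column named over50k01 (the eight rows below are made up). Because the column is 0/1, its sum is the count of high earners and its mean is their share.

```r
library(dplyr)

# Toy stand-in for the cleaned tibble t
t <- tibble(over50k01 = c(0, 1, 0, 0, 1, 0, 0, 1))

t1 <- t %>%
  summarise(
    n_rows = n(),                       # total number of rows
    n_over50k = sum(over50k01),         # rows where over50k01 is 1
    pct_over50k = 100 * mean(over50k01) # share of rows, as a percentage
  )

t1
```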

Guided Task 2: Correlations

Use the cor function to show the correlation between different values.

Create a new tibble t2 with over50k01, age, and education (use the numeric education-num column, since cor only accepts numeric input). Then, use cor(tibble_name) to find the correlation between those values.

# Create a smaller tibble containing only numeric columns.

# Use the function cor to print the correlation between those variables
# to the terminal.
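The correlation step might look like the sketch below. It uses a plain data.frame with made-up values to stay self-contained; the same cor call works on a tibble, and the column name education_num is an assumption about how the clean-up renamed education-num.

```r
# Made-up rows standing in for the cleaned data
d <- data.frame(
  over50k01 = c(0, 0, 0, 1, 0, 1, 1, 1),
  age = c(23, 28, 35, 41, 30, 52, 47, 60),
  education_num = c(9, 10, 9, 13, 10, 14, 13, 16),
  workclass = c("Private", "Private", "State-gov", "Private",
                "Private", "Self-emp", "Private", "Federal-gov")
)

# Keep only the numeric columns; cor() stops with an error on text
t2 <- d[, c("over50k01", "age", "education_num")]

# Pairwise Pearson correlations between every pair of columns
m <- cor(t2)
round(m, 2)
```

The over50k01 row (or column) of the matrix is the one to read: it shows how strongly each variable moves with income.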

Graph

Create a graph showing the variable with the strongest correlation to over50k01 from the prior task. Pick a graphic that suits the data. Show the graphic, but not the code.
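One option for comparing a numeric variable across the two income groups is a box plot. The sketch below uses base R and made-up values, and it assumes age turned out to be the variable of interest; a ggplot2 chart would do just as well.

```r
# Made-up rows standing in for the cleaned data
d <- data.frame(
  over50k01 = c(0, 0, 0, 1, 0, 1, 1, 1),
  age = c(23, 28, 35, 41, 30, 52, 47, 60)
)

# One box per income group; boxplot() also returns the plotted statistics
b <- boxplot(age ~ over50k01,
             data = d,
             names = c("50k or less", "Over 50k"),
             xlab = "Income group",
             ylab = "Age")

# Row 3 of b$stats holds the median of each group
b$stats[3, ]
```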

Guided Task 3: Tables

Use the table function to show the % of people making over 50k by text field(s).

Start with table(text_column, over50k) to find the relationship between those values. Then, convert into proportions by wrapping the result of table with prop.table. Show the output only.
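The two steps above can be sketched as follows, using made-up sex and over50k values. Note one addition that is not in the task text: margin = 1 makes each row sum to 1, so every row reads as "share of this group over or under 50k". Without it, prop.table divides by the grand total instead.

```r
# Made-up values standing in for a text column and the income label
sex <- c("Male", "Male", "Female", "Male", "Female", "Female", "Male", "Male")
over50k <- c(">50K", "<=50K", "<=50K", ">50K", "<=50K", ">50K", "<=50K", ">50K")

# Cross-tabulate the counts: one row per sex, one column per income label
counts <- table(sex, over50k)

# Convert counts to row-wise proportions
props <- prop.table(counts, margin = 1)
round(props, 2)
```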

Guided Task 4: Convert text into 0/1 variables

We often want to group variables, or turn text values into numbers for easier analysis.

Create a new tibble called t_numbers. Add:

Then, print a correlation test with those two new columns and over50k01. Are any of these better than your other numeric columns?
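As an illustration of turning text into 0/1 columns, the sketch below builds two hypothetical flags, is_married and is_bachelors. These names, the category labels, and the rows are all assumptions chosen for the example, not the columns the task asks for.

```r
library(dplyr)

# Made-up rows; column names and category labels are assumptions
t <- tibble(
  marital_status = c("Married", "Never-married", "Married",
                     "Divorced", "Never-married", "Married"),
  education = c("Bachelors", "HS-grad", "Masters",
                "HS-grad", "Bachelors", "Bachelors"),
  over50k01 = c(1, 0, 1, 0, 1, 0)
)

t_numbers <- t %>%
  mutate(
    # 1 when the text matches the category of interest, otherwise 0
    is_married = if_else(marital_status == "Married", 1, 0),
    is_bachelors = if_else(education == "Bachelors", 1, 0)
  ) %>%
  select(over50k01, is_married, is_bachelors)

# Correlation matrix for all three columns
cor(t_numbers)

# Significance test for a single pair
cor.test(t_numbers$is_married, t_numbers$over50k01)
```

To group several categories into one flag, the comparison can be swapped for %in%, for example education %in% c("Bachelors", "Masters").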

On your own!

Your goal is to find or create a variable with the highest correlation to over50k01. Try different text values, and then turn them into 0/1 fields. Print out the correlations, as well as a visualization of each of the variables.

Start by viewing a visualization of each key variable. Then, pull out variables as needed.
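Once several 0/1 fields exist, one way to compare them is to compute each field's correlation with over50k01 and sort by absolute value, since a strong negative correlation is as informative as a strong positive one. The flag names and values below are hypothetical.

```r
# Made-up 0/1 flags; the column names are placeholders for your own fields
d <- data.frame(
  over50k01  = c(1, 0, 1, 0, 1, 0, 1, 0),
  is_married = c(1, 0, 1, 0, 1, 1, 1, 0),
  is_private = c(1, 1, 0, 1, 0, 1, 1, 0),
  is_male    = c(1, 0, 1, 1, 0, 0, 1, 0)
)

# Every column except the target is a candidate
candidates <- setdiff(names(d), "over50k01")

# Correlation of each candidate with the target
cors <- sapply(candidates, function(col) cor(d[[col]], d$over50k01))

# Strongest relationship first, ignoring the sign
ranked <- sort(abs(cors), decreasing = TRUE)
ranked
```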