Created by Nathan Garrett, updated 1/29/24
Outcomes:
For help, see:
This block should run, but should not show its code or its output.
# Libraries
library(tidyverse)
# Open data
data_url <- "https://raw.githubusercontent.com/profgarrett/profgarrettdata/main/census_adults_making_over_50k.csv"
# Open up the data
t_raw <- read_csv(data_url,
                  comment = "|")
## Rows: 32561 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): workclass, education, marital-status, occupation, relationship, rac...
## dbl (6): age, fnlwgt, education-num, capital-gain, capital-loss, hours-per-week
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Add our data clean-up here
t <- t_raw
# Use View(t) to make sure that your changes worked
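As a rough sketch of what the clean-up could look like, the example below renames hyphenated columns and adds a 0/1 income column. It runs on a small made-up tibble (`t_demo_raw`, hypothetical) rather than the real data, and it assumes the income text column is called `over50k` with the value ">50K" for high earners; check the actual values in your data before copying this pattern.

```r
library(tidyverse)

# Hypothetical stand-in for t_raw (the real data has 15 columns)
t_demo_raw <- tibble(
  age = c(39, 50, 38, 53),
  `education-num` = c(13, 13, 9, 7),
  `hours-per-week` = c(40, 13, 40, 40),
  over50k = c("<=50K", ">50K", "<=50K", ">50K")  # assumed values
)

t_demo <- t_demo_raw |>
  # rename(new_name = old_name); backticks are needed around hyphenated names
  rename(
    education_num = `education-num`,
    hours_per_week = `hours-per-week`
  ) |>
  # Turn the text column into a 0/1 numeric column
  mutate(over50k01 = if_else(over50k == ">50K", 1, 0))

names(t_demo)
```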
For plots, we usually want to show the results, but not our code.
# Print a histogram of t$age, using the hist function.
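A minimal sketch of a histogram call, using a short made-up vector (`ages`, hypothetical) in place of t$age. The title and axis label are illustrative choices. If this document is knitted with knitr, the chunk option echo=FALSE is the usual way to show the plot without the code.

```r
# Hypothetical ages standing in for t$age
ages <- c(39, 50, 38, 53, 28, 37, 49, 52, 31, 42)

# hist() draws the plot and (invisibly) returns the bin counts
h <- hist(ages,
          main = "Distribution of age",
          xlab = "Age")
```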
We sometimes will want to show the code and the output.
# Show a table of over50k values
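One possible way to count the values, sketched on a made-up vector (`over50k`, hypothetical values) rather than the real column:

```r
# Hypothetical values standing in for t$over50k
over50k <- c("<=50K", ">50K", "<=50K", "<=50K", ">50K")

# table() counts how many times each distinct value appears
counts <- table(over50k)
counts
```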
Go back and make a cleaner version of our dataset.
Use rename to change column names from hyphens to underscores
(e.g. from education-num to education_num).
Now, print the number of rows, the number over 50k, and the % of the rows earning over 50k.
Show your code and the result.
# Create a new tibble t1. Hint: summarise
# Show the tibble
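The summarise hint could be followed roughly like this. The tibble `t_demo` and the output column names (`n_rows`, `n_over50k`, `pct_over50k`) are made up for illustration, and the sketch assumes a 0/1 column named over50k01 already exists.

```r
library(tidyverse)

# Hypothetical data: 1 = earns over 50k, 0 = does not
t_demo <- tibble(over50k01 = c(0, 1, 0, 0, 1, 0, 0, 1))

t1_demo <- t_demo |>
  summarise(
    n_rows = n(),                      # number of rows
    n_over50k = sum(over50k01),        # rows with over50k01 == 1
    pct_over50k = 100 * mean(over50k01) # the mean of a 0/1 column is a proportion
  )

t1_demo
```

Because the column only holds 0 and 1, its sum is a count and its mean is a share, which is why no filter step is needed here.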
Use the cor function to show the correlation between different values.
Create a new tibble t2 with over50k01, age, and education_num (cor only accepts numeric columns, so use education_num rather than the text column education). Then, use
cor(tibble_name) to find the correlation between those values.
# Create a smaller tibble containing only numeric columns.
# Use the function cor to print the correlation between those variables
# to the terminal.
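A sketch of the select-then-cor pattern on made-up numbers. The tibble `t_demo`, its values, and the name `t2_demo` are hypothetical; the column names assume the clean-up step has already produced over50k01 and education_num.

```r
library(tidyverse)

# Hypothetical data with one text column mixed in
t_demo <- tibble(
  over50k01 = c(0, 1, 0, 1, 0, 1, 0, 1),
  age = c(25, 52, 31, 47, 38, 60, 22, 44),
  education_num = c(9, 13, 10, 14, 9, 16, 10, 13),
  workclass = c("Private", "State-gov", "Private", "Private",
                "Self-emp", "Private", "Private", "State-gov")
)

# Keep only numeric columns; cor() stops with an error on text columns
t2_demo <- t_demo |>
  select(over50k01, age, education_num)

# cor() returns a matrix with one row and one column per variable
cor_matrix <- cor(t2_demo)
round(cor_matrix, 2)
```

The row (or column) for over50k01 is the one to read: it shows how strongly each other variable moves with income.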
Create a graph showing the variable with the best correlation from the prior task. Pick a useful graphic. Show the graphic, but not the code.
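One graphic that may suit a numeric variable compared against a 0/1 outcome is a boxplot, sketched here with base R on made-up data. The data frame `t_demo` and the choice of education_num are assumptions for illustration; use whichever variable actually correlated best in your own results.

```r
# Hypothetical data standing in for the cleaned tibble
t_demo <- data.frame(
  over50k01 = c(0, 1, 0, 1, 0, 1, 0, 1),
  education_num = c(9, 13, 10, 14, 9, 16, 10, 13)
)

# One box per income group; the formula reads "education_num split by over50k01"
b <- boxplot(education_num ~ over50k01,
             data = t_demo,
             xlab = "Earns over 50k (0 = no, 1 = yes)",
             ylab = "education_num")
```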
Use the table function to show the % of people making over 50k by text field(s).
Start with table(text_column, over50k) to find the
relationship between those values. Then, convert into proportions by
wrapping the result of table with prop.table.
Show the output only.
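The table and prop.table steps might look like the sketch below, which uses two made-up vectors (`sex` and `over50k`, hypothetical values). The margin = 1 argument is an optional choice made here so that each row sums to 1; without it, prop.table divides every cell by the grand total.

```r
# Hypothetical values standing in for two columns of t
sex <- c("Male", "Female", "Male", "Male", "Female", "Male", "Female", "Male")
over50k <- c(">50K", "<=50K", "<=50K", ">50K", "<=50K", ">50K", ">50K", "<=50K")

# Cross-tabulate: rows are sex, columns are income group
counts <- table(sex, over50k)

# Convert counts into proportions within each row
props <- prop.table(counts, margin = 1)
round(props, 2)
```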
We often want to group variables, or turn text values into numbers for easier analysis.
Create a new tibble called t_numbers. Add:
is_male: 0/1 column
is_white: 0/1 column
Then, print a correlation test with those two new columns and over50k01. Are any of these better than your other numeric columns?
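A sketch of turning text columns into 0/1 columns with mutate, on made-up data. The tibble `t_demo`, the name `t_numbers_demo`, and the text values "Male" and "White" are assumptions; the 0/1 income column is called over50k01 here. Confirm the exact spellings in your data, for example with table(t$sex).

```r
library(tidyverse)

# Hypothetical data standing in for the cleaned tibble
t_demo <- tibble(
  sex = c("Male", "Female", "Male", "Male", "Female", "Male", "Female", "Male"),
  race = c("White", "Black", "White", "Asian-Pac-Islander",
           "White", "White", "Black", "White"),
  over50k01 = c(1, 0, 0, 1, 0, 1, 1, 0)
)

t_numbers_demo <- t_demo |>
  mutate(
    is_male = if_else(sex == "Male", 1, 0),    # 1 when sex is "Male", else 0
    is_white = if_else(race == "White", 1, 0)  # 1 when race is "White", else 0
  ) |>
  select(over50k01, is_male, is_white)

# Correlation matrix for the three 0/1 columns
cor(t_numbers_demo)
```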
Your goal is to find or create a variable with the highest correlation to over50k01. Try different text values, and then turn them into 0/1 fields. Print out the correlations, as well as a visualization of each of the variables.
Start by viewing a visualization of each key variable. Then, pull out variables as needed.
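One way to try many text values at once is to loop over the distinct values of a column, build a 0/1 indicator for each, and correlate it with the income column. The sketch below does this on made-up data; the data frame `t_demo`, the column marital_status, and its values are hypothetical.

```r
# Hypothetical data standing in for the cleaned tibble
t_demo <- data.frame(
  marital_status = c("Married", "Never-married", "Married", "Divorced",
                     "Married", "Never-married", "Divorced", "Married"),
  over50k01 = c(1, 0, 1, 0, 1, 0, 0, 0)
)

# Every distinct text value is a candidate 0/1 field
candidates <- unique(t_demo$marital_status)

# For each value, build the indicator and correlate it with over50k01
cors <- sapply(candidates, function(value) {
  indicator <- as.numeric(t_demo$marital_status == value)
  cor(indicator, t_demo$over50k01)
})

# Strongest positive correlation first
sort(cors, decreasing = TRUE)
```

A large negative correlation can be just as useful as a large positive one, so it may be worth sorting by abs(cors) as well.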