Analyzing US Census Data in R

Q1. What does the row represent?
Q2. What type of data is the variable, education (i.e., numeric, character, logical)?
Q3. What type of R object is PUMS_cleaned (i.e., vector, matrix, data frame, list)? And why?
Q4. Describe the first observation (first row) using all variables.
Q5. How many people have BA or higher?
Q6. How many people have majored in finance?
Q7. Create a histogram for income. What’s the story (i.e., what’s the typical income, and what’s the range of income most people make)?
Q8. What’s the top field of degree in terms of median income? How much do they make?

See below for information on the Public Use Microdata Sample (PUMS).

Public Use Microdata Sample (PUMS) Documentation: https://www.census.gov/programs-surveys/acs/technical-documentation/pums.html

2010 Census Public Use Microdata Area (PUMA) Reference Maps - New Hampshire: https://www.census.gov/geo/maps-data/maps/2010puma/st33_nh.html

# Load packages
library(tidyverse)

# Import data
PUMS_cleaned <- read.csv("~/R/business sat/DATA/PUMS_cleaned.csv") %>% as_tibble()

PUMS_cleaned
## # A tibble: 67,248 x 6
##        X  PUMA   age education  field_of_degree           income
##    <int> <int> <int> <fct>      <fct>                      <int>
##  1     1  1000    87 lessthanBA <NA>                       11800
##  2     2   900    42 lessthanBA <NA>                        8800
##  3     3   800    43 BAorhigher English Language           10000
##  4     4   800    43 lessthanBA <NA>                      112000
##  5     5   800    14 lessthanBA <NA>                          NA
##  6     6   800    11 lessthanBA <NA>                          NA
##  7     7   900    63 lessthanBA <NA>                       23900
##  8     8   900    59 BAorhigher Early Childhood Education  34600
##  9     9   900    65 lessthanBA <NA>                        9400
## 10    10   300    50 lessthanBA <NA>                       18000
## # ... with 67,238 more rows

Q1. What does the row represent?

Each row represents one person in the PUMS sample, described by an ID (X), PUMA, age, education, field of degree, and income.

Q2. What type of data is the variable, education (i.e., numeric, character, logical)?

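Factor (categorical), as shown by the <fct> tag under the column name. It is not numeric: its values are the labels BAorhigher and lessthanBA.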
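A quick way to check a column's type is class(). The sketch below uses a small toy data frame (the name `toy` is made up here) copied from the first three printed rows, with education stored as a factor to mirror the <fct> tag in the output; it is an illustration, not a run against the real file.

```r
# Toy data frame copied from the first three printed rows (not the full file).
# education is built as a factor on purpose, to mirror the <fct> tag above.
toy <- data.frame(
  age       = c(87L, 42L, 43L),
  education = factor(c("lessthanBA", "lessthanBA", "BAorhigher")),
  income    = c(11800L, 8800L, 10000L)
)

education_class  <- class(toy$education)   # how R stores the column
education_levels <- levels(toy$education)  # the distinct categories

print(education_class)
print(education_levels)
```

Since R 4.0.0, read.csv() keeps text columns as character unless stringsAsFactors = TRUE is set, so running the import on a newer R version would likely show <chr> instead of <fct>.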

Q3. What type of R object is PUMS_cleaned (i.e., vector, matrix, data frame, list)? And why?

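A data frame (more precisely a tibble, as the printed header shows). Its columns hold different data types (integers and factors), which a vector or matrix cannot do, and every column has the same length, which a general list does not require.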
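The object type can also be checked in code rather than read off the printout. This is a minimal sketch on a toy data frame (the name `toy` and its three rows are taken from the printed output for illustration only); is.data.frame() and sapply() are base R.

```r
# Toy data frame with mixed column types, copied from the first printed rows.
toy <- data.frame(
  PUMA      = c(1000L, 900L, 800L),
  education = factor(c("lessthanBA", "lessthanBA", "BAorhigher")),
  income    = c(11800L, 8800L, 10000L)
)

is_df          <- is.data.frame(toy)  # TRUE for data frames (and for tibbles)
column_classes <- sapply(toy, class)  # one class per column

print(is_df)
print(column_classes)
```

Seeing more than one class across columns is what rules out a matrix, which can hold only a single type.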

Q4. Describe the first observation (first row) using all variables.

The first person (X = 1) lives in PUMA 1000, is 87 years old, has less than a BA, has no field of degree (NA), and has an income of 11,800.
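To read a single observation without scanning the printout, the first row can be pulled out with dplyr::slice() and shown column by column with glimpse(). The sketch assumes a toy tibble (hypothetical name `toy`) built from the first three printed rows.

```r
library(dplyr)

# Toy tibble copied from the first three printed rows (not the full file).
toy <- tibble(
  X               = 1:3,
  PUMA            = c(1000L, 900L, 800L),
  age             = c(87L, 42L, 43L),
  education       = c("lessthanBA", "lessthanBA", "BAorhigher"),
  field_of_degree = c(NA, NA, "English Language"),
  income          = c(11800L, 8800L, 10000L)
)

# Keep only the first observation and print one variable per line.
first_row <- toy %>% slice(1)
glimpse(first_row)
```

On the real data the same two steps would be `PUMS_cleaned %>% slice(1) %>% glimpse()`.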

Q5. How many people have BA or higher?

PUMS_cleaned %>% count(education)
## # A tibble: 2 x 2
##   education      n
##   <fct>      <int>
## 1 BAorhigher 18563
## 2 lessthanBA 48685

18,563 people have a BA or higher, about 27.6% of the 67,248 people in the data.
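The count and the share can also be computed directly instead of being read from the count() table. A minimal sketch on made-up toy data (the tibble `toy` and its five rows are illustrative, not taken from the file):

```r
library(dplyr)

# Made-up toy data: five people, two of them with a BA or higher.
toy <- tibble(
  education = c("lessthanBA", "lessthanBA", "BAorhigher",
                "lessthanBA", "BAorhigher")
)

# Number of people with a BA or higher.
n_ba <- toy %>%
  filter(education == "BAorhigher") %>%
  nrow()

# Share of people with a BA or higher: the mean of a TRUE/FALSE vector.
share_ba <- mean(toy$education == "BAorhigher")

print(n_ba)
print(share_ba)
```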

Q6. How many people have majored in finance?

Hint: Take PUMS_cleaned, pipe it to dplyr::count, and pipe it to dplyr::filter.

PUMS_cleaned %>%
  count(field_of_degree) %>%
  filter(field_of_degree == "Finance")
## # A tibble: 1 x 2
##   field_of_degree     n
##   <fct>           <int>
## 1 Finance           185

185 people majored in finance.
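The two steps can also be run in the other order, filtering first and counting second. The sketch below, on made-up toy data (hypothetical tibble `toy`), also shows that filter() silently drops rows where field_of_degree is NA, because the comparison returns NA rather than TRUE.

```r
library(dplyr)

# Made-up toy data: two finance majors, one other major, two people without a degree field.
toy <- tibble(
  field_of_degree = c(NA, "Finance", "English Language", "Finance", NA)
)

# filter() keeps only rows where the condition is TRUE, so NA rows are dropped.
finance_n <- toy %>%
  filter(field_of_degree == "Finance") %>%
  count() %>%
  pull(n)

print(finance_n)
```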

Q7. Create a histogram for income. What’s the story (i.e., what’s the typical income, and what’s the range of income most people make)?

Hint: Add scale_x_log10() to the code to transform the data and reveal its structure. Refer to the ggplot2 cheatsheet. Google it.

ggplot(PUMS_cleaned, aes(income)) +
  geom_histogram() +
  scale_x_log10()
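The "typical income" and "range most people make" parts of the question can be backed with numbers alongside the plot. This sketch uses only the ten incomes printed at the top of the document, so the values it produces describe those ten rows, not the full data; the variable names are made up for illustration.

```r
# Incomes from the first ten printed rows (two are missing).
incomes <- c(11800, 8800, 10000, 112000, NA, NA, 23900, 34600, 9400, 18000)

# Typical income: the median, ignoring missing values.
typical <- median(incomes, na.rm = TRUE)

# Range most people fall in: the middle half (25th to 75th percentile).
middle_half <- quantile(incomes, probs = c(0.25, 0.75), na.rm = TRUE)

print(typical)
print(middle_half)
```

On the real data the same calls would take `PUMS_cleaned$income`. Note that na.rm = TRUE is needed here because income is NA for some rows, and that ggplot2 drops those rows from the histogram with a warning.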

Q8. What’s the top field of degree in terms of median income? How much do they make?

Hint: Take PUMS_cleaned, pipe it to dplyr::group_by, pipe it to dplyr::summarise, and pipe it to dplyr::arrange.

PUMS_cleaned %>%
  group_by(field_of_degree) %>%
  summarise(median_income = median(income), n = n()) %>%
  arrange(desc(median_income))
## # A tibble: 169 x 3
##    field_of_degree                             median_income     n
##    <fct>                                               <dbl> <int>
##  1 Petroleum Engineering                              188000     1
##  2 Materials Science                                  154800     4
##  3 Nuclear Engineering                                148300     7
##  4 Physical Sciences                                  129000     7
##  5 Mechanical Engineering Related Technologies        111900    14
##  6 Pharmacy Pharmaceutical Sciences                   106700    83
##  7 Biological Engineering                             101800     8
##  8 Metallurgical Engineering                           99700     7
##  9 Naval Architecture                                  97000    18
## 10 Electrical Engineering                              94000   465
## # ... with 159 more rows

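Petroleum Engineering has the highest median income, at 188,000. However, that figure comes from a single person (n = 1), so it says little about the field as a whole.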
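Because several of the top fields have very few people, one option is to rank only fields with a minimum group size. The sketch below runs on made-up toy data; the tibble `toy`, its income values, and the cutoff `min_n` are all assumptions chosen for illustration, not values from the assignment.

```r
library(dplyr)

# Made-up toy data: one tiny group, two larger groups, and one row with no field.
toy <- tibble(
  field_of_degree = c("Petroleum Engineering",
                      rep("Electrical Engineering", 3),
                      rep("Finance", 3),
                      NA),
  income = c(188000, 94000, 90000, 99000, 60000, 75000, NA, 20000)
)

min_n <- 3  # hypothetical cutoff; pick a value that suits the real data

ranked <- toy %>%
  filter(!is.na(field_of_degree)) %>%                 # drop people without a degree field
  group_by(field_of_degree) %>%
  summarise(median_income = median(income, na.rm = TRUE),  # ignore missing incomes
            n = n()) %>%
  filter(n >= min_n) %>%                              # keep only fields with enough people
  arrange(desc(median_income))

print(ranked)
```

With a cutoff like this, single-person fields no longer sit at the top of the ranking, at the cost of having to justify the threshold.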