Exploratory Data Analysis

Q1. What does the row represent?
Q2. What type of data is the variable, education (i.e., numeric, character, logical)?
Q3. What type of R object is PUMS_cleaned (i.e., vector, matrix, data frame, list)? And why?
Q4. Describe the first observation (first row) using all variables.
Q5. How many people have BA or higher?
Q6. How many people have majored in finance?
Q7. Create a histogram for income. What’s the story (i.e., what’s the typical income, and what’s the range of income most people make)?
Q8. What’s the top field of degree in terms of median income? How much do they make?

See below for information on the Public Use Microdata Sample (PUMS).

Public Use Microdata Sample (PUMS) Documentation: https://www.census.gov/programs-surveys/acs/technical-documentation/pums.html

2010 Census Public Use Microdata Area (PUMA) Reference Maps - New Hampshire: https://www.census.gov/geo/maps-data/maps/2010puma/st33_nh.html

# Load packages
library(tidyverse)

# Import data
PUMS_cleaned <- read.csv("~/R/busStat/Data/PUMS_cleaned.csv") %>% as_tibble()

PUMS_cleaned
## # A tibble: 67,248 x 6
##        X  PUMA   age education  field_of_degree           income
##    <int> <int> <int> <fct>      <fct>                      <int>
##  1     1  1000    87 lessthanBA <NA>                       11800
##  2     2   900    42 lessthanBA <NA>                        8800
##  3     3   800    43 BAorhigher English Language           10000
##  4     4   800    43 lessthanBA <NA>                      112000
##  5     5   800    14 lessthanBA <NA>                          NA
##  6     6   800    11 lessthanBA <NA>                          NA
##  7     7   900    63 lessthanBA <NA>                       23900
##  8     8   900    59 BAorhigher Early Childhood Education  34600
##  9     9   900    65 lessthanBA <NA>                        9400
## 10    10   300    50 lessthanBA <NA>                       18000
## # ... with 67,238 more rows

Q1. What does the row represent?

Each row represents one person included in the sample, described by a PUMA code, age, education level, field of degree, and income.

Q2. What type of data is the variable, education (i.e., numeric, character, logical)?

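Character, in the sense that its values are text labels (lessthanBA, BAorhigher). In this tibble the column is stored as a factor (<fct>), which is R’s type for categorical data.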
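One way to confirm how R stores a column is to inspect it directly. The sketch below uses a small stand-in vector (values copied from the printed rows above, not the full data set); with the real data you would run the same calls on PUMS_cleaned$education.

```r
# Toy stand-in for the education column (an assumption for illustration).
education <- factor(c("lessthanBA", "lessthanBA", "BAorhigher", "lessthanBA"))

class(education)    # "factor": the class R reports for the column
typeof(education)   # "integer": factors are stored as integer codes
levels(education)   # the text labels behind those codes

# Convert to plain character strings if that is what a later step needs.
education_chr <- as.character(education)
class(education_chr)  # "character"
```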

Q3. What type of R object is PUMS_cleaned (i.e., vector, matrix, data frame, list)? And why?

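PUMS_cleaned is a data frame (more specifically a tibble, because it was converted with as_tibble()). It is rectangular: every column has the same number of rows, but different columns can hold different types of data, such as integers and factors.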
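The object type can also be checked in code rather than by eye. This is a minimal sketch on a three-row toy tibble (a stand-in built from the printed rows, not the real file); the same calls work on PUMS_cleaned.

```r
library(tibble)

# Toy stand-in with columns of different types (an assumption for illustration).
toy <- tibble(
  age       = c(87L, 42L, 43L),
  education = c("lessthanBA", "lessthanBA", "BAorhigher"),
  income    = c(11800L, 8800L, 10000L)
)

class(toy)          # "tbl_df" "tbl" "data.frame"
is.data.frame(toy)  # TRUE: a tibble is a kind of data frame
sapply(toy, class)  # each column has its own type
```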

Q4. Describe the first observation (first row) using all variables.

Hint: Use View(). The first observation (X = 1) is an 87-year-old living in PUMA 1000 (Rockingham County) who has less than a bachelor’s degree, no field of degree (NA), and an income of $11,800.
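As an alternative to View(), the first row can be pulled out in code. The sketch below assumes a toy tibble that copies the first three printed rows; on the real data, replace toy with PUMS_cleaned.

```r
library(dplyr)

# Toy stand-in copying the first three printed rows (an assumption for illustration).
toy <- tibble(
  X               = 1:3,
  PUMA            = c(1000L, 900L, 800L),
  age             = c(87L, 42L, 43L),
  education       = c("lessthanBA", "lessthanBA", "BAorhigher"),
  field_of_degree = c(NA, NA, "English Language"),
  income          = c(11800L, 8800L, 10000L)
)

# Keep only the first observation, then list every variable with its value.
first_row <- slice(toy, 1)
glimpse(first_row)
```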

Q5. How many people have BA or higher?

Hint: Use count(). There are 18,563 people who have a BA or higher.

PUMS_cleaned %>% count(education)
## # A tibble: 2 x 2
##   education      n
##   <fct>      <int>
## 1 BAorhigher 18563
## 2 lessthanBA 48685
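To put the count in context, it can be turned into a share of the whole sample. The sketch below re-enters the two counts printed above by hand (so it runs without the data file); with the real data you could pipe the count() result straight into mutate().

```r
library(dplyr)

# Counts copied from the count(education) output above.
education_counts <- tibble(
  education = c("BAorhigher", "lessthanBA"),
  n         = c(18563L, 48685L)
)

# Add each group's share of all rows.
education_share <- education_counts %>%
  mutate(share = n / sum(n))

education_share  # BAorhigher is roughly 27.6% of the sample
```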

Q6. How many people have majored in finance?

Hint: Take PUMS_cleaned, pipe it to dplyr::count, and pipe it to dplyr::filter. 185 people majored in finance.

PUMS_cleaned %>%
  count(field_of_degree) %>%
  filter(field_of_degree == "Finance")
## # A tibble: 1 x 2
##   field_of_degree     n
##   <fct>           <int>
## 1 Finance           185
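The same count can be reached by filtering first, or with base R. The sketch below uses a made-up five-row column (hypothetical values, chosen only to include missing entries) to show that both routes ignore NA fields; on the real data, replace toy with PUMS_cleaned.

```r
library(dplyr)

# Hypothetical toy column with missing values (an assumption for illustration).
toy <- tibble(
  field_of_degree = c(NA, "Finance", "English Language", "Finance", NA)
)

# Route 1: filter first, then count. filter() drops rows where the test is NA.
finance_rows <- toy %>%
  filter(field_of_degree == "Finance") %>%
  count(field_of_degree)

# Route 2: base R. na.rm = TRUE skips the NA comparisons.
n_finance <- sum(toy$field_of_degree == "Finance", na.rm = TRUE)
```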

Q7. Create a histogram for income. What’s the story (i.e, What’s the typical income, What’s the range of income most people make)?

Hint: Add scale_x_log10() to the code to transform the data and reveal its structure. Refer to the ggplot2 cheatsheet. Google it.

The typical income is around 70,000 to 80,000, and the range most people make is between 0 and 60,000.

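# Histogram of income on a log10 x-axis.
# Rows with missing income are dropped by ggplot2 with a warning.
ggplot(PUMS_cleaned, aes(income)) +
  geom_histogram() +
  scale_x_log10()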
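Reading a typical value and a typical range off a log-scaled histogram is easy to get wrong, so it can help to check the plot against numeric summaries. The sketch below runs on the ten income values printed at the top of this page (a stand-in, so its results say nothing about the full data set); on the real data, use PUMS_cleaned$income.

```r
# Income values copied from the first ten printed rows (NA = missing).
income <- c(11800, 8800, 10000, 112000, NA, NA, 23900, 34600, 9400, 18000)

# Typical income: the median is less sensitive to a few very high earners.
typical_income <- median(income, na.rm = TRUE)

# Range most people fall in: the middle 50% (25th to 75th percentile).
middle_half <- quantile(income, probs = c(0.25, 0.75), na.rm = TRUE)

typical_income
middle_half
```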

Q8. What’s the top field of degree in terms of median income? How much do they make?

Hint: Take PUMS_cleaned, pipe it to dplyr::group_by, pipe it to dplyr::summarise, and pipe it to dplyr::arrange.

The top field of degree in terms of median income is Petroleum Engineering, with a median income of $188,000.

PUMS_cleaned %>%
  group_by(field_of_degree) %>%
  summarise(median_income = median(income)) %>%
  arrange(desc(median_income))
## # A tibble: 169 x 2
##    field_of_degree                             median_income
##    <fct>                                               <dbl>
##  1 Petroleum Engineering                              188000
##  2 Materials Science                                  154800
##  3 Nuclear Engineering                                148300
##  4 Physical Sciences                                  129000
##  5 Mechanical Engineering Related Technologies        111900
##  6 Pharmacy Pharmaceutical Sciences                   106700
##  7 Biological Engineering                             101800
##  8 Metallurgical Engineering                           99700
##  9 Naval Architecture                                  97000
## 10 Electrical Engineering                              94000
## # ... with 159 more rows
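Two things are worth checking in a ranking like this: how missing incomes are handled, and how many people are behind each median. The sketch below is one possible variant, run on the ten rows printed at the top of this page (a stand-in, not the full data). It drops rows with no field of degree, ignores missing incomes with na.rm = TRUE, and reports the group size n next to each median. Whether this changes the ranking on the real data is not something the output above can tell us.

```r
library(dplyr)

# Toy stand-in copying the first ten printed rows (an assumption for illustration).
toy <- tibble(
  field_of_degree = c(NA, NA, "English Language", NA, NA, NA, NA,
                      "Early Childhood Education", NA, NA),
  income = c(11800, 8800, 10000, 112000, NA, NA, 23900, 34600, 9400, 18000)
)

median_by_field <- toy %>%
  filter(!is.na(field_of_degree)) %>%              # keep only people with a degree field
  group_by(field_of_degree) %>%
  summarise(
    n             = n(),                           # group size behind each median
    median_income = median(income, na.rm = TRUE)   # ignore missing incomes
  ) %>%
  arrange(desc(median_income))

median_by_field
```

A median based on only a handful of people is fragile, so the n column helps judge how much weight to give the top-ranked fields.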