Exploratory Data Analysis

Q1. What does the row represent?
Q2. What type of data is the variable, education (i.e., numeric, character, logical)?
Q3. What type of R object is PUMS_cleaned (i.e., vector, matrix, data frame, list)? And why?
Q4. Describe the first observation (first row) using all variables.
Q5. How many people have BA or higher?
Q6. How many people have majored in finance?
Q7. Create a histogram for income. What’s the story (i.e., What’s the typical income, What’s the range of income most people make)?
Q8. What’s the top field of degree in terms of median income? How much do they make?

See below for information on the Public Use Microdata Sample (PUMS).

Public Use Microdata Sample (PUMS) Documentation: https://www.census.gov/programs-surveys/acs/technical-documentation/pums.html

2010 Census Public Use Microdata Area (PUMA) Reference Maps - New Hampshire: https://www.census.gov/geo/maps-data/maps/2010puma/st33_nh.html

# Load packages
library(tidyverse)

# Import data
PUMS_cleaned <- read.csv("~/R/busStat/Data/PUMS_cleaned.csv") %>% as_tibble()

PUMS_cleaned
## # A tibble: 67,248 x 6
##        X  PUMA   age education  field_of_degree           income
##    <int> <int> <int> <fct>      <fct>                      <int>
##  1     1  1000    87 lessthanBA <NA>                       11800
##  2     2   900    42 lessthanBA <NA>                        8800
##  3     3   800    43 BAorhigher English Language           10000
##  4     4   800    43 lessthanBA <NA>                      112000
##  5     5   800    14 lessthanBA <NA>                          NA
##  6     6   800    11 lessthanBA <NA>                          NA
##  7     7   900    63 lessthanBA <NA>                       23900
##  8     8   900    59 BAorhigher Early Childhood Education  34600
##  9     9   900    65 lessthanBA <NA>                        9400
## 10    10   300    50 lessthanBA <NA>                       18000
## # ... with 67,238 more rows

Q1. What does the row represent?

Each row represents one person in the New Hampshire PUMS sample, described by their PUMA (area of residence), age, education, field of degree, and income.

Q2. What type of data is the variable, education (i.e., numeric, character, logical)?

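The variable education is a factor, as the <fct> tag under its name in the printout shows. It is categorical data: the values look like text, but R stores them as one of two fixed levels, BAorhigher or lessthanBA.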
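The type can also be checked directly instead of being read off the printout. The sketch below uses a small made-up vector in place of PUMS_cleaned$education, so the values are illustrative only. Note that the <fct> tag reflects how the file was imported: from R 4.0.0 onward, read.csv() defaults to stringsAsFactors = FALSE, so the same import would give a character (<chr>) column unless that argument is set to TRUE.

```r
# Toy stand-in for PUMS_cleaned$education (illustrative values, not the real data)
education <- factor(c("lessthanBA", "lessthanBA", "BAorhigher", "lessthanBA"))

class(education)         # "factor"
levels(education)        # "BAorhigher" "lessthanBA"
typeof(education)        # "integer": the codes stored underneath the labels
is.character(education)  # FALSE
```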

Q3. What type of R object is PUMS_cleaned (i.e., vector, matrix, data frame, list)? And why?

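The R object PUMS_cleaned is a data frame (more precisely a tibble, which is a kind of data frame). It is laid out like a table: each row is one person, each column is one variable, and the columns hold different data types (integers and factors). A vector or a matrix could not do this, because all of their elements must share a single type.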
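A quick way to confirm the object type is to ask R for its class. This is a minimal sketch on a two-row toy tibble (the name toy and its contents are assumptions for illustration); running the same calls on PUMS_cleaned should behave the same way.

```r
library(dplyr)  # re-exports tibble()

# Toy stand-in for PUMS_cleaned with mixed column types
toy <- tibble(
  age       = c(87L, 42L),
  education = factor(c("lessthanBA", "lessthanBA")),
  income    = c(11800L, 8800L)
)

class(toy)          # "tbl_df" "tbl" "data.frame"
is.data.frame(toy)  # TRUE
is.matrix(toy)      # FALSE
sapply(toy, class)  # one class per column, showing the mixed types
```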

Q4. Describe the first observation (first row) using all variables.

Hint: Use View().

The first observation is an 87-year-old who lives in PUMA 1000, the Seacoast region of New Hampshire, which includes Portsmouth. This person has less than a BA, so the field of degree is NA, and has an income of $11,800 a year.
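View() opens an interactive viewer, so it leaves nothing in a knitted report. As a non-interactive alternative, the first row can be pulled out with dplyr::slice() and printed column by column with glimpse(). The sketch below rebuilds the first two rows from the printout above as a toy tibble (toy is an assumed name, not part of the assignment).

```r
library(dplyr)

# First two rows copied from the printed tibble above
toy <- tibble(
  X               = 1:2,
  PUMA            = c(1000L, 900L),
  age             = c(87L, 42L),
  education       = factor(c("lessthanBA", "lessthanBA")),
  field_of_degree = c(NA_character_, NA_character_),
  income          = c(11800L, 8800L)
)

first_row <- toy %>% slice(1)  # keep only the first observation
glimpse(first_row)             # one line per variable
```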

Q5. How many people have BA or higher?

Hint: Use count().

PUMS_cleaned %>% count(education)
## # A tibble: 2 x 2
##   education      n
##   <fct>      <int>
## 1 BAorhigher 18563
## 2 lessthanBA 48685

There are 18,563 people in this New Hampshire sample who have a BA or higher.
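The count is easier to interpret as a share of the sample. A minimal sketch, re-entering the two counts from the output above by hand (the names counts and shares are assumptions): 18,563 out of 67,248 works out to roughly 27.6%.

```r
library(dplyr)

# Counts copied from the count(education) output above
counts <- tibble(
  education = c("BAorhigher", "lessthanBA"),
  n         = c(18563L, 48685L)
)

# Add each group's share of the whole sample
shares <- counts %>% mutate(share = n / sum(n))
shares
```

On the full data, the same mutate() step can be piped directly after PUMS_cleaned %>% count(education).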

Q6. How many people have majored in finance?

Hint: Take PUMS_cleaned, pipe it to dplyr::count, and pipe it to dplyr::filter.

PUMS_cleaned %>% count(field_of_degree) %>% filter(field_of_degree == "Finance")
## # A tibble: 1 x 2
##   field_of_degree     n
##   <fct>           <int>
## 1 Finance           185

There are 185 people in this data set who majored in finance.
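The same answer can be reached by filtering first and counting second, which avoids building the full table of all fields. The sketch below uses a made-up five-row tibble (toy and its values are illustrative). Rows where field_of_degree is NA are dropped by filter(), because the comparison evaluates to NA rather than TRUE.

```r
library(dplyr)

# Toy stand-in with missing fields, as for people without a BA
toy <- tibble(
  field_of_degree = c(NA, "Finance", "English Language", "Finance", NA)
)

# Filter, then count the remaining rows
finance_n <- toy %>%
  filter(field_of_degree == "Finance") %>%
  count()

# Base R equivalent; na.rm = TRUE skips the NA comparisons
base_n <- sum(toy$field_of_degree == "Finance", na.rm = TRUE)
```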

Q7. Create a histogram for income. What’s the story (i.e., What’s the typical income, What’s the range of income most people make)?

Hint: Add scale_x_log10() to the code to transform the data and reveal its structure. Refer to the ggplot2 cheatsheet. Google it.

ggplot(PUMS_cleaned, aes(income)) + geom_histogram() + scale_x_log10()

The typical income is about $45,000, and most people make less than $100,000 a year.
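Numbers read off a log-scale histogram are approximate, so it can help to back them up with summary statistics. The sketch below is only an illustration: it uses the ten incomes shown in the printed rows at the top, which are far too few to reproduce the figures above. On the full data, the same calls would take PUMS_cleaned$income. The na.rm = TRUE argument is needed because income is NA for some people.

```r
# Incomes from the first ten printed rows (two are missing)
income <- c(11800, 8800, 10000, 112000, NA, NA, 23900, 34600, 9400, 18000)

# A "typical" income: the median, ignoring missing values
typical <- median(income, na.rm = TRUE)

# The range covering the middle half of incomes
middle_half <- quantile(income, probs = c(0.25, 0.75), na.rm = TRUE)

typical
middle_half
```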

Q8. What’s the top field of degree in terms of median income? How much do they make?

Hint: Take PUMS_cleaned, pipe it to dplyr::group_by, pipe it to dplyr::summarise, and pipe it to dplyr::arrange.

PUMS_cleaned %>% 
  group_by(field_of_degree) %>%
  summarise(median_income = median(income)) %>% 
  arrange(desc(median_income))
## # A tibble: 169 x 2
##    field_of_degree                             median_income
##    <fct>                                               <dbl>
##  1 Petroleum Engineering                              188000
##  2 Materials Science                                  154800
##  3 Nuclear Engineering                                148300
##  4 Physical Sciences                                  129000
##  5 Mechanical Engineering Related Technologies        111900
##  6 Pharmacy Pharmaceutical Sciences                   106700
##  7 Biological Engineering                             101800
##  8 Metallurgical Engineering                           99700
##  9 Naval Architecture                                  97000
## 10 Electrical Engineering                              94000
## # ... with 159 more rows

The top field of degree in terms of median income is Petroleum Engineering, with a median income of $188,000.
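One caution when ranking by median: a field with very few people can land at the top on the strength of one or two high earners, and the output above does not show how many people are in each field. A hedged sketch of how to check this, using made-up incomes (the toy tibble and its values are hypothetical, not PUMS figures), adds a group size next to each median and passes na.rm = TRUE so a single missing income does not turn a group's median into NA.

```r
library(dplyr)

# Hypothetical data: field names are real, incomes are invented
toy <- tibble(
  field_of_degree = c("Finance", "Finance", "Finance",
                      "Petroleum Engineering", NA, NA),
  income          = c(40000, 50000, NA, 188000, 11800, NA)
)

ranked <- toy %>%
  filter(!is.na(field_of_degree)) %>%          # drop people with no degree field
  group_by(field_of_degree) %>%
  summarise(
    n             = sum(!is.na(income)),       # people with a reported income
    median_income = median(income, na.rm = TRUE)
  ) %>%
  arrange(desc(median_income))

ranked
```

In this toy example the top field rests on a single person, which is the kind of pattern worth checking for in the real ranking.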