1. Data wrangling and visualization with college data

We will explore data on college majors and earnings, specifically the data behind the FiveThirtyEight story “The Economic Guide To Picking A College Major”.

This week we will use the read_csv function to read in our csv file:

We read it in with the read_csv function and save the result as a new data frame called college_recent_grads. Because read_csv is a function from the tidyverse, the result is a tibble, the tidyverse's version of a data frame.

college_recent_grads = read_csv("https://raw.githubusercontent.com/haihaolu/BUSN32100/master/data_files/recent-grads.csv")

## Rows: 173 Columns: 21
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (2): major, major_category
## dbl (19): rank, major_code, total, sample_size, men, women, sharewomen, empl...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
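The message above can be silenced with the show_col_types argument it mentions. A minimal sketch, using a small inline CSV with made-up values instead of the course URL; passing literal text through I() assumes readr 2.0 or later:

```r
library(readr)

# Hypothetical two-row CSV passed as literal text via I() (readr >= 2.0)
toy_csv <- I("major,median\nMajor A,37000\nMajor B,26000\n")

# show_col_types = FALSE suppresses the column specification message
toy <- read_csv(toy_csv, show_col_types = FALSE)
toy
```

The same argument can be added to the read_csv call that loads recent-grads.csv.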

college_recent_grads is a tidy data frame, with each row representing an observation and each column representing a variable.

To view the data, you can take a quick peek at your data frame and view its dimensions with the glimpse function.

glimpse(college_recent_grads)

## Rows: 173
## Columns: 21
## $ rank                        <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,…
## $ major_code                  <dbl> 2419, 2416, 2415, 2417, 2405, 2418, 6202, …
## $ major                       <chr> "Petroleum Engineering", "Mining And Miner…
## $ major_category              <chr> "Engineering", "Engineering", "Engineering…
## $ total                       <dbl> 2339, 756, 856, 1258, 32260, 2573, 3777, 1…
## $ sample_size                 <dbl> 36, 7, 3, 16, 289, 17, 51, 10, 1029, 631, …
## $ men                         <dbl> 2057, 679, 725, 1123, 21239, 2200, 2110, 8…
## $ women                       <dbl> 282, 77, 131, 135, 11021, 373, 1667, 960, …
## $ sharewomen                  <dbl> 0.12, 0.10, 0.15, 0.11, 0.34, 0.14, 0.44, …
## $ employed                    <dbl> 1976, 640, 648, 758, 25694, 1857, 2912, 15…
## $ employed_fulltime           <dbl> 1849, 556, 558, 1069, 23170, 2038, 2924, 1…
## $ employed_parttime           <dbl> 270, 170, 133, 150, 5180, 264, 296, 553, 1…
## $ employed_fulltime_yearround <dbl> 1207, 388, 340, 692, 16697, 1449, 2482, 82…
## $ unemployed                  <dbl> 37, 85, 16, 40, 1672, 400, 308, 33, 4650, …
## $ unemployment_rate           <dbl> 0.0184, 0.1172, 0.0241, 0.0501, 0.0611, 0.…
## $ p25th                       <dbl> 95000, 55000, 50000, 43000, 50000, 50000, …
## $ median                      <dbl> 110000, 75000, 73000, 70000, 65000, 65000,…
## $ p75th                       <dbl> 125000, 90000, 105000, 80000, 75000, 10200…
## $ college_jobs                <dbl> 1534, 350, 456, 529, 18314, 1142, 1768, 97…
## $ non_college_jobs            <dbl> 364, 257, 176, 102, 4440, 657, 314, 500, 1…
## $ low_wage_jobs               <dbl> 193, 50, 0, 0, 972, 244, 259, 220, 3253, 3…

The description of the variables, i.e. the codebook, is given below.

Header	Description
`rank`	Rank by median earnings
`major_code`	Major code, FOD1P in ACS PUMS
`major`	Major description
`major_category`	Category of major from Carnevale et al
`total`	Total number of people with major
`sample_size`	Sample size (unweighted) of full-time, year-round ONLY (used for earnings)
`men`	Male graduates
`women`	Female graduates
`sharewomen`	Women as share of total
`employed`	Number employed (ESR == 1 or 2)
`employed_fulltime`	Employed 35 hours or more
`employed_parttime`	Employed less than 35 hours
`employed_fulltime_yearround`	Employed at least 50 weeks (WKW == 1) and at least 35 hours (WKHP >= 35)
`unemployed`	Number unemployed (ESR == 3)
`unemployment_rate`	Unemployed / (Unemployed + Employed)
`median`	Median earnings of full-time, year-round workers
`p25th`	25th percentile of earnings
`p75th`	75th percentile of earnings
`college_jobs`	Number with job requiring a college degree
`non_college_jobs`	Number with job not requiring a college degree
`low_wage_jobs`	Number in low-wage service jobs

Which major has the lowest unemployment rate?

In order to answer this question all we need to do is sort the data. We use the arrange function to do this, and sort it by the unemployment_rate variable. By default arrange sorts in ascending order, which is what we want here – we’re interested in the major with the lowest unemployment rate.

college_recent_grads %>%
  arrange(unemployment_rate)

## # A tibble: 173 × 21
##     rank major…¹ major major…² total sampl…³   men women share…⁴ emplo…⁵ emplo…⁶
##    <dbl>   <dbl> <chr> <chr>   <dbl>   <dbl> <dbl> <dbl>   <dbl>   <dbl>   <dbl>
##  1    53    4005 Math… Comput…   609       7   500   109   0.179     559     584
##  2    74    3801 Mili… Indust…   124       4   124     0   0           0     111
##  3    84    3602 Bota… Biolog…  1329       9   626   703   0.529    1010     946
##  4   113    1106 Soil… Agricu…   685       4   476   209   0.305     613     488
##  5   121    2301 Educ… Educat…   804       5   280   524   0.652     703     733
##  6    15    2409 Engi… Engine…  4321      30  3526   795   0.184    3608    2999
##  7    20    3201 Cour… Law & …  1148      14   877   271   0.236     930     808
##  8   120    2305 Math… Educat… 14237     123  3872 10365   0.728   13115   11259
##  9     1    2419 Petr… Engine…  2339      36  2057   282   0.121    1976    1849
## 10    65    1100 Gene… Agricu… 10399     158  6053  4346   0.418    8884    7589
## # … with 163 more rows, 10 more variables: employed_parttime <dbl>,
## #   employed_fulltime_yearround <dbl>, unemployed <dbl>,
## #   unemployment_rate <dbl>, p25th <dbl>, median <dbl>, p75th <dbl>,
## #   college_jobs <dbl>, non_college_jobs <dbl>, low_wage_jobs <dbl>, and
## #   abbreviated variable names ¹major_code, ²major_category, ³sample_size,
## #   ⁴sharewomen, ⁵employed, ⁶employed_fulltime

#Five majors have an unemployment rate of zero: Mathematics And Computer Science, Military Technologies, Botany, Soil Science, and Educational Administration And Supervision.

This gives us what we wanted, but not in an ideal form. First, the name of the major barely fits on the page. Second, some of the variables are not that useful (e.g. major_code, major_category) and some we might want front and center are not easily viewed (e.g. unemployment_rate).

We can use the select function to choose which variables to display, and in which order:

college_recent_grads %>%
  arrange(unemployment_rate) %>%
  select(rank, major, unemployment_rate)

## # A tibble: 173 × 3
##     rank major                                      unemployment_rate
##    <dbl> <chr>                                                  <dbl>
##  1    53 Mathematics And Computer Science                     0      
##  2    74 Military Technologies                                0      
##  3    84 Botany                                               0      
##  4   113 Soil Science                                         0      
##  5   121 Educational Administration And Supervision           0      
##  6    15 Engineering Mechanics Physics And Science            0.00633
##  7    20 Court Reporting                                      0.0117 
##  8   120 Mathematics Teacher Education                        0.0162 
##  9     1 Petroleum Engineering                                0.0184 
## 10    65 General Agriculture                                  0.0196 
## # … with 163 more rows
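Because several majors tie at the lowest rate, sorting alone does not say how many rows share the minimum. One possible way to pull out every tied row is sketched below on toy data (the majors and rates are made up):

```r
library(dplyr)

# Toy data (made-up values): two majors tie for the lowest rate
toy <- tibble(
  major = c("A", "B", "C"),
  unemployment_rate = c(0, 0.05, 0)
)

# Keep every row whose rate equals the minimum, so ties are not hidden
lowest <- toy %>%
  filter(unemployment_rate == min(unemployment_rate))
lowest
```

If the column contained missing values, min() would need na.rm = TRUE for this comparison to work.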

Ok, this is looking better, but do we really need all those decimal places in the unemployment variable? Not really!

1a. Round unemployment_rate: We create a new variable with the mutate function. In this case, we’re overwriting the existing unemployment_rate variable by rounding it to 1 decimal place. Incomplete code is given below to guide you in the right direction; however, you will need to fill in the blanks.

college_recent_grads %>%
  arrange(unemployment_rate) %>%
  select(rank, major, unemployment_rate) %>%
  mutate(unemployment_rate = round( unemployment_rate, 1))

## # A tibble: 173 × 3
##     rank major                                      unemployment_rate
##    <dbl> <chr>                                                  <dbl>
##  1    53 Mathematics And Computer Science                           0
##  2    74 Military Technologies                                      0
##  3    84 Botany                                                     0
##  4   113 Soil Science                                               0
##  5   121 Educational Administration And Supervision                 0
##  6    15 Engineering Mechanics Physics And Science                  0
##  7    20 Court Reporting                                            0
##  8   120 Mathematics Teacher Education                              0
##  9     1 Petroleum Engineering                                      0
## 10    65 General Agriculture                                        0
## # … with 163 more rows
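The second argument of round controls how much detail survives. A small sketch with a few of the rates printed earlier (as displayed, so only approximately the stored values) shows why one decimal place turns all of them into 0:

```r
# A few unemployment rates as printed in the sorted table (approximate values)
rates <- c(0, 0.00633, 0.0117, 0.0184)

# One decimal place: every rate below 0.05 rounds to 0
one_digit <- round(rates, 1)

# Three decimal places keep the rates distinguishable
three_digits <- round(rates, 3)

one_digit
three_digits
```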

Which major has the highest percentage of women?

To answer such a question we need to arrange the data in descending order. For example, if earlier we were interested in the major with the highest unemployment rate, we would use the following:

The desc function specifies that we want unemployment_rate in descending order.

college_recent_grads %>%
  arrange(desc(unemployment_rate)) %>%
  select(rank, major, unemployment_rate)

## # A tibble: 173 × 3
##     rank major                                      unemployment_rate
##    <dbl> <chr>                                                  <dbl>
##  1     6 Nuclear Engineering                                    0.177
##  2    90 Public Administration                                  0.159
##  3    85 Computer Networking And Telecommunications             0.152
##  4   171 Clinical Psychology                                    0.149
##  5    30 Public Policy                                          0.128
##  6   106 Communication Technologies                             0.120
##  7     2 Mining And Mineral Engineering                         0.117
##  8    54 Computer Programming And Data Processing               0.114
##  9    80 Geography                                              0.113
## 10    59 Architecture                                           0.113
## # … with 163 more rows

1b. Using what you’ve learned so far, arrange the data in descending order with respect to proportion of women in a major, and display only the major, the total number of people with major, and proportion of women. Show only the top 3 majors by adding head(3) at the end of the pipeline.

college_recent_grads %>%
  arrange(desc(sharewomen)) %>%
  select(major, total, sharewomen) %>% 
  head(3)

## # A tibble: 3 × 3
##   major                                         total sharewomen
##   <chr>                                         <dbl>      <dbl>
## 1 Early Childhood Education                     37589      0.969
## 2 Communication Disorders Sciences And Services 38279      0.968
## 3 Medical Assisting Services                    11123      0.928
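As an alternative to arrange(desc(...)) followed by head(), dplyr (version 1.0.0 or later) also offers slice_max, which keeps the rows with the largest values of a column. A minimal sketch on toy data with made-up shares:

```r
library(dplyr)

# Toy data (made-up values)
toy <- tibble(
  major = c("A", "B", "C", "D"),
  sharewomen = c(0.35, 0.97, 0.12, 0.80)
)

# Keep the two rows with the largest sharewomen
top2 <- toy %>%
  slice_max(sharewomen, n = 2)
top2
```

Unlike head(), slice_max keeps tied rows by default, so it can return more than n rows when values tie.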

How do the distributions of median income compare across major categories?

A percentile is a measure used in statistics indicating the value below which a given percentage of observations in a group of observations falls. For example, the 20th percentile is the value below which 20% of the observations may be found. (Source: Wikipedia)

There are three types of incomes reported in this data frame: p25th, median, and p75th. These correspond to the 25th, 50th, and 75th percentiles of the income distribution of sampled individuals for a given major.
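To make the three percentiles concrete, here is a small sketch that computes them for a made-up vector of five incomes with base R's quantile function (using its default interpolation method):

```r
# Made-up incomes, for illustration only
incomes <- c(10, 20, 30, 40, 50)

# 25th, 50th (median), and 75th percentiles
q <- quantile(incomes, probs = c(0.25, 0.5, 0.75))
q
```

For these five values the three percentiles are 20, 30, and 40.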

The question we want to answer is “How do the distributions of median income compare across major categories?”. We need to do a few things to answer this question: First, we need to group the data by major_category. Then, we need a way to summarize the distributions of median income within these groups. This decision will depend on the shapes of these distributions. So first, we need to visualize the data.

1c. Let’s start simple and take a look at the distribution of all median incomes using geom_histogram, without considering the major categories.

ggplot(data=college_recent_grads, mapping=aes(x=median))+geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

1d. Try binwidths of $1000 and $5000 and choose one. Explain your reasoning for your choice.

ggplot(data=college_recent_grads, mapping=aes(x=median))+geom_histogram(binwidth=1000)

ggplot(data=college_recent_grads, mapping=aes(x=median))+geom_histogram(binwidth=5000)

#I prefer binwidth=1000 because it shows the distribution of median income in more detail; with binwidth=5000 the bars are too wide and it is hard to see the details of the distribution.
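One way to reason about the choice is to count roughly how many bars each binwidth produces. The sketch below uses the minimum and maximum of median in this data (22,000 and 110,000). The counts are approximate, since ggplot2 aligns its bins by its own rule:

```r
# Range of the median variable in this data set
income_range <- 110000 - 22000

# Approximate number of bins for each candidate binwidth
bins_1000 <- ceiling(income_range / 1000)
bins_5000 <- ceiling(income_range / 5000)

bins_1000  # 88
bins_5000  # 18
```

With 173 majors, 88 bins leaves about two observations per bin on average, while 18 bins leaves about ten. This is the trade-off between detail and smoothness to weigh when explaining the choice.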

We can also calculate summary statistics for this distribution using the summarise function:

college_recent_grads %>%
  summarise(min = min(median), max = max(median),
            mean = mean(median), med = median(median),
            sd = sd(median), 
            q1 = quantile(median, probs = 0.25),
            q3 = quantile(median, probs = 0.75))

## # A tibble: 1 × 7
##     min    max   mean   med     sd    q1    q3
##   <dbl>  <dbl>  <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1 22000 110000 40151. 36000 11470. 33000 45000

1e. Based on the shape of the histogram you created in 1d, determine which of the summary statistics above (min, max, mean, med, sd, q1, q3) are useful for describing the distribution. Write up your description and include the summary statistic output as well.

#The histogram is right-skewed, with a long tail toward high incomes; this is consistent with the mean (about 40k) being larger than the median (36k). Because of the skew, the median together with q1 (33k) and q3 (45k) describes the center and spread best, since the mean and sd are pulled upward by the few very high values. The minimum (22k) and maximum (110k) show the full range, but the maximum is an outlier.
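A small sketch with made-up incomes (in thousands) illustrates why the mean and the median respond differently to a long right tail:

```r
# Made-up incomes in thousands; the last value is a high outlier
with_outlier <- c(30, 32, 35, 36, 40, 110)
without_outlier <- c(30, 32, 35, 36, 40)

# The mean moves a lot when the outlier is added ...
mean(without_outlier)    # 34.6
mean(with_outlier)       # about 47.2

# ... while the median barely changes
median(without_outlier)  # 35
median(with_outlier)     # 35.5
```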

1f. Next, we facet the plot by major category. Plot the distribution of median income using a histogram, faceted by major_category. Use the binwidth you chose in 1d.

ggplot(data=college_recent_grads, mapping=aes(x=median))+geom_histogram(binwidth=1000)+facet_wrap(~major_category)

1g. Which major category is the most popular in this sample? To answer this question we create a summary table using group_by and summarise, which first groups the data, then counts the number of observations in each category and stores the counts in a column named n. Arrange the results in descending order so that the major category with the most observations is on top.

college_recent_grads %>% 
  group_by(major_category) %>% 
  summarise(Count=n()) %>% 
  arrange(desc(Count))

## # A tibble: 16 × 2
##    major_category                      Count
##    <chr>                               <int>
##  1 Engineering                            29
##  2 Education                              16
##  3 Humanities & Liberal Arts              15
##  4 Biology & Life Science                 14
##  5 Business                               13
##  6 Health                                 12
##  7 Computers & Mathematics                11
##  8 Agriculture & Natural Resources        10
##  9 Physical Sciences                      10
## 10 Psychology & Social Work                9
## 11 Social Science                          9
## 12 Arts                                    8
## 13 Industrial Arts & Consumer Services     7
## 14 Law & Public Policy                     5
## 15 Communications & Journalism             4
## 16 Interdisciplinary                       1

#The most popular major category in this sample is Engineering, with 29 majors.
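The group_by, summarise, and arrange steps can also be written with dplyr's count function, which stores the counts in a column named n. A minimal sketch on toy data (the rows are made up):

```r
library(dplyr)

# Toy data: one row per major, with its category
toy <- tibble(
  major_category = c("Engineering", "Education", "Engineering",
                     "Arts", "Engineering", "Education")
)

# count() groups, counts, and (with sort = TRUE) orders by descending n
counts <- toy %>%
  count(major_category, sort = TRUE)
counts
```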

What types of majors do women tend to major in?

First, let’s create a new vector called stem_categories that lists the major categories that are considered STEM fields.

stem_categories = c("Biology & Life Science",
                     "Computers & Mathematics",
                     "Engineering",
                     "Physical Sciences")

Then, we can use this to create a new variable in our data frame indicating whether a major is STEM or not.

college_recent_grads = college_recent_grads %>%
  mutate(major_type = ifelse(major_category %in% stem_categories, "stem", "not stem"))

Let’s unpack this: with mutate we create a new variable called major_type, which is defined as "stem" if the major_category is in the vector called stem_categories we created earlier, and as "not stem" otherwise.
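To see the two pieces separately, the sketch below applies %in% and ifelse to a short made-up vector of categories; stem_categories is redefined here only so the example runs on its own:

```r
stem_categories <- c("Biology & Life Science",
                     "Computers & Mathematics",
                     "Engineering",
                     "Physical Sciences")

# Made-up input, for illustration only
example_categories <- c("Engineering", "Arts", "Physical Sciences")

# %in% returns one TRUE/FALSE per element
is_stem <- example_categories %in% stem_categories
is_stem     # TRUE FALSE TRUE

# ifelse maps TRUE to the first label and FALSE to the second
major_type <- ifelse(is_stem, "stem", "not stem")
major_type  # "stem" "not stem" "stem"
```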

1h. Create a scatterplot of median income vs. proportion of women in that major, colored by whether the major is in a STEM field or not. Describe the association between these three variables.

ggplot(data=college_recent_grads, 
       mapping=aes(x=median, 
                   y=sharewomen, 
                   color=major_type)
                   ) +
  geom_point()+facet_wrap(~major_category)

## Warning: Removed 1 rows containing missing values (`geom_point()`).

ggplot(data=college_recent_grads, 
       mapping=aes(x=median, 
                   y=sharewomen, 
                   color=major_type)
                   ) +
  geom_point()

## Warning: Removed 1 rows containing missing values (`geom_point()`).

#The scatterplot shows that women are concentrated in non-STEM majors: many of these majors have a share of women above 50%, and they also tend to have lower median incomes (below 40,000).
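The warning above means that one row has a missing value in one of the plotted variables, so geom_point drops it. A sketch of how such rows could be located or removed before plotting, shown on toy data with made-up values:

```r
library(dplyr)

# Toy data (made-up values): major "B" has a missing share
toy <- tibble(
  major = c("A", "B", "C"),
  median = c(40000, 35000, 30000),
  sharewomen = c(0.2, NA, 0.7)
)

# Rows that would be dropped from the scatterplot
missing_rows <- toy %>%
  filter(is.na(median) | is.na(sharewomen))

# Rows with both plotted variables present
complete_rows <- toy %>%
  filter(!is.na(median), !is.na(sharewomen))

missing_rows
complete_rows
```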

1i. We can also use logical operators to filter our data for STEM majors whose median earnings are less than the median across all majors, which we found to be $36,000 earlier. Your output should show only the major name and the median, 25th percentile, and 75th percentile earnings for that major, sorted so that the major with the lowest median earnings is on top.

college_recent_grads %>% 
  filter(major_type == "stem", median < 36000) %>% 
  select(major, median, p25th, p75th) %>% 
  arrange(median)

2. Modeling the burritos of San Diego

First, you can load the data using the following.

burrito = read_csv('https://raw.githubusercontent.com/BUSN32100/data_files/master/burrito.csv')

## Rows: 328 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (4): Location, Burrito, Reviewer, Rec
## dbl (11): Hunger, Tortilla, Temp, Meat, Fillings, MeatToFilling, Uniformity,...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Data wrangling and visualization

2a. Create a new variable called core_avg that is the average score of the core dimensions of a burrito, except Cost. Add this new variable to the burrito data frame. Do this in one pipe, using the rowwise function. Incomplete code is given below to guide you in the right direction; however, you will need to fill in the blanks.

The rowwise function is useful for applying mathematical operations to each row.

Core dimensions of a burrito:
2. Tortilla quality
3. Temperature
4. Meat quality
5. Non-meat filling quality
6. Meat to filling ratio
7. Uniformity
8. Salsa quality
9. Wrap integrity

burrito = burrito %>%
  rowwise() %>%
  mutate(core_avg = mean( c(Tortilla, Temp, Meat, Fillings, MeatToFilling, Uniformity, Salsa, Wrap) )) %>%
  ungroup()

Note that we end the pipeline with ungroup() to remove the effect of the rowwise function from earlier in the pipeline. The rowwise function works a lot like group_by (we will talk about this next week), except that it groups the data frame one row at a time, so any operation applied to the data frame is done once per row. This is helpful for finding the mean core dimension rating for each row. However, in the remainder of the analysis we don’t want to, say, calculate summary statistics for each row, or fit a model for each row. Hence we need to undo the effect of rowwise, which we can do with ungroup.
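The effect of rowwise is easiest to see by comparing the same mutate with and without it. A minimal sketch on a toy data frame with made-up scores and only three of the dimensions:

```r
library(dplyr)

# Toy scores (made-up values)
scores <- tibble(
  Tortilla = c(3, 4),
  Temp     = c(5, 2),
  Meat     = c(4, 3)
)

# Without rowwise: mean() pools all six values, so every row gets 3.5
without_rowwise <- scores %>%
  mutate(avg = mean(c(Tortilla, Temp, Meat)))

# With rowwise: mean() is computed once per row (4 and 3)
with_rowwise <- scores %>%
  rowwise() %>%
  mutate(avg = mean(c(Tortilla, Temp, Meat))) %>%
  ungroup()

without_rowwise$avg
with_rowwise$avg
```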

2b. Visualize the distribution of overall. Is the distribution skewed? What does that tell you about how reviewers rate burritos? Is this what you expected to see? Why, or why not? Include any summary statistics and visualizations you use in your response.

ggplot(data=burrito, mapping=aes(x=overall))+geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

burrito %>%
  summarise(min = min(overall), max = max(overall),
            mean = mean(overall), med = median(overall),
            sd = sd(overall), 
            q1 = quantile(overall, probs = 0.25),
            q3 = quantile(overall, probs = 0.75))

## # A tibble: 1 × 7
##     min   max  mean   med    sd    q1    q3
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1   1.5     5  3.57  3.72 0.689     3     4

#The histogram shows that the distribution is skewed to the left: most ratings are high, with a tail of low ratings. This means that, on average, reviewers give burritos good scores. I would have expected fewer positive reviews, with ratings centered around 3.
#The summary statistics are consistent with the histogram: the mean (3.57) is below the median (3.72), as expected for a left-skewed distribution. The sd is 0.69, and the middle half of the ratings lies between 3 (q1) and 4 (q3).

2c. Visualize with a scatterplot and describe the relationship between overall and the new variable you created, core_avg.

Hint: See the help page for the function at http://ggplot2.tidyverse.org/reference/index.html.

ggplot(data=burrito, 
       mapping=aes(x=overall, 
                   y=core_avg)) +
  geom_point()

#There is a positive relationship between core_avg and the overall rating: burritos with a low core_avg also tend to have a low overall rating. This suggests that reviewers put heavy weight on what we define as the core dimensions of a burrito when giving an overall rating.
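The strength of a linear relationship like this can be summarized with a correlation coefficient. The sketch below uses base R's cor on made-up ratings, not the burrito data, so the number only illustrates the call:

```r
# Made-up ratings, for illustration only
overall  <- c(2.0, 3.0, 3.5, 4.0, 4.5)
core_avg <- c(2.4, 3.1, 3.3, 4.2, 4.4)

# Pearson correlation: values near 1 indicate a strong positive linear relationship
r <- cor(overall, core_avg)
r
```

On the real data, the same call could be made inside summarise, for example summarise(r = cor(overall, core_avg)). This assumes neither column has missing values.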

Homework 2

BUS 32100

Due 11:59PM Apr 12 2023

Instructions: