Question 1: Read the data into R and turn it into a tidy dataset.
Answer: To start this question, I ran library(tidyverse) and then library(readxl). Next, I wrote the code that has R read the dataset. I assigned the result to the name tuition so I could refer to the dataset by that name in all later code, and I used the read_excel function because the file is an Excel workbook. I then gave R the path to the file, which for me is Dropbox -> DATA101 -> Data -> Raw -> us-avg-tuition.xlsx. This is slightly different from the file on Canvas because of the technology issues this week: I had to rename the file to use dashes instead of underscores so my system didn’t flag it as a duplicate. Running these lines added the dataset to my environment as tuition. After that, I started the next step with tuition to tell R that everything that follows relates to my us-avg-tuition dataset. The data needed cleaning because it was spread across many columns, which makes coding more difficult than it needs to be. I used the pivot_longer function because the dataset has one variable (year) spread across multiple columns. Instead of listing every year column, I used a colon to tell R to include all of the columns from 2004 through 2015. Then I told R where the names and values should go: the column names go into a year column and the values go into a tuition column.
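The reshaping step described for Question 1 can be sketched as follows. This is only an illustration under stated assumptions: the small tibble stands in for the result of read_excel(), its state names and values are made up, and the year columns are assumed to be named `2004`, `2005`, and so on. The output names year and tuition match the columns used in the later code.

```r
library(dplyr)  # tibble() and the %>% pipe
library(tidyr)  # pivot_longer()

# Hypothetical stand-in for the data returned by read_excel(); values are made up.
tuition <- tibble(
  State  = c("State A", "State B"),
  `2004` = c(5000, 8000),
  `2005` = c(5200, 8100),
  `2006` = c(5500, 8300)
)

# `2004`:`2006` selects every column from the first name through the last,
# so the individual year columns do not have to be listed one by one.
tuition_long <- tuition %>%
  pivot_longer(cols = `2004`:`2006`, names_to = "year", values_to = "tuition")

print(tuition_long)
```

Because the year values come from column names, pivot_longer() stores them as text, which is why a later step may need as.numeric() before plotting year on a continuous axis.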
Question 2: Calculate the average tuition in each state across all years in the data, and sort it from highest to lowest. Do you notice any patterns?
Answer: The first thing I did was give my result a name. I chose avg_st_tuition so that I could remember what it holds and how I got it throughout the assignment. Then I grouped the tidy data by state, which makes State the first column of the table. Next, I found the mean for each state in a summarise line: avg_state_tuition = mean(tuition). Then I arranged the data so that the states with the highest average tuition come first. Finally, I used the print function to display the table.
Analysis: A pattern I noticed is that states closer to the east coast tend to have a higher tuition average. Perhaps this is because the eight Ivy League schools are all in the Northeast. A few observations from the table are consistent with this hypothesis. New Jersey, home to Princeton University, has the 3rd highest tuition average. Pennsylvania, home to the University of Pennsylvania, has the 4th highest. Massachusetts, home to Harvard University, is 9th. This is only one of the many factors that make up tuition costs and, consequently, the averages per state. The Ivy League alone cannot explain why three of its home states are in the top 10; however, it is a pattern worth noting.
avg_st_tuition <- tuition_long %>%
  group_by(State) %>%
  summarise(avg_state_tuition = mean(tuition)) %>%
  arrange(-avg_state_tuition)
print(avg_st_tuition)
## # A tibble: 50 × 2
## State avg_state_tuition
## <chr> <dbl>
## 1 Vermont 13067.
## 2 New Hampshire 12781.
## 3 New Jersey 12054.
## 4 Pennsylvania 11970.
## 5 Illinois 11228.
## 6 Michigan 10477.
## 7 South Carolina 10377.
## 8 Delaware 10099.
## 9 Massachusetts 10058.
## 10 Ohio 9942.
## # ℹ 40 more rows
Question 3: Plot the average tuition in each state using a bar chart, and arrange it from highest to lowest. HINT: you’ll need to read the documentation for geom_bar to figure out what options to use. And to arrange the states in your plot, you’ll need to look up the reorder option in ggplot.
Answer: First, I told R which data to pull from, in this case avg_st_tuition, and then I began coding the bar graph. Starting with ggplot() +, I reordered the State variable so the bars go in descending order of average tuition, then continued with the rest of my aesthetics. I added red points at the top of each bar to make the chart easier to read, since there are 50 bars on one graph. Then I labeled the axes with xlab and ylab and used ggtitle to give the graph a title at the top. To make the graph more legible, and to keep the state names from overlapping each other, I used theme(axis.text.x = element_text(angle = 90, hjust = 1)). This makes the state names easy to read because of their vertical orientation.
Data analysis: Based on the graph, Wyoming has the lowest tuition average of the 50 states and Vermont has the highest. I notice a relationship between average tuition and a state’s political affiliation. Most of the Republican states are on the right of the graph, meaning that many Republican states have lower average tuition costs. I was also expecting to see a relation between a state’s coastal location and its average tuition. However, there does not appear to be a relation between these two variables. Of course, these observations could be irrelevant, because tuition averages are an accumulation of multiple factors that are beyond the scope of our dataset, but these patterns could also have merit in predicting future tuition averages.
avg_st_tuition %>%
  ggplot() +
  geom_bar(aes(x = reorder(State, -avg_state_tuition), y = avg_state_tuition), stat = "identity") +
  stat_summary(fun = "mean", geom = "point", aes(x = reorder(State, -avg_state_tuition), y = avg_state_tuition), color = "red", size = 3) +
  xlab("State") +
  ylab("Average State Tuition") +
  ggtitle("Average State Tuition Across 11 Years") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
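To see why reorder(State, -avg_state_tuition) puts the highest average on the left of the chart, here is a minimal sketch using base R only. The labels and numbers are hypothetical and chosen only to show the ordering; they are not taken from the tuition data.

```r
# Hypothetical labels and values, used only to illustrate the ordering.
state <- c("A", "B", "C")
avg   <- c(1, 3, 2)

# reorder() sorts the factor levels by increasing value of its second
# argument, so negating the values puts the largest average first.
ordered_state <- reorder(state, -avg)

levels(ordered_state)
# "B" "C" "A"
```

ggplot draws a discrete x axis in level order, so the first level ("B", the largest value here) appears leftmost and the smallest value appears rightmost.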
Question 4: Does the average tuition cost decrease for any state? Which states have the largest/smallest increases over this period? To adjust for the fact that states have very different tuition costs in 2004, calculate this in terms of percentage change. HINT: Think about how arrange() combined with first() and last() would be helpful to calculating this change.
Answer: To begin this question, I created a new object, avg_state_tuition_pct. Inside summarise, I took the first and last tuition value for each state and calculated the percentage change as (last - first) / first * 100. My final steps were to arrange the table and print it.
Data Analysis: The only state whose tuition average decreased was Ohio, with -1.8% (-1.754094% unrounded). This means that from 2004 to 2015, the average cost of tuition in Ohio went down. This is useful information for a data analyst trying to predict future tuition in this state. Maryland had the smallest increase in its tuition average, 7.4% (7.417266% unrounded). The state with the largest increase over these eleven years was Hawaii, with an increase of 138.5% (138.477470% unrounded). We could attribute these increases, and Ohio’s decrease, to changes in political leadership, student needs, or other factors. These explanations are only speculation, but they could hold some merit.
avg_state_tuition_pct <- tuition_long %>%
  group_by(State) %>%
  summarise(
    starting_tuition = first(tuition),
    current_tuition = last(tuition),
    avg_state_tuition_pct = ((last(tuition) - first(tuition)) / first(tuition)) * 100) %>%
  arrange(-avg_state_tuition_pct)
print(avg_state_tuition_pct, n = 50)
## # A tibble: 50 × 4
## State starting_tuition current_tuition avg_state_tuition_pct
## <chr> <dbl> <dbl> <dbl>
## 1 Hawaii 4267. 10175. 138.
## 2 Colorado 4704. 9748. 107.
## 3 Arizona 5138. 10646. 107.
## 4 Georgia 4298. 8447. 96.5
## 5 Nevada 3621. 6667. 84.1
## 6 Louisiana 4453. 7871. 76.8
## 7 California 5286. 9270. 75.4
## 8 Alabama 5683. 9751. 71.6
## 9 Tennessee 5426. 9263. 70.7
## 10 Kentucky 5640. 9567. 69.6
## 11 Virginia 7030. 11819. 68.1
## 12 Oklahoma 4454. 7450. 67.2
## 13 Washington 6192. 10288. 66.2
## 14 Florida 3848. 6360. 65.3
## 15 Illinois 8183. 13189. 61.2
## 16 Kansas 5345. 8530. 59.6
## 17 West Virginia 4575. 7171. 56.7
## 18 North Carolina 4493. 6973. 55.2
## 19 Utah 4125. 6363. 54.2
## 20 Rhode Island 7476. 11390. 52.4
## 21 Alaska 4328. 6571. 51.8
## 22 Michigan 7931. 11991. 51.2
## 23 Idaho 4525. 6818. 50.7
## 24 New Hampshire 10188. 15160. 48.8
## 25 South Dakota 5479. 8055. 47.0
## 26 Connecticut 7984. 11397. 42.8
## 27 Texas 6395. 9117. 42.6
## 28 Oregon 6579. 9371. 42.4
## 29 Mississippi 5029. 7147. 42.1
## 30 South Carolina 8330. 11816. 41.8
## 31 Delaware 8353. 11676. 39.8
## 32 Arkansas 5772. 7867. 36.3
## 33 Maine 7058. 9573. 35.6
## 34 Vermont 11067. 14993. 35.5
## 35 Wisconsin 6575. 8815. 34.1
## 36 Minnesota 8144. 10831. 33.0
## 37 North Dakota 5804. 7688. 32.5
## 38 New Jersey 10054. 13303. 32.3
## 39 Massachusetts 8863. 11588. 30.7
## 40 New Mexico 4926. 6355. 29.0
## 41 Pennsylvania 10394. 13395. 28.9
## 42 Nebraska 5947. 7608. 27.9
## 43 Indiana 7368. 9120. 23.8
## 44 New York 6235. 7644. 22.6
## 45 Wyoming 4086. 4891. 19.7
## 46 Iowa 6813. 7877. 15.6
## 47 Missouri 7477. 8564. 14.5
## 48 Montana 5630. 6351. 12.8
## 49 Maryland 8531. 9163. 7.42
## 50 Ohio 10378. 10196. -1.75
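The percentage-change calculation above relies on each state's rows already being in year order, since first() and last() simply take the first and last row of each group. A minimal sketch of the same idea with the ordering made explicit is shown below. The toy data, the object names toy_long and toy_pct, and the column name pct_change are hypothetical, and the rows are deliberately shuffled to show why the arrange() step from the hint matters.

```r
library(dplyr)

# Hypothetical long-format data; state names and values are made up.
toy_long <- tibble(
  State   = rep(c("State A", "State B"), each = 3),
  year    = rep(c(2006, 2004, 2005), times = 2),  # deliberately out of order
  tuition = c(6000, 5000, 5500, 7600, 8000, 7900)
)

toy_pct <- toy_long %>%
  arrange(State, year) %>%  # ensures first() is the earliest year and last() the latest
  group_by(State) %>%
  summarise(
    starting_tuition = first(tuition),
    current_tuition  = last(tuition),
    pct_change = (current_tuition - starting_tuition) / starting_tuition * 100
  ) %>%
  arrange(desc(pct_change))

print(toy_pct)
```

In this toy data, State A rises from 5000 to 6000 (a 20% increase) and State B falls from 8000 to 7600 (a 5% decrease), so State A is listed first.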
avg_state_tuition_abs <- tuition_long %>%
  group_by(State) %>%
  summarise(
    starting_tuition = first(tuition),
    current_tuition = last(tuition),
    avg_state_tuition_abs = last(tuition) - first(tuition)) %>%
  arrange(-avg_state_tuition_abs)
print(avg_state_tuition_abs, n = 50)
## # A tibble: 50 × 4
## State starting_tuition current_tuition avg_state_tuition_abs
## <chr> <dbl> <dbl> <dbl>
## 1 Hawaii 4267. 10175. 5908.
## 2 Arizona 5138. 10646. 5508.
## 3 Colorado 4704. 9748. 5044.
## 4 Illinois 8183. 13189. 5006.
## 5 New Hampshire 10188. 15160. 4972.
## 6 Virginia 7030. 11819. 4789.
## 7 Georgia 4298. 8447. 4149.
## 8 Washington 6192. 10288. 4096.
## 9 Alabama 5683. 9751. 4068.
## 10 Michigan 7931. 11991. 4060.
## 11 California 5286. 9270. 3984.
## 12 Kentucky 5640. 9567. 3927.
## 13 Vermont 11067. 14993. 3926.
## 14 Rhode Island 7476. 11390. 3914.
## 15 Tennessee 5426. 9263. 3838.
## 16 South Carolina 8330. 11816. 3486.
## 17 Louisiana 4453. 7871. 3418.
## 18 Connecticut 7984. 11397. 3414.
## 19 Delaware 8353. 11676. 3323.
## 20 New Jersey 10054. 13303. 3249.
## 21 Kansas 5345. 8530. 3185.
## 22 Nevada 3621. 6667. 3046.
## 23 Pennsylvania 10394. 13395. 3001.
## 24 Oklahoma 4454. 7450. 2995.
## 25 Oregon 6579. 9371. 2793.
## 26 Massachusetts 8863. 11588. 2725.
## 27 Texas 6395. 9117. 2722.
## 28 Minnesota 8144. 10831. 2688.
## 29 West Virginia 4575. 7171. 2596.
## 30 South Dakota 5479. 8055. 2576.
## 31 Maine 7058. 9573. 2515.
## 32 Florida 3848. 6360. 2512.
## 33 North Carolina 4493. 6973. 2480.
## 34 Idaho 4525. 6818. 2294.
## 35 Alaska 4328. 6571. 2243.
## 36 Wisconsin 6575. 8815. 2240.
## 37 Utah 4125. 6363. 2237.
## 38 Mississippi 5029. 7147. 2118.
## 39 Arkansas 5772. 7867. 2095.
## 40 North Dakota 5804. 7688. 1884.
## 41 Indiana 7368. 9120. 1752.
## 42 Nebraska 5947. 7608. 1661.
## 43 New Mexico 4926. 6355. 1429.
## 44 New York 6235. 7644. 1409.
## 45 Missouri 7477. 8564. 1087.
## 46 Iowa 6813. 7877. 1064.
## 47 Wyoming 4086. 4891. 805.
## 48 Montana 5630. 6351. 721.
## 49 Maryland 8531. 9163. 633.
## 50 Ohio 10378. 10196. -182.
Question 5: Let’s get a closer look at a few states and their tuition trends. Explore the data and choose five states that have an interesting pattern. Justify why you chose these states.
Answer: I chose the five states with the largest tuition increases when measured as absolute (dollar) change. I felt this would be interesting to look at because I wanted to see how close these states’ increases were to one another. If I were a data analyst, I would also look at the bottom five to see if there is a relationship between the top and bottom five. For now, I looked at the top five and compared them with each other.
top_five_avg_state_tuition <- tuition_long %>%
  filter(State %in% c("Hawaii", "Arizona", "Colorado", "Illinois", "New Hampshire"))
top_five_avg_state_tuition$year <- as.numeric(top_five_avg_state_tuition$year)
ggplot(top_five_avg_state_tuition, aes(x = year, y = tuition, color = State)) +
  geom_line()
BONUS QUESTION: Show the tuition over time for all 50 states all in one image, with each state in its own little graph. Make the line for the state that you are from (or pick a random state if you’re not from the US) a different color than all the other states.
Answer: I am from New Jersey, so New Jersey is the red line in the graph below.
# group = State connects each state's points into one line even though year is stored as text
ggplot(tuition_long, aes(x = year, y = tuition, color = State, group = State)) +
  geom_line() +
  # Only New Jersey is given a color; every other state falls back to na.value
  scale_color_manual(values = c("New Jersey" = "red"), na.value = "grey50") +
  labs(title = "National Tuition, 2004-2015", x = "Year", y = "Tuition") +
  facet_wrap(~State, nrow = 5) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 0.5),
        strip.text.x = element_text(size = 6))