Problem 1: Download new Trump approval data using the url for the 538 website: https://projects.fivethirtyeight.com/trump-approval-data/approval_topline.csv. Be sure to download it from this link directly into R - do not download the .csv then load from your files. Use read_csv() and just paste in the url surrounded by “quotes”! Plot the approval (approve_estimate) of the President using data that aggregates all polls (check the subgroup variable). It might also be helpful to add the disapproval estimate to the same plot for extra context. Answer: Before I do anything else, I load in my tiduverse, scales, lubricate, ggplot2, and readr using the library() function for each. Then I told R to read the two pieces of new data that we will need throughout the assignment. You can see below that I created two new dataframes called trump_approval_ratings and new_trump. You may notice that while I am ultimately telling R the same outcome, pulling in the information, the data is coming from two different places. Trump_approval_ratings is coming from a file in my raw data Dropbox folder, but the new_trump data is coming from a website, so my data is different to reflect its respective locations. Then for this part of the question, I needed the new_trump data so I piped that into the rest of the coding. I filtered the subgroup column of data to be All Polls then grouped it by modeldate. Then came the mathematical equations where I needed to find the mean of the approve estimate and disapprove estimate so I did that within a summarise, then mutated the modeldate to be in month, day, year format. Once I have all of this coded, I can start with the plot. Since I need two lines of data, one for approval and one for disapproval, I need geom_point twice, each mapping its own data. Once that is done, I need to do a scale_color_manual so each line is blue or red for its respective information, but also matches the legend. Then, I need to add titles and axis names so it can be easy to read based on just the graph alone. At the end of assigning the labs, I told R that the color signifies the variable (approval or disapproval). The next part is vital to securing the x axis element legibility. To do this I did the scale_x_date and add in breaks for each year. Then to finish the plot, I adjust each text part of the plot to the desired size and color where needed. Data Analysis: For this data analysis, I need to read the plot. In the beginning of January 2017 the approval and disapproval averages were fairly close to each other, but that did not last long into the new year of 2017. His approval rate decreases and his disapproval rate increased which could be due to the rough start of his presidency. We will get into this part of the analysis at the end. We see a slight increase in his approval and slight decrease in disapproval half way into the year of 2018, which other than superficial fluctuations remains static the remainder of his presidency. NOTE: by inferring this graph is showing a static fluctuation in approval and disapproval rates is not to say there was not any changes because this data is always moving up or down, but there are no significant drops or its adverse the remainder of the time displayed on the graph (January 2021). One part of the graph that I want to bring attention to is the year of 2017 and the beginning of 2018 where we see that his approval rate went down consequentally making his disapproval rate increase. These types of fluctations is usually the responsibility of several factors. First, there was his attempt to repeal and replace the healthcare legislation put into affect by Obama, which failed. Additionally, he signed an executive order preventing people from entering the country if immigrating from seven muslim countries. There were multiple avenues and angles that can be taken as to why he became more disliked during the first year of his presidency. As we can see from the plot, he gained back his followers trusts during the remainder of his presidency. This can also be analyzed! Further into his presidency, the pandemic started, which also started the stimulus checks. While people were not thrilled about the pandemic, they were thrilled to be receiving the checks in the mail to help provide for their families. Of course these are not the only reasons, but not all data can be analyzed from this plot. We cannot depict the personal reasons people had for their side; however, we can look at these key events (and others) to develop an understanding as to why the data looks the way that it does. —
trump_approval_ratings <- read_csv("~/Dropbox/DATA101/Data/Raw/538_Trump_Approval.csv")
## Rows: 11667 Columns: 22
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (12): president, subgroup, modeldate, startdate, enddate, pollster, grad...
## dbl (9): samplesize, weight, influence, approve, disapprove, adjusted_appro...
## lgl (1): tracking
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
new_trump <- read_csv("https://projects.fivethirtyeight.com/trump-approval-data/approval_topline.csv")
## Rows: 4377 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): president, subgroup, modeldate, timestamp
## dbl (6): approve_estimate, approve_hi, approve_lo, disapprove_estimate, disa...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
new_trump %>%
filter(subgroup == "All polls") %>%
group_by(modeldate) %>%
summarise(approve_estimate_average = mean(approve_estimate),
disapprove_estimate_average = mean(disapprove_estimate))%>%
mutate(modeldate = mdy(modeldate))%>%
ggplot() +
geom_point(mapping = aes(x = modeldate, y = approve_estimate_average, color = "Approval Estimate")) +
geom_point(mapping = aes(x = modeldate, y = disapprove_estimate_average, color = "Disapproval Estimate")) +
scale_color_manual(values = c("Approval Estimate" = "blue", "Disapproval Estimate" = "red")) +
labs(title = "Trump Approval and Disapproval Estimates",
x = "Dates",
y = "Estimate Averages",
color = "Estimate Type") +
scale_x_date(date_breaks = "year", date_labels = "%b %Y")+
theme(axis.text.x = element_text(size = 8)) +
theme(legend.position = c("bottom"),
legend.title = element_text(size = 12),
legend.text = element_text(size = 10),
axis.text.y = element_text(size = 10, color = "black"),
text = element_text(size = 12, color = "black"))
Question 1a Next, pick a period where approval ratings are out of the ordinary, or display some sort of unusual trend. Explain this trend using the events of that time period. For example, approval rating in January of 2019 was lower than normal due to the government shutdown. Data Analysis: In the beginning of the year 2017, Trump had made some decisions that created controversy; meaning, there were people who agreed and some disagreed with his decisions. As we can see from the graph, we are noticing that the overwhelming majority were not favorable. This could be due to multiple things, some examples are banning migration from seven Muslim countries, failing to repeal the healthcare legislation put in place by Obama. Additionally, he continued to advocate against some demographics which would continue to drive his disapproval ratings higher. While we cannot know all the reasons individual citizens were disliking of Trump, we can argue that his political decisions and morals were tied to his decrease in approval ratings.
Question 2: In the flights data, are delays more likely on certain days of the week? This requires knowing how many flights leave each day.
Answer: The very first thing that I need to do is tell R that I need it to pull my flights dataset. I did this by running this line of code: library(nycflights13). Now my code may look a tiny bit different because I ran into a problem with my data. Somehow information got mixed together throughout the course, so the “::” that is in the lines following is the correction for the data. With that problem resolved, I was able to move forward with the coding for this question. First, I want to be able to distinguish between the delays on each day of the week, so I mutate my new variable called “day_of_week”. From there, I need to create another variable that is representative of the delay statistics, which I called “delay_stats”. From there I code the mathematical equations and the range for the delay. Then, created the total_delay_flights which is representative of the sum of all delays. Once all of this coding was done, I was able to move onto plotting the data. I tell R that the data is going to be coming from the delay_stats data with my x variable being days of the week and my y variable being the mean delay that was calculated above. In the error bar function, I am telling R that the standard deviation is the mean of delay and setting the range. Then I enter in the labs so everything has titles including the x and y axis..
Data Analysis: Based on the bar graph, we can see that the day with the highest chances of a flight delay is Friday, barely being longer than Monday. Interestingly, Friday has the highest chance of a delay and has the most flights. While we can be quick to assume that this is a direct correlation, we have no way to determine cause and effect (at least from this graph alone). Continuing to solidify the possibility of a causal relationship, the day with the least amount of flights also has the smallest delay. Hmm… this is an interesting finding, one that could be pursued further if planning a trip.
flights <- nycflights13::flights
flights <- flights %>%
mutate(day_of_week = wday(dep_time, label = TRUE))
delay_stats <- flights %>%
drop_na()%>%
group_by(day_of_week) %>%
summarise(
mean_delay = mean(dep_delay, na.rm = TRUE),
median_delay = median(dep_delay, na.rm = TRUE),
max_delay = max(dep_delay, na.rm = TRUE),
min_delay = min(dep_delay, na.rm = TRUE),
total_delayed_flights = sum(dep_delay > 0, na.rm = TRUE)
)
ggplot(delay_stats, aes(x = day_of_week, y = mean_delay)) +
geom_col() +
geom_errorbar(aes(ymin = mean_delay - sd(mean_delay), ymax = mean_delay + sd(mean_delay)),
width = 0.2, color = "blue", alpha = 0.7) +
labs(title = "Mean Delay by Day of the Week",
x = "Day of the Week",
y = "Mean Delay (Minutes)") +
theme_minimal()
glimpse(delay_stats)
## Rows: 7
## Columns: 6
## $ day_of_week <ord> Sun, Mon, Tue, Wed, Thu, Fri, Sat
## $ mean_delay <dbl> 78.84000, 97.25714, 67.61538, 62.26923, 81.85000…
## $ median_delay <dbl> 6, 72, 4, 9, 75, 71, 8
## $ max_delay <dbl> 271, 302, 298, 334, 225, 486, 277
## $ min_delay <dbl> 2, 3, 4, 5, 6, 7, 8
## $ total_delayed_flights <int> 25, 35, 26, 26, 20, 22, 22
Question 3: Solve problem 16.3.4.4 in the textbook: How does the average delay time change over the course of a day? Should you use dep_time or sched_dep_time? Why?
Answer: First, I am telling R that I am converting the times needed to complete the graph that I am trying to create. I do this by using the functions in lubridate, which allows me to alter the formatting of the time. Once I use all of the functions needed for the conversions, I can begin to look at variables I need to use to accomplish my outcome goal. For this I wanted to specifically look at arrival delay and the scheduled departure time because this would give me insight into how long the delay for each plane was. Once I tell R that I want to use these variables, I am able to plot the outcome.
Data Analysis: As the day continues, beginning at 5:00 am, the average flight delays increase throughout the day. It reaches a peak at 6:00 pm where for the first time during the day, the delay reduces. In other words, as the day continues there is a greater average delay amongst the flights. This could be because of weather changes throughout the day or other factors not analyzed in this graph (location, number of passengers, plane age, and so forth). All we can analyze from this graph is that the later in the day the flight is, the more delay there is the potential for. If you want your least chances of dealing with a delay, then fly in the early hours of the morning, or the super late hours of the evening when the delay averages are beginning to decrease at 6:00 pm.
make_datetime_100 <- function(year, month, day, time) {
make_datetime(year, month, day, time %/% 100, time %% 100)
}
flights_dt <- flights %>%
filter(!is.na(dep_time), !is.na(arr_time)) %>%
mutate(
dep_time = make_datetime_100(year, month, day, dep_time),
arr_time = make_datetime_100(year, month, day, arr_time),
sched_dep_time = make_datetime_100(year, month, day, sched_dep_time),
sched_arr_time = make_datetime_100(year, month, day, sched_arr_time)
) %>%
select(origin, dest, ends_with("delay"), ends_with("time"))
sched_dep <- flights_dt %>%
mutate(minute = minute(sched_dep_time)) %>%
group_by(minute) %>%
summarise(
avg_delay = mean(arr_delay, na.rm = TRUE),
n = n()
)
flights_dt %>%
mutate(sched_dep_hour = hour(sched_dep_time)) %>%
group_by(sched_dep_hour) %>%
summarise(dep_delay = mean(dep_delay)) %>%
ggplot(aes(y = dep_delay, x = sched_dep_hour)) +
geom_point() +
geom_smooth() +
labs(title = "Average Delay Over the Day",
x = "Hour of the Day",
y = "Departure Delay") +
theme()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Question 4: In addition to delays, there are also cancellations. When, during the day, is your flight most likely to be cancelled? Group the flights into 15 minute intervals, and then plot when your flight is most likely to be cancelled. For simplicity, let’s pool days/weeks etc. together. For simplicity, start the day at 5 AM (since effectively no flights are scheduled to depart before then). Hint: the rounding functions and update will be helpful in answering this question.
Answer: First, I begin by making a new variable to use my lubridate to convert the time, then create another variable for cancelled flights using arrival time because if the arrival time is 0, then the flight was cancelled. I binned the data, meaning I created the 15 minute intervals. Then I began the plotting process and adding my labels.
Data Analysis: Throughout the day, a flight is most likely to be cancelled at around 6:30 pm, but then the likelihood drops quickly. There are spikes throughout the day where the chance of a cancellation increases, but then comes back down. We can see from the graph that the later in the day it is, the more likely the cancellation will occur. Beginning in the morning, there are very little chances of having a flight cancellation; however, it spikes at 6:00 am then remains to increase and drop over the next several hours as we start to see the increase be higher than the drop.
make_datetime_100 <- function(year, month, day, time) {
make_datetime(year, month, day, time %/% 100, time %% 100)
}
flights_cancelled <- flights %>%
filter(is.na(dep_time), !is.na(sched_dep_time)) %>%
mutate(
sched_dep_time = make_datetime_100(year, month, day, sched_dep_time),
can_day = update(sched_dep_time, yday = 1),
binned_can = floor_date(can_day, minutes(15))
)
ggplot(flights_cancelled, aes(x = binned_can)) +
geom_histogram(binwidth = 900) +
labs(x = "Scheduled Time of Departure",
y = "Number of Cancelled Flights",
title = "Cancelled Flights Over the Day")
Question 5: Download stock data for the last few years from Apple (AAPL) and Microsoft (MSFT) using the code below, grabbing the stock prices by day using the tidyquant package. This package allows you to super easily pull stock price data into R directly from Yahoo Finance’s API. Cool, right? install.packages(“tidyquant”) library(tidyquant) prices <- tq_get(c(‘AAPL’,‘MSFT’), from = “2017-01-01”, to = “2020-04-17”, get = “stock.prices”) Answer: This coding was copied from the homework as directed in the directions. To answer the last part, yes this is cool that I was able to load in so much information with such a little but of code. Makes me wonder what else we can do that we have not gotten to this course.
prices <- tq_get(c('AAPL','MSFT'),
from = "2017-01-01",
to = "2020-04-17",
get = "stock.prices")
Question 5a: Plot the Apple and Microsoft stock prices over time - what trends can you notice? Are there real-world events that correlate to spikes or dips in price? Feel free to use the open, high, low, close, or adjusted prices for your plot. Just make sure to label your chart appropriately so we know what we’re looking at! Answer: For this code, I need to assign the symbols so R knows what I am referring to. Then I can plot all the information installed from the previous data that was brought in. Then I do a similar code to the first question to get two different lines, one referring to Apple and one referring to Microsoft. Then I make all of my labels. Data Analysis: From the graph, I can see that the stock prices of Microsoft from 2020 is higher than Apple and remains higher than Apple to this year. Additionally, since the start of 2023, Microsoft has seen a significant increase in their stock prices. Apple has been lower than Microsoft and has remained that way. While Apple, overall, has increased in their stock prices, they have had some drops in 2021 and at the beginning of 2023. In 2023, Microsoft has seen an increase in their stocks since purchased a controlling interest share in Open AI, which has been beneficial. Open AI is a privately traded technology company that run Chat GPT. For Apple, they have had some new devices come to the market which could benefit their stocks; however, they have not come out with anything surprising that would have people running to the stores. Their newest device, which is not going to be significant on this graph, is unique; however, not within the pricing for less than high class. This could be factoring into their stocks not taking as much of an increase as Microsoft.
getSymbols(c("AAPL", "MSFT"), from = "2020-01-01", to = Sys.Date())
## [1] "AAPL" "MSFT"
ggplot() +
geom_line(aes(x = index(AAPL), y = AAPL$AAPL.Adjusted, color = "Apple"), size = 1) +
geom_line(aes(x = index(MSFT), y = MSFT$MSFT.Adjusted, color = "Microsoft"), size = 1) +
labs(title = "Apple and Microsoft Stock Prices Over Time",
x = "Date",
y = "Adjusted Closing Price",
color = "Stock") +
scale_x_date(date_breaks = "1 year", date_labels = "%Y") +
theme_minimal()
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Don't know how to automatically pick scale for object of type <xts/zoo>.
## Defaulting to continuous.
Question 5b: Now, pick two other stocks and look at their performance over time from the beginning of February 2020 until now. Try to pick one stock that you would expect has gone up in price over this time period and one that you would expect has gone down over this time period and explain why that might be the case. You can search various companies in the search bar on the Yahoo Finance site and find out what their stock symbols are: https://finance.yahoo.com/ Answer: First, I make my symbols that I can use within the coding, then I can get straight into plotting this information. I use two geom_line() twice to make two lines one for Amazon and one for Exxon Mobil. Then I add in all of my titles to make the graph easy to read. Data Analysis: I thought Amazon stock prices would be flourishing since online shopping has increased since the pandemic; however, after the pandemic their stocks has decreased. Now, the stock prices are on the rise, but based on the data I am seeing, I would not be surprised if their stock price comes crashing down again. I would have thought Exxon Mobil would have decreased since fuel prices have decreased. Exxon Mobil have continued to increase over the passed 4 years.
getSymbols(c("AMZN", "XOM"), from = "2020-02-01", to = Sys.Date())
## [1] "AMZN" "XOM"
ggplot() +
geom_line(aes(x = index(AMZN), y = AMZN$AMZN.Adjusted, color = "Amazon"), size = 1) +
geom_line(aes(x = index(XOM), y = XOM$XOM.Adjusted, color = "Exxon Mobil"), size = 1) +
labs(title = "Amazon and Exxon Mobil Stock Prices Over Time",
x = "Date",
y = "Adjusted Closing Price",
color = "Stock") +
scale_x_date(date_breaks = "6 months", date_labels = "%Y-%m") +
theme_minimal()
## Don't know how to automatically pick scale for object of type <xts/zoo>.
## Defaulting to continuous.
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
library(readr)
library(lubridate)
library(nycflights13)
library(tidyquant)
library(quantmod)
library(scales)
trump_approval_ratings <- read_csv("~/Dropbox/DATA101/Data/Raw/538_Trump_Approval.csv")
new_trump <- read_csv("https://projects.fivethirtyeight.com/trump-approval-data/approval_topline.csv")
new_trump %>%
filter(subgroup == "All polls") %>%
group_by(modeldate) %>%
summarise(approve_estimate_average = mean(approve_estimate),
disapprove_estimate_average = mean(disapprove_estimate))%>%
mutate(modeldate = mdy(modeldate))%>%
ggplot() +
geom_point(mapping = aes(x = modeldate, y = approve_estimate_average, color = "Approval Estimate")) +
geom_point(mapping = aes(x = modeldate, y = disapprove_estimate_average, color = "Disapproval Estimate")) +
scale_color_manual(values = c("Approval Estimate" = "blue", "Disapproval Estimate" = "red")) +
labs(title = "Trump Approval and Disapproval Estimates",
x = "Dates",
y = "Estimate Averages",
color = "Estimate Type") +
scale_x_date(date_breaks = "year", date_labels = "%b %Y")+
theme(axis.text.x = element_text(size = 8)) +
theme(legend.position = c("bottom"),
legend.title = element_text(size = 12),
legend.text = element_text(size = 10),
axis.text.y = element_text(size = 10, color = "black"),
text = element_text(size = 12, color = "black"))
flights <- nycflights13::flights
flights <- flights %>%
mutate(day_of_week = wday(dep_time, label = TRUE))
delay_stats <- flights %>%
drop_na()%>%
group_by(day_of_week) %>%
summarise(
mean_delay = mean(dep_delay, na.rm = TRUE),
median_delay = median(dep_delay, na.rm = TRUE),
max_delay = max(dep_delay, na.rm = TRUE),
min_delay = min(dep_delay, na.rm = TRUE),
total_delayed_flights = sum(dep_delay > 0, na.rm = TRUE)
)
ggplot(delay_stats, aes(x = day_of_week, y = mean_delay)) +
geom_col() +
geom_errorbar(aes(ymin = mean_delay - sd(mean_delay), ymax = mean_delay + sd(mean_delay)),
width = 0.2, color = "blue", alpha = 0.7) +
labs(title = "Mean Delay by Day of the Week",
x = "Day of the Week",
y = "Mean Delay (Minutes)") +
theme_minimal()
glimpse(delay_stats)
make_datetime_100 <- function(year, month, day, time) {
make_datetime(year, month, day, time %/% 100, time %% 100)
}
flights_dt <- flights %>%
filter(!is.na(dep_time), !is.na(arr_time)) %>%
mutate(
dep_time = make_datetime_100(year, month, day, dep_time),
arr_time = make_datetime_100(year, month, day, arr_time),
sched_dep_time = make_datetime_100(year, month, day, sched_dep_time),
sched_arr_time = make_datetime_100(year, month, day, sched_arr_time)
) %>%
select(origin, dest, ends_with("delay"), ends_with("time"))
sched_dep <- flights_dt %>%
mutate(minute = minute(sched_dep_time)) %>%
group_by(minute) %>%
summarise(
avg_delay = mean(arr_delay, na.rm = TRUE),
n = n()
)
flights_dt %>%
mutate(sched_dep_hour = hour(sched_dep_time)) %>%
group_by(sched_dep_hour) %>%
summarise(dep_delay = mean(dep_delay)) %>%
ggplot(aes(y = dep_delay, x = sched_dep_hour)) +
geom_point() +
geom_smooth() +
labs(title = "Average Delay Over the Day",
x = "Hour of the Day",
y = "Departure Delay") +
theme()
make_datetime_100 <- function(year, month, day, time) {
make_datetime(year, month, day, time %/% 100, time %% 100)
}
flights_cancelled <- flights %>%
filter(is.na(dep_time), !is.na(sched_dep_time)) %>%
mutate(
sched_dep_time = make_datetime_100(year, month, day, sched_dep_time),
can_day = update(sched_dep_time, yday = 1),
binned_can = floor_date(can_day, minutes(15))
)
ggplot(flights_cancelled, aes(x = binned_can)) +
geom_histogram(binwidth = 900) +
labs(x = "Scheduled Time of Departure",
y = "Number of Cancelled Flights",
title = "Cancelled Flights Over the Day")
prices <- tq_get(c('AAPL','MSFT'),
from = "2017-01-01",
to = "2020-04-17",
get = "stock.prices")
getSymbols(c("AAPL", "MSFT"), from = "2020-01-01", to = Sys.Date())
ggplot() +
geom_line(aes(x = index(AAPL), y = AAPL$AAPL.Adjusted, color = "Apple"), size = 1) +
geom_line(aes(x = index(MSFT), y = MSFT$MSFT.Adjusted, color = "Microsoft"), size = 1) +
labs(title = "Apple and Microsoft Stock Prices Over Time",
x = "Date",
y = "Adjusted Closing Price",
color = "Stock") +
scale_x_date(date_breaks = "1 year", date_labels = "%Y") +
theme_minimal()
getSymbols(c("AMZN", "XOM"), from = "2020-02-01", to = Sys.Date())
ggplot() +
geom_line(aes(x = index(AMZN), y = AMZN$AMZN.Adjusted, color = "Amazon"), size = 1) +
geom_line(aes(x = index(XOM), y = XOM$XOM.Adjusted, color = "Exxon Mobil"), size = 1) +
labs(title = "Amazon and Exxon Mobil Stock Prices Over Time",
x = "Date",
y = "Adjusted Closing Price",
color = "Stock") +
scale_x_date(date_breaks = "6 months", date_labels = "%Y-%m") +
theme_minimal()