knitr::opts_chunk$set(echo = TRUE, error = TRUE, warning = FALSE, message = FALSE)
library(tidyverse)
library(readxl)
library(lubridate)
library(psych)
library(kableExtra)
Instructions:

1) Identify an image, meme, or graph that was misleading. What is the topic of the example? Provide necessary background information to explain why this topic should matter to the reader.  
2) How did they come up with their disputed statement or image? Why was it misleading? (missing context, unequal baselines, incorrect calculations, misleading axis, etc.) Was it intentionally misleading or an honest mistake? Support your argument.  
3) Where did you find data or information to fact check their statement? Include the source(s) and how the data were collected. What makes this source a reputable source? How would you convince someone else that this source can be trusted more than the original source? If coming from the same source, what makes your version of the data visualization more honest or clear?  
4) Download the data needed to fact check or recreate the original image. Is the file tidy? Check that data are correctly organized (e.g., look at the first few rows, check rows and column numbers as you move to one step to the other, and so on). Make sure to identify the unit of analysis.  
5) Document what steps were necessary to clean your data. Your project should be reproducible. If I wanted to replicate your findings, I should be able to find the data, clean it, and analyze it based off of the information provided in the report.  The code should be well annotated and provide adequate comments to each phase of the analysis.
6) Identify the types of variables being used and include appropriate descriptive statistics. You must tie in descriptive statistics concepts from the course when exploring your data.
7) Explore your variables both by looking at summary statistics and visualizing them on exploratory plots (e.g., histograms, density plots, bar plots, boxplot…). You should focus on variables that are of interest to you based on your topic.
8) Calculate the statistics necessary to support your argument in R. Explain why you used certain methods or statistics and why they may be a better way to communicate accurate information.  
9) Create at least one explanatory graph with a title, labels, etc. that supports your argument.  
10) Create at least one summary table with a title, headers, footnotes, etc.  

Almost all of Texas is on its own power grid, operated by ERCOT, while the rest of the contiguous United States is divided into the Western and Eastern Interconnections. Many of the structural redundancies that protect against power outages either don’t exist in the ERCOT region or haven’t been invested in recently. There is also extensive background on how the power supply operates essentially as a monopoly that can control prices and supply. During Winter Storm Uri in February 2021, much of Texas experienced a cold streak, and demand for heat and energy skyrocketed (as did prices). Demand increased so much that ERCOT ordered rolling blackouts to balance power supply and demand. Because of the freezing temperatures, many energy sources experienced difficulties and failures. During the storm, a lot of blame was placed on wind turbines and other green energy sources. This report examines the power output from each energy source during the cold streak to determine whether blaming green energy sources was justified.

Texas Governor Greg Abbott claimed that solar and wind were to blame for the power outages and that fossil fuels were the only reliable way to power electric grids. Other conservative leaders and influential talking heads said similar things across various forms of media. “Unbeknownst to most people, the Green New Deal came to Texas, the power grid in the state became totally reliant on windmills,” Fox News host Tucker Carlson said Feb. 16. “Then it got cold, and the windmills broke, because that’s what happens in the Green New Deal.” Carlson also warned that “the same energy policies that have wrecked Texas this week are going nationwide.”

Add more context and whatnot

Some of the images that inspired me:

Which source failed?

Show change over time for each energy source. Maybe use a double y-axis to add temperature change over time; must be used with caution. Graphs with double axes have been used to misrepresent data because the two y axes do not share a common scale. Show that proportionally, natural gas decreased the largest amount (in raw numbers and percentage) and that “green energy sources” were relatively reliable during the freezing weather. Calculate percent of total energy that comes from each source, similar to HW 1.
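The double y-axis idea above could be sketched roughly as follows. This is only an illustration under stated assumptions: the `toy` data frame and its values are made up (they are not the ERCOT or temperature data), and `scale_factor` is a hypothetical helper that maps temperature onto the power scale. The secondary axis is a linear rescaling of the primary one, which is exactly why such plots need caution: changing the factor changes how the two lines appear to relate.

```r
library(ggplot2)

# Hypothetical toy values, NOT the ERCOT or temperature data
toy <- data.frame(
  day   = as.Date("2021-02-10") + 0:5,
  power = c(40000, 42000, 43000, 30000, 28000, 35000),  # made-up MW
  temp  = c(45, 38, 25, 12, 15, 30)                      # made-up degrees F
)

# Assumed linear map from the temperature range onto the power range
scale_factor <- max(toy$power) / max(toy$temp)

p <- ggplot(toy, aes(day)) +
  geom_line(aes(y = power)) +                                   # primary axis
  geom_line(aes(y = temp * scale_factor), linetype = "dashed") + # rescaled temperature
  scale_y_continuous(
    name = "Power (MW)",
    # The secondary axis undoes the rescaling so its labels read in degrees
    sec.axis = sec_axis(~ . / scale_factor, name = "Temperature (F)")
  ) +
  labs(x = NULL, title = "Toy example: power and temperature on two y-axes") +
  theme_bw()
```

Printing `p` draws the plot. Labeling both axes and saying in the caption that the scales are independent helps keep the graph honest.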

Reminder: Leave a space between your headings’ ## and your words so that it formats correctly!

Not part of the project but just interesting: Barely Related Link: The author looked at the additional deaths that occurred from the storm. Includes analysis and R code.

Data Preparation

The ERCOT power data is publicly available from ERCOT’s website and the EIA’s energy tracking website. Information is collected hourly on megawatt amounts from multiple categories of energy sources.

Originally, I downloaded the data as a massive CSV file and kept only the data I needed. Later, I figured out how to use an API to make the data collection reproducible for anyone to download (after they get a unique EIA API key; see Methodology section). Date ranges included in the dataset go from January 1st to August 13th, the day the data was originally downloaded using the API.

Do echo = FALSE if you do not want the code to appear in the document with the output.

The full download contained 5,382 rows, where each row is an hourly update on power output, with dates running from January 1st to August 13th (when the data was originally downloaded). Columns consist of a timestamp and one column per energy category (solar, wind, coal, etc.). Originally, the structure was wide, with each energy source as its own variable. I wanted a longer format, limited to February, in which all the energy sources become one variable named “source”. To do this, I used pivot_longer() and filtered for a range of dates.

Texas_wide also does not have the dates/times separated. They are in one variable named date. For summarizing and graphing purposes, I may want to have date and time as separate variables and rename the original variable to datetime since it contains both items.

Texas_feb <- Texas_wide %>% 
  arrange(date) %>%
  filter(date >= as.Date("2021-02-01") & date <= as.Date("2021-02-28") ) %>%
  rename(datetime = date) %>%
  mutate(date = as_date(datetime),      # a new column appeared! stored as date
         time = hms::as_hms(datetime) ) # stored as S3:hms (format for time)

Texas_long <- Texas_wide %>% 
  arrange(date) %>%
  filter(date >= as.Date("2021-02-01") & date <= as.Date("2021-02-28") ) %>%
  pivot_longer(!date,  
               names_to = "source",
               values_to = "power") %>%
  rename(datetime = date) %>%
  mutate(date = as_date(datetime),      # a new column appeared! stored as date
         time = hms::as_hms(datetime) ) # stored as S3:hms (format for time)

str(Texas_long)
## tibble [3,894 x 5] (S3: tbl_df/tbl/data.frame)
##  $ datetime: POSIXct[1:3894], format: "2021-02-01 00:00:00" "2021-02-01 00:00:00" ...
##  $ source  : chr [1:3894] "Natural Gas" "Wind" "Coal" "Solar" ...
##  $ power   : num [1:3894] 13969 8957 7845 1697 50 ...
##  $ date    : Date[1:3894], format: "2021-02-01" "2021-02-01" ...
##  $ time    : 'hms' num [1:3894] 00:00:00 00:00:00 00:00:00 00:00:00 ...
##   ..- attr(*, "units")= chr "secs"

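My new data frame containing information only for February, Texas_long, has 3,894 rows and 5 columns. Each row represents the hourly output from a specific energy source (i.e., there are multiple rows for each hourly update): 649 hourly timestamps × 6 sources = 3,894 rows. Note that the last timestamp kept is 2021-02-28 00:00:00 (see the summary output), so only the first hour of February 28th is included.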

  • datetime stored as POSIXct
  • source stored as character
  • power stored as numeric
  • date stored as date
  • time stored as hour-min-seconds
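If the whole of February 28th should be included, one option is to filter against timestamp bounds with an exclusive upper bound. The sketch below is a toy illustration, not the report's code: `toy_wide`, `feb_start`, and `mar_start` are hypothetical names, the values are made up, and UTC is an assumed time zone. It also checks that the long format has hours × sources rows.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy grid: hourly stamps from Jan 31 through Mar 1, two made-up sources
toy_wide <- tibble(
  datetime = seq(as.POSIXct("2021-01-31 00:00:00", tz = "UTC"),
                 as.POSIXct("2021-03-01 23:00:00", tz = "UTC"),
                 by = "hour"),
  Wind = 1,
  Coal = 2
)

# Bounds stored as POSIXct so both sides of the comparison have the same type
feb_start <- as.POSIXct("2021-02-01", tz = "UTC")
mar_start <- as.POSIXct("2021-03-01", tz = "UTC")

# Keep every hour of February: inclusive lower bound, exclusive upper bound
toy_feb <- toy_wide %>%
  filter(datetime >= feb_start, datetime < mar_start)

toy_long <- toy_feb %>%
  pivot_longer(!datetime, names_to = "source", values_to = "power")

# Sanity check: long rows = hourly timestamps x number of sources
n_hours   <- nrow(toy_feb)
n_sources <- n_distinct(toy_long$source)
stopifnot(nrow(toy_long) == n_hours * n_sources)
```

With 28 full days the toy filter keeps 28 × 24 = 672 hours. Applying the same approach to the real data would change the row counts reported above.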

Temperature

temp %>%
  ggplot() +
  geom_line(aes(date, temp_daily_avg)) +
  theme_bw()

Descriptive Statistics

  • Describe the data: How many observations? How many variables are there? What kind of variables are they (categorical, continuous, or ordinal)? How are they stored when you read your data into R? Was your data tidy? Do you want it in long or wide format (or both)?

  • What are the main variables you are using for your example? Provide appropriate descriptive statistics given the variable type. (Range, median, mean, distribution shape, count, etc.).

Do not just run a command and include the output. Interpret/summarize the key statistics in a sentence or two! Communicate to the reader!

summary(Texas_feb)
##     datetime                    Natural Gas         Wind            Coal      
##  Min.   :2021-02-01 00:00:00   Min.   : 4758   Min.   :  649   Min.   : 3873  
##  1st Qu.:2021-02-07 18:00:00   1st Qu.:11625   1st Qu.: 4409   1st Qu.: 6964  
##  Median :2021-02-14 12:00:00   Median :16479   Median : 7416   Median : 8855  
##  Mean   :2021-02-14 12:00:00   Mean   :20087   Mean   : 8608   Mean   : 8446  
##  3rd Qu.:2021-02-21 06:00:00   3rd Qu.:29812   3rd Qu.:12358   3rd Qu.:10477  
##  Max.   :2021-02-28 00:00:00   Max.   :43967   Max.   :22415   Max.   :11693  
##      Solar            Water           Nuclear          date           
##  Min.   :   0.0   Min.   : 43.00   Min.   :3780   Min.   :2021-02-01  
##  1st Qu.:   0.0   1st Qu.: 48.00   1st Qu.:5100   1st Qu.:2021-02-07  
##  Median :   2.0   Median : 72.00   Median :5115   Median :2021-02-14  
##  Mean   : 981.1   Mean   : 79.76   Mean   :4958   Mean   :2021-02-14  
##  3rd Qu.:1680.0   3rd Qu.: 97.00   3rd Qu.:5136   3rd Qu.:2021-02-21  
##  Max.   :4957.0   Max.   :343.00   Max.   :5149   Max.   :2021-02-28  
##      time         
##  Length:649       
##  Class1:hms       
##  Class2:difftime  
##  Mode  :numeric   
##                   
## 
Texas_feb %>%
  select(`Natural Gas`:Nuclear) %>%
  describe(fast = TRUE)

ggplot(Texas_long) +
  geom_line(aes(datetime, power, color = source)) +
  theme_classic()

The exploratory graph above shows the amount of power produced from each energy source from February 1st to February 28th.

Texas_long %>%
  group_by(source) %>%
  summarize(feb_sum = sum(power)) %>%
  ggplot(aes(source, feb_sum)) +
  geom_col() +
  theme_minimal()

Monthly Summary Table

The simple table below shows how much each source produced during the month of February, both as a raw total (the sum of the hourly values, labeled Megawatts in the code) and as a proportion of total output from all energy sources:

Texas_long %>%
  group_by(source) %>%
  summarize(Megawatts = sum(power)) %>%
  mutate(Percent = round(prop.table(Megawatts), digits = 3)) 
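To turn a summary like the one above into a table with a title, headers, and a footnote, something along these lines might work with kableExtra, which is already loaded in this report. This is a sketch on made-up numbers: `toy_long` here is a hypothetical stand-in rather than the ERCOT data, and the caption, header, and footnote text are placeholders.

```r
library(dplyr)
library(kableExtra)

# Hypothetical toy values, NOT the ERCOT figures
toy_long <- tibble(
  source = c("Natural Gas", "Natural Gas", "Wind", "Wind"),
  power  = c(300, 300, 150, 250)
)

# Total per source and its share of the overall total, in percent
toy_summary <- toy_long %>%
  group_by(source) %>%
  summarize(MWh = sum(power)) %>%
  mutate(Percent = round(100 * MWh / sum(MWh), 1))

# Title via caption, readable headers via col.names, footnote via footnote()
toy_table <- toy_summary %>%
  kbl(caption = "Toy example: total output by source",
      col.names = c("Source", "Total output (MWh)", "Share of total (%)")) %>%
  kable_styling(full_width = FALSE) %>%
  footnote(general = "Made-up values for illustration only.")
```

Printing `toy_table` inside a knitted chunk renders the formatted table. The footnote is a natural place to name the data source and the download date.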

Percentage Change

Not complete. Ideally one table that shows the difference between before the storm and during the cold streak where the power output dropped and show how much each changed.

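HourlyChange <- Texas_long %>%
  group_by(source) %>% 
  arrange(datetime, .by_group = TRUE) %>%
  # hour-over-hour percent change; NA when the previous hour was 0 (e.g. solar
  # at night), since dividing by zero would give Inf or NaN
  mutate(pct_change = if_else(lag(power) == 0, NA_real_,
                              (power / lag(power) - 1) * 100))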


Texas_long %>%
  filter(date == as.Date("2021-02-15")) %>%
  group_by(source) %>%
  summarize(dailysum15 = sum(power)) %>%
  mutate(Percent = round(prop.table(dailysum15), digits = 3))

Texas_long %>%
  filter(date == as.Date("2021-02-25")) %>%
  group_by(source) %>%
  summarize(Megawatts = sum(power)) %>%
  mutate(Percent = round(prop.table(Megawatts), digits = 3))
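One possible shape for the single before-versus-during table described above is sketched here on made-up data. The `toy_long` values are hypothetical, and `storm_start` is an assumed cutoff date that would need to be chosen from the actual timeline. Each source gets its mean output before and during the cold streak, plus the raw and percent change.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data, NOT the ERCOT figures
toy_long <- tibble(
  date   = rep(as.Date(c("2021-02-08", "2021-02-09", "2021-02-15", "2021-02-16")),
               times = 2),
  source = rep(c("Natural Gas", "Wind"), each = 4),
  power  = c(100, 100, 80, 70, 50, 50, 40, 50)
)

# Assumed cutoff: rows on or after this date count as "during" the cold streak
storm_start <- as.Date("2021-02-15")

toy_change <- toy_long %>%
  mutate(period = if_else(date < storm_start, "before", "during")) %>%
  group_by(source, period) %>%
  summarize(mean_power = mean(power), .groups = "drop") %>%
  # One row per source, with "before" and "during" side by side
  pivot_wider(names_from = period, values_from = mean_power) %>%
  mutate(change     = during - before,
         pct_change = round(100 * change / before, 1))
```

Reporting both the raw change and the percent change matters here, because a source with a small baseline can show a large percent swing from a small change in megawatts.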

Recoding

“Thermal unit category” includes natural gas, coal and nuclear power. Recode using this definition and green energy sources just for fun.

Texas_long %>%
  mutate(
    Energy = case_when(
      source %in% c("Wind", "Water", "Solar") ~ "Green",
      source %in% c("Coal", "Nuclear", "Natural Gas") ~ "ThermalUnit") ) %>%
  group_by(Energy) %>%
  summarize(Megawatts = sum(power))

Correlation

Discuss comments on correlations. What goes up the most as the other goes down? Throw that into the discussion of the topic and assumptions that were made at the time.

cor(Texas_feb$Wind, Texas_feb$`Natural Gas`) # 2 variables
## [1] -0.7450777
# correlation matrix
Texas_feb %>%
    select(`Natural Gas`:Nuclear) %>%
    cor(use = "pairwise.complete.obs")
##             Natural Gas        Wind        Coal        Solar        Water
## Natural Gas   1.0000000 -0.74507769  0.54098879 -0.179418037  0.473690492
## Wind         -0.7450777  1.00000000 -0.53655719 -0.063957320 -0.276937349
## Coal          0.5409888 -0.53655719  1.00000000 -0.038273160  0.155804984
## Solar        -0.1794180 -0.06395732 -0.03827316  1.000000000  0.009598617
## Water         0.4736905 -0.27693735  0.15580498  0.009598617  1.000000000
## Nuclear      -0.3271192  0.40138951  0.14874660  0.034718663 -0.281910443
##                 Nuclear
## Natural Gas -0.32711919
## Wind         0.40138951
## Coal         0.14874660
## Solar        0.03471866
## Water       -0.28191044
## Nuclear      1.00000000

Communicating Implications / Conclusion

In the early morning hours of Feb. 15, natural gas generation dropped 23% by 4 a.m., a total of about 10,000 megawatts on a system that was running about 65,000 megawatts in total at midnight. That morning ERCOT started rolling blackouts.
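As a rough consistency check on the figures quoted above, the arithmetic can be worked through directly. The three inputs are the approximate numbers from the paragraph; the two derived quantities are back-of-the-envelope estimates, not values read from the data.

```r
drop_mw   <- 10000   # approximate drop in natural gas output by 4 a.m. (MW)
drop_pct  <- 0.23    # quoted percent drop in natural gas output
system_mw <- 65000   # approximate total system output at midnight (MW)

# If 10,000 MW is 23% of the starting level, the starting level is about 43,478 MW
implied_gas_midnight <- drop_mw / drop_pct

# The drop as a share of everything running at midnight: about 0.154, or 15.4%
share_of_system <- drop_mw / system_mw
```

The implied starting level of roughly 43,500 MW is in line with the maximum natural gas value of 43,967 in the summary output, which suggests the quoted figures and the downloaded data are broadly consistent.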

So, it’s true that wind plays a significant role in Texas’ power supply — the state actually generates more wind energy than any other state in the nation — but there’s no indication that wind energy was the primary cause of the power outages in Texas.

Blah blah blah add more stuff that relates to your graphs and tables to support your argument.

Sources

Energy Information Administration (EIA) The Hourly Electric Grid Monitor

“Wind Turbines Didn’t Cause Texas Energy Crisis” FactCheck.org

“How Fox News, far-right TV blamed green energy for Texas’ power outages” Politifact

Methodology

#' Texas's Energy Output
#'
#' Energy output (megawatthours) from each source for Texas during the winter storm.
#' Data are from the U.S. Energy Information Administration
#' You will need YOUR OWN EIA API key to download the data through their API
#' Get an API key here: https://www.eia.gov/developer/
#' Check out the Hourly Electric Grid Monitor here: https://www.eia.gov/electricity/gridmonitor/dashboard/electric_overview/US48/US48
#' @format A data frame with 5382 rows and 7 variables:
#' \describe{
#'   \item{date}{Date and time of measurement}
#'   \item{Natural Gas}{Supply of energy in megawatthours from Natural Gas}
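#'   \item{Wind}{Supply of energy in megawatthours from Wind}
#'   \item{Coal}{Supply of energy in megawatthours from Coal}
#'   \item{Solar}{Supply of energy in megawatthours from Solar}
#'   \item{Water}{Supply of energy in megawatthours from Water}
#'   \item{Nuclear}{Supply of energy in megawatthours from Nuclear}
#' }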
#' @source \url{https://www.eia.gov/}
library("eia") # works with API for downloading data from the EIA

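## reproducible way
# Read the key from an environment variable instead of hard-coding it in the report.
# Assumes your own key was saved as EIA_KEY (e.g. in .Renviron).
eia_set_key(Sys.getenv("EIA_KEY"))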

# Prep the responses
base_url = 'http://api.eia.gov/series/?series_id='

variables <- c("EBA.TEX-ALL.NG.NG.H", 
               "EBA.TEX-ALL.NG.WND.H", 
               "EBA.TEX-ALL.NG.COL.H", 
               "EBA.TEX-ALL.NG.SUN.H", 
               "EBA.TEX-ALL.NG.WAT.H", 
               "EBA.TEX-ALL.NG.NUC.H")

# Named eia_list / eia_wide to avoid masking base R's list() and unlist()
eia_list <- eia_series(variables, start = 2021)
# Downloaded on August 13th, the last day of data included.

eia_list$data[[3]] # 3,438 X 5 tibble

# Unnest the per-series data, then spread the six series into one column each
eia_wide <- unnest(eia_list, cols = data) %>%
  select(date, series_id, value) %>%
  pivot_wider(names_from = series_id,
              values_from = value)


# Replace the series IDs with readable source names
storm <- eia_wide %>% rename("Natural Gas" = "EBA.TEX-ALL.NG.NG.H",
                             "Wind" = "EBA.TEX-ALL.NG.WND.H",
                             "Coal" = "EBA.TEX-ALL.NG.COL.H",
                             "Solar" = "EBA.TEX-ALL.NG.SUN.H",
                             "Water" = "EBA.TEX-ALL.NG.WAT.H",
                             "Nuclear" =  "EBA.TEX-ALL.NG.NUC.H")


load("C:/Users/aleaw/OneDrive/Desktop/cuppackage/data/storm.rda")

Texas_wide <- storm

Texas_feb <- Texas_wide %>% 
  arrange(date) %>%
  filter(date >= as.Date("2021-02-01") & date <= as.Date("2021-02-28") ) %>%
  rename(datetime = date) %>%
  mutate(date = as_date(datetime),      # a new column appeared! stored as date
         time = hms::as_hms(datetime) ) # stored as S3:hms (format for time)

Texas_long <- Texas_wide %>% 
  arrange(date) %>%
  filter(date >= as.Date("2021-02-01") & date <= as.Date("2021-02-28") ) %>%
  pivot_longer(!date,  
               names_to = "source",
               values_to = "power") %>%
  rename(datetime = date) %>%
  mutate(date = as_date(datetime),      # a new column appeared! stored as date
         time = hms::as_hms(datetime) ) # stored as S3:hms (format for time)

str(Texas_long)

load("C:/Users/aleaw/OneDrive/Desktop/cuppackage/data/texastemperature.rda")
as_tibble(texastemperature)
# date stored as  character, temp as double
#57 rows, 2 columns

temp <- as_tibble(texastemperature) %>%
  mutate(date = mdy(date)) %>%      # date now stored as date instead of character
  filter(date >= as.Date("2021-02-01") & date <= as.Date("2021-02-28") )  # filter Feb.
temp

temp %>%
  ggplot() +
  geom_line(aes(date, temp_daily_avg))

ggplot(Texas_feb, aes(Coal)) +
  geom_density()

Texas_feb %>% 
ggplot(aes(`Natural Gas`)) + 
  geom_histogram()

ggplot(Texas_feb, aes(Wind)) +
  geom_density()

ggplot(Texas_feb, aes(Nuclear)) +
  geom_density() # interesting. Probably because it's either off or on depending on need.

# To calculate the difference in energy output on February 
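HourlyChange <- Texas_long %>%
  group_by(source) %>% 
  arrange(datetime, .by_group = TRUE) %>%
  # hour-over-hour percent change; NA when the previous hour was 0 (e.g. solar
  # at night), since dividing by zero would give Inf or NaN
  mutate(pct_change = if_else(lag(power) == 0, NA_real_,
                              (power / lag(power) - 1) * 100))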


Texas_long %>%
  filter(date == as.Date("2021-02-15")) %>%
  group_by(source) %>%
  summarize(dailysum15 = sum(power)) %>%
  mutate(Percent = round(prop.table(dailysum15), digits = 3))


Texas_long %>%
  filter(date == as.Date("2021-02-25")) %>%
  group_by(source) %>%
  summarize(Megawatts = sum(power)) %>%
  mutate(Percent = round(prop.table(Megawatts), digits = 3))