Add message = FALSE and warning = FALSE so that extra output is not included in your final document. This prevents the messages that come from loading packages from appearing in the knitted output.

In your setup chunk, you can also add error = TRUE so that the document will knit even if some of the code doesn’t work.
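A minimal sketch of what that setup chunk might contain (this assumes you set the options globally with knitr; adjust to your own template):

knitr::opts_chunk$set(message = FALSE,  # hide package-loading messages
                      warning = FALSE,  # hide warnings in the output
                      error = TRUE)     # keep knitting even if a chunk errors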

Instructions

  1. Identify an image, meme, or graph that was misleading. What is the topic of the example? Provide necessary background information to explain why this topic should matter to the reader.
  2. How did they come up with their disputed statement or image? Why was it misleading? (missing context, unequal baselines, incorrect calculations, misleading axis, etc.) Was it intentionally misleading or an honest mistake? Support your argument.
  3. Where did you find data or information to fact check their statement? Include the source(s) and how the data were collected. What makes this source a reputable source? How would you convince someone else that this source can be trusted more than the original source? If coming from the same source, what makes your version of the data visualization more honest or clear?
  4. Download the data needed to fact-check or recreate the original image and to recalculate the statistics. Is the file tidy? Check that data are correctly organized (e.g., look at the first few rows, check row and column counts as you move from one step to the next, and so on). Make sure to identify the unit of analysis.
  5. Document what steps were necessary to clean your data. Your project should be reproducible: if I wanted to replicate your findings, I should be able to find the data, clean it, and analyze it based on the information provided in the report. The code should be well annotated and provide adequate comments for each phase of the analysis.
  6. Identify the types of variables being used and include appropriate descriptive statistics. You must tie in descriptive statistics concepts from the course when exploring your data.
  7. Explore your variables both by looking at summary statistics and visualizing them on exploratory plots (e.g., histograms, density plots, bar plots, boxplot…). You should focus on variables that are of interest to you based on your topic.
  8. Calculate the statistics necessary to support your argument in R. Explain why you used certain methods or statistics and why they may be a better way to communicate accurate information.
  9. Create at least one explanatory graph with a title, labels, etc. that supports your argument.
  10. Create at least one summary table with a title, headers, footnotes, etc.

Introduction

Make sure to include:

  • How topic was selected
  • Background information on topic
  • Image that inspired you
  • State what was wrong with previous image or statement and how you intend to improve it
  • Statistics topics from class tied into topic discussion

Don’t forget to leave a space between the # and the words for your headers!

Topic

Hyperlinks: See the example in the markdown file for the next sentence. Add [ ] around the words you want to display and ( ) around the URL immediately after, e.g., [EIA](https://www.eia.gov/).

The ERCOT power data is publicly available from ERCOT’s website and the EIA’s energy tracking website. Information is collected hourly on megawatt amounts from multiple categories of energy sources.

Policy Context: Texas Governor Greg Abbott claimed that solar and wind were to blame for the power outages and that fossil fuels were the only reliable way to power electric grids. Other conservative leaders and influential talking heads said similar things across various forms of media.

“Unbeknownst to most people, the Green New Deal came to Texas, the power grid in the state became totally reliant on windmills,” Carlson said Feb. 16. “Then it got cold, and the windmills broke, because that’s what happens in the Green New Deal.”

Carlson also warned that “the same energy policies that have wrecked Texas this week are going nationwide.”

Communication Plan:

Show change over time for each energy source. Maybe use a double y-axis to add temperature change over time; this must be used with caution, since graphs with double axes have been used to misrepresent data because the two y-axes do not share a common scale. Show that, proportionally, natural gas decreased the most (in both raw numbers and percentage) and that “green” energy sources were relatively reliable during the freezing weather. Calculate the percent of total energy that comes from each source, similar to HW 1.
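A rough sketch of what such a dual-axis plot could look like, using the Texas_long and temp data frames built in the Data Preparation section below. The scaling factor of 500 is arbitrary and chosen only so the two series share a plotting range, which is exactly why dual axes demand caution:

ggplot() +
  geom_line(data = Texas_long, aes(datetime, power, color = source)) +
  geom_line(data = temp, aes(as.POSIXct(date), temp_daily_avg * 500),  # temperature rescaled onto the power axis
            linetype = "dashed") +
  scale_y_continuous(name = "Megawatts",
                     sec.axis = sec_axis(~ . / 500, name = "Average daily temperature")) +
  theme_minimal()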

Reminder: Leave a space between your headings’ ## and your words so that it formats correctly!

Data Preparation

Energy Output

Data source: Energy Information Administration (EIA), The Hourly Electric Grid Monitor (a U.S. government website).
  • Hourly measurements of power output for different energy sources.

Originally, I downloaded the data as a massive CSV file and kept only the data I needed. Later, I figured out how to use an API to make the data collection reproducible for anyone to download (after they get a unique EIA API key; see the Methodology section). Dates included in the dataset range from January 1st to August 13th, the day the data were originally downloaded using the API.

Set echo = FALSE in a chunk’s options if you do not want the code to appear in the document along with its output.

The Texas power source data originally contained 5,382 rows, where each row represented an hourly update on power output. Dates begin at January 1st and end at August 13th (when the data were originally downloaded). Columns consisted of a time stamp and categories of energy (solar, wind, coal, etc.). Originally, the structure was a wide format where each energy source was its own variable. However, I wanted to transform my data into a longer format that only contains data from February 1st to February 28th, where all the energy sources become one variable named “source”. To do this, I used pivot_longer() and filtered for a range of dates.

Texas_wide also does not have the dates/times separated. They are in one variable named date. For summarizing and graphing purposes, I may want to have date and time as separate variables and rename the original variable to datetime since it contains both items.

Texas_feb <- Texas_wide %>% 
  arrange(date) %>%
  filter(date >= as.Date("2021-02-01") & date <= as.Date("2021-02-28") ) %>%
  rename(datetime = date) %>%
  mutate(date = as_date(datetime),      # a new column appeared! stored as date
         time = hms::as_hms(datetime) ) # stored as S3:hms (format for time)

Texas_long <- Texas_wide %>% 
  arrange(date) %>%
  filter(date >= as.Date("2021-02-01") & date <= as.Date("2021-02-28") ) %>%
  pivot_longer(!date,  
               names_to = "source",
               values_to = "power") %>%
  rename(datetime = date) %>%
  mutate(date = as_date(datetime),      # a new column appeared! stored as date
         time = hms::as_hms(datetime) ) # stored as S3:hms (format for time)

str(Texas_long)
## tibble [3,894 x 5] (S3: tbl_df/tbl/data.frame)
##  $ datetime: POSIXct[1:3894], format: "2021-02-01 00:00:00" "2021-02-01 00:00:00" ...
##  $ source  : chr [1:3894] "Natural Gas" "Wind" "Coal" "Solar" ...
##  $ power   : num [1:3894] 13969 8957 7845 1697 50 ...
##  $ date    : Date[1:3894], format: "2021-02-01" "2021-02-01" ...
##  $ time    : 'hms' num [1:3894] 00:00:00 00:00:00 00:00:00 00:00:00 ...
##   ..- attr(*, "units")= chr "secs"

My new data frame containing information only for February, Texas_long, has 3,894 rows and 5 columns. Each row represents an hourly output value from a specific energy source (i.e., there are multiple rows for each hourly update, one per source).

  • datetime stored as POSIXct
  • source stored as character
  • power stored as numeric
  • date stored as date
  • time stored as hour-min-seconds

Temperature

load("C:/Users/aleaw/OneDrive/Desktop/cuppackage/data/texastemperature.rda")
as_tibble(texastemperature)
# date stored as character, temp as double
# 57 rows, 2 columns

temp <- as_tibble(texastemperature) %>%
  mutate(date = mdy(date)) %>%      # date now stored as date instead of character
  filter(date >= as.Date("2021-02-01") & date <= as.Date("2021-02-28") )  # filter Feb.
temp
temp %>%
  ggplot() +
  geom_line(aes(date, temp_daily_avg)) +
  theme_bw()

Descriptive Statistics

  • Describe the data: How many observations? How many variables are there? What kind of variables are they (categorical, continuous, or ordinal)? How are they stored when you read your data into R? Was your data tidy? Do you want it in long or wide format (or both)?

  • What are the main variables you are using for your example? Provide appropriate descriptive statistics given the variable type. (Range, median, mean, distribution shape, count, etc.).

Do not just run a command and include the output. Interpret/summarize the key statistics in a sentence or two! Communicate to the reader!
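One possible way to get per-source descriptive statistics from the long data (a sketch, assuming Texas_long as built above):

Texas_long %>%
  group_by(source) %>%
  summarize(min = min(power),
            median = median(power),
            mean = mean(power),
            max = max(power),
            sd = sd(power))   # spread of hourly output within each source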

summary(Texas_feb)
##     datetime                    Natural Gas         Wind            Coal      
##  Min.   :2021-02-01 00:00:00   Min.   : 4758   Min.   :  649   Min.   : 3873  
##  1st Qu.:2021-02-07 18:00:00   1st Qu.:11625   1st Qu.: 4409   1st Qu.: 6964  
##  Median :2021-02-14 12:00:00   Median :16479   Median : 7416   Median : 8855  
##  Mean   :2021-02-14 12:00:00   Mean   :20087   Mean   : 8608   Mean   : 8446  
##  3rd Qu.:2021-02-21 06:00:00   3rd Qu.:29812   3rd Qu.:12358   3rd Qu.:10477  
##  Max.   :2021-02-28 00:00:00   Max.   :43967   Max.   :22415   Max.   :11693  
##      Solar            Water           Nuclear          date           
##  Min.   :   0.0   Min.   : 43.00   Min.   :3780   Min.   :2021-02-01  
##  1st Qu.:   0.0   1st Qu.: 48.00   1st Qu.:5100   1st Qu.:2021-02-07  
##  Median :   2.0   Median : 72.00   Median :5115   Median :2021-02-14  
##  Mean   : 981.1   Mean   : 79.76   Mean   :4958   Mean   :2021-02-14  
##  3rd Qu.:1680.0   3rd Qu.: 97.00   3rd Qu.:5136   3rd Qu.:2021-02-21  
##  Max.   :4957.0   Max.   :343.00   Max.   :5149   Max.   :2021-02-28  
##      time         
##  Length:649       
##  Class1:hms       
##  Class2:difftime  
##  Mode  :numeric   
##                   
## 
Texas_feb %>%
  select(`Natural Gas`:Nuclear) %>%
  describe(fast = TRUE)   # describe() comes from the psych package
ggplot(Texas_long) +
  geom_line(aes(datetime, power, color= source)) + 
  theme_classic()

The exploratory graph above shows the amount of power produced from each energy source from February 1st to February 28th.

Texas_long %>%
  group_by(source)%>%
  summarize(feb_sum = sum(power)) %>%
ggplot(aes(source, feb_sum) ) + 
  geom_col() + theme_minimal()

ggplot(Texas_feb, aes(Solar)) + 
  geom_histogram()

Texas_feb %>% 
ggplot(aes(`Natural Gas`)) + 
  geom_histogram()

ggplot(Texas_feb, aes(Wind)) +
  geom_density()

ggplot(Texas_feb, aes(Nuclear)) +
  geom_density() # interesting. Probably because it's either off or on depending on need.

ggplot(Texas_feb, aes(Coal)) +
  geom_density()

Monthly Summary Table

The simple table below shows how many megawatts each source produced during the month of February, both as a raw count of megawatts and as a percentage of total output from all energy sources:

Texas_long %>%
  group_by(source) %>%
  summarize(Megawatts = sum(power)) %>%
  mutate(Percent = round(prop.table(Megawatts), digits = 3)) %>%
  kbl(caption = "February 2021 power output by source") %>%  # kbl() (from kableExtra) builds the table before styling it
  kable_classic(full_width = FALSE)

Percentage Change

Not complete.

To calculate the hour-to-hour percentage change in energy output for each source, and then the share of total daily output from each source on February 15th and February 25th:

HourlyChange <- Texas_long %>%
  group_by(source) %>% 
  arrange(datetime, .by_group = TRUE) %>%
  mutate(pct_change = (power/lag(power) - 1) * 100)  # percent change from the previous hour, within each source


Texas_long %>%
  filter(date == as.Date("2021-02-15")) %>%   # during the storm
  group_by(source) %>% 
  summarize(dailysum15 = sum(power)) %>%
  mutate(Percent = round(prop.table(dailysum15), digits = 3))

Texas_long %>%
  filter(date == as.Date("2021-02-25")) %>%   # after the storm
  group_by(source) %>%
  summarize(dailysum25 = sum(power)) %>%
  mutate(Percent = round(prop.table(dailysum25), digits = 3))
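One way to finish this comparison (a sketch, using the same two dates as above): put each source’s daily totals side by side and compute the percent change between February 15th and February 25th.

Texas_long %>%
  filter(date %in% as.Date(c("2021-02-15", "2021-02-25"))) %>%
  group_by(source, date) %>%
  summarize(daily_total = sum(power), .groups = "drop") %>%
  pivot_wider(names_from = date, values_from = daily_total) %>%  # one column per date
  mutate(pct_change = round((`2021-02-25` / `2021-02-15` - 1) * 100, 1))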

Moving Averages

To calculate a simple moving average, we can use the rollmean() function from the zoo package. This function takes an argument k, the integer width of the rolling window.

The code below calculates 3-, 7-, and 12-hour rolling averages of hourly power output from coal.

# install.packages("zoo")
library(zoo)

coal <- Texas_long %>%
  filter( source == "Coal") %>% # keep only coal observations
  arrange(datetime) %>%  # Start with Feb 1st
  mutate(source_03hr = rollmean(power, k = 3, fill = 0), # 3 hour average
         source_07hr = rollmean(power, k = 7, fill = 0), # 7 hour average
         source_12hr = rollmean(power, k = 12, fill = 0)) # 12 hour average
coal

This added three new columns of rolling averages of coal power output. I don’t need all of them for the analysis; I’m just showing how I did it.

Doing it again for natural gas and storing it as its own object named “naturalgas”:

naturalgas <- Texas_long %>%
  filter( source == "Natural Gas") %>%
  arrange(desc(datetime)) %>% 
  mutate(original = power,
         source_03hr = rollmean(power, k = 3, fill = 0),
         source_07hr = rollmean(power, k = 7, fill = 0),
         source_12hr = rollmean(power, k = 12, fill = 0))
naturalgas 

Now, graph the hourly megawatts from coal as columns and the 12-hour rolling average as a line in one graph:

coal %>%
  ggplot(aes(x = datetime, 
             y = power)) +
  geom_col(fill = "lightgray") +
  geom_line(aes(y = source_12hr), 
            color = "red") +  # color set outside aes() so the line is drawn red without a spurious legend
  theme_minimal() +
  labs(title = "Power output from Coal", 
       x = "",
       y = "Megawatts")

mov.avg <- Texas_long %>%
    arrange(datetime) %>% 
    group_by(source) %>% 
    mutate(source_03hr = rollmean(power, k = 3, fill = NA),    # mutate() keeps one row per hour
           source_07hr = rollmean(power, k = 7, fill = NA),    # within each source while adding
           source_12hr = rollmean(power, k = 12, fill = NA)) %>% # the rolling-average columns
    ungroup()
mov.avg
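As a sketch, the 12-hour moving averages can then be drawn for every source on one plot, which smooths out the hour-to-hour noise in the raw lines:

mov.avg %>%
  ggplot(aes(datetime, source_12hr, color = source)) +
  geom_line(na.rm = TRUE) +   # drop the NA values created by fill = NA at the window edges
  labs(title = "12-hour moving average of power output by source",
       x = "", y = "Megawatts") +
  theme_minimal()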

Recoding

The “thermal unit” category includes natural gas, coal, and nuclear power. Recode the sources using this definition, plus a “green” category (wind, water, and solar), just for fun.

Texas_long %>%
  mutate(
    Energy = case_when(
      source == "Wind" | source == "Water" | source =="Solar" ~ "Green",
      source == "Coal" | source == "Nuclear" | source == "Natural Gas" ~ "ThermalUnit") ) %>%
  group_by(Energy) %>%
  summarize(Megawatts = sum(power))
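A sketch of how the same recoding could support the communication plan: sum hourly output within each group and plot the two lines over time.

Texas_long %>%
  mutate(Energy = case_when(
    source %in% c("Wind", "Water", "Solar") ~ "Green",
    source %in% c("Coal", "Nuclear", "Natural Gas") ~ "ThermalUnit")) %>%
  group_by(Energy, datetime) %>%
  summarize(power = sum(power), .groups = "drop") %>%
  ggplot(aes(datetime, power, color = Energy)) +
  geom_line() +
  labs(title = "Hourly power output: thermal vs. green sources",
       x = "", y = "Megawatts") +
  theme_minimal()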

Correlation

cor(Texas_feb$Wind, Texas_feb$`Natural Gas`) # 2 variables
## [1] -0.7450777
# correlation matrix
Texas_feb %>%
    select(`Natural Gas`:Nuclear) %>%
    cor(use = "pairwise.complete.obs")
##             Natural Gas        Wind        Coal        Solar        Water
## Natural Gas   1.0000000 -0.74507769  0.54098879 -0.179418037  0.473690492
## Wind         -0.7450777  1.00000000 -0.53655719 -0.063957320 -0.276937349
## Coal          0.5409888 -0.53655719  1.00000000 -0.038273160  0.155804984
## Solar        -0.1794180 -0.06395732 -0.03827316  1.000000000  0.009598617
## Water         0.4736905 -0.27693735  0.15580498  0.009598617  1.000000000
## Nuclear      -0.3271192  0.40138951  0.14874660  0.034718663 -0.281910443
##                 Nuclear
## Natural Gas -0.32711919
## Wind         0.40138951
## Coal         0.14874660
## Solar        0.03471866
## Water       -0.28191044
## Nuclear      1.00000000

Discuss the correlations: which source goes up the most as another goes down? Wind and natural gas have the strongest negative correlation (about -0.75), meaning hours with less wind output tended to be hours with more natural gas output. Throw that into the discussion of the topic and the assumptions that were made at the time.
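As a sketch of one way to visualize that strongest negative relationship (wind vs. natural gas), assuming Texas_feb as built above:

ggplot(Texas_feb, aes(Wind, `Natural Gas`)) +
  geom_point(alpha = 0.4) +   # one point per hourly observation in February
  labs(title = "Hourly wind vs. natural gas output, February 2021",
       x = "Wind (MW)", y = "Natural Gas (MW)") +
  theme_minimal()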

Communicating Implications / Conclusion

In the early morning hours of Feb. 15, natural gas generation dropped 23% by 4 a.m., a total of about 10,000 megawatts on a system that was running about 65,000 megawatts in total at midnight. That morning ERCOT started rolling blackouts.
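A quick check of these figures against the data (a sketch, assuming Texas_feb as built above; the hours used here are as reported in the dataset, so adjust if your datetimes are in a different time zone):

Texas_feb %>%
  filter(date == as.Date("2021-02-15"),
         hour(datetime) %in% c(0, 4)) %>%   # midnight and 4 a.m.; hour() is from lubridate
  select(datetime, `Natural Gas`) %>%
  arrange(datetime) %>%
  mutate(change_mw = `Natural Gas` - lag(`Natural Gas`),
         pct_change = round((`Natural Gas` / lag(`Natural Gas`) - 1) * 100, 1))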

So, it’s true that wind plays a significant role in Texas’ power supply — the state actually generates more wind energy than any other state in the nation — but there’s no indication that wind energy was the primary cause of the power outages in Texas.

Add more discussion here that ties your graphs and tables back to your argument.

Sources

Energy Information Administration (EIA) The Hourly Electric Grid Monitor

“Wind Turbines Didn’t Cause Texas Energy Crisis” FactCheck.org

“How Fox News, far-right TV blamed green energy for Texas’ power outages” Politifact

Methodology

#' Texas's Energy Output
#'
#' Energy output (megawatthours) from each source for Texas during the winter storm.
#' Data are from the U.S. Energy Information Administration
#' You will need YOUR OWN EIA API key to download the data through their API
#' Get an API key here: https://www.eia.gov/developer/
#' Check out the Hourly Electric Grid Monitor here: https://www.eia.gov/electricity/gridmonitor/dashboard/electric_overview/US48/US48
#' @format A data frame with 5382 rows and 7 variables:
#' \describe{
#'   \item{date}{Date and time of measurement}
#'   \item{Natural Gas}{Supply of energy in megawatthours from Natural Gas}
#'   \item{Wind}{Supply of energy in megawatthours from Wind}
#'   \item{Coal}{Supply of energy in megawatthours from Coal}
#'   \item{Solar}{Supply of energy in megawatthours from Solar}
#'   \item{Water}{Supply of energy in megawatthours from Water}
#'   \item{Nuclear}{Supply of energy in megawatthours from Nuclear}

#' }
#' @source \url{https://www.eia.gov/}
library("eia") # works with API for downloading data from the EIA

## reproducible way
eia_set_key("YOUR_EIA_API_KEY")  # replace with your own key from https://www.eia.gov/developer/

# Prep the responses
base_url = 'http://api.eia.gov/series/?series_id='

variables <- c("EBA.TEX-ALL.NG.NG.H", 
               "EBA.TEX-ALL.NG.WND.H", 
               "EBA.TEX-ALL.NG.COL.H", 
               "EBA.TEX-ALL.NG.SUN.H", 
               "EBA.TEX-ALL.NG.WAT.H", 
               "EBA.TEX-ALL.NG.NUC.H")

series_list <- eia_series(variables, start = 2021)  # one row per series, with the hourly data in a list-column
# Downloaded on August 13th. Last day of data included.

series_list$data[[3]] # 3,438 x 5 tibble

# unnest the list-column, keep the key variables, and reshape to one column per series
eia_wide <- unnest(series_list, cols = data) %>%
  select(date, series_id, value) %>%
  pivot_wider(names_from = series_id,
              values_from = value)


storm <- eia_wide %>% rename("Natural Gas" = "EBA.TEX-ALL.NG.NG.H",
                            "Wind" = "EBA.TEX-ALL.NG.WND.H",
                            "Coal" = "EBA.TEX-ALL.NG.COL.H",
                            "Solar" = "EBA.TEX-ALL.NG.SUN.H",
                            "Water" = "EBA.TEX-ALL.NG.WAT.H",
                            "Nuclear" =  "EBA.TEX-ALL.NG.NUC.H")


load("C:/Users/aleaw/OneDrive/Desktop/cuppackage/data/storm.rda")

Texas_wide <- storm

Texas_feb <- Texas_wide %>% 
  arrange(date) %>%
  filter(date >= as.Date("2021-02-01") & date <= as.Date("2021-02-28") ) %>%
  rename(datetime = date) %>%
  mutate(date = as_date(datetime),      # a new column appeared! stored as date
         time = hms::as_hms(datetime) ) # stored as S3:hms (format for time)

Texas_long <- Texas_wide %>% 
  arrange(date) %>%
  filter(date >= as.Date("2021-02-01") & date <= as.Date("2021-02-28") ) %>%
  pivot_longer(!date,  
               names_to = "source",
               values_to = "power") %>%
  rename(datetime = date) %>%
  mutate(date = as_date(datetime),      # a new column appeared! stored as date
         time = hms::as_hms(datetime) ) # stored as S3:hms (format for time)

str(Texas_long)

load("C:/Users/aleaw/OneDrive/Desktop/cuppackage/data/texastemperature.rda")
as_tibble(texastemperature)
# date stored as character, temp as double
# 57 rows, 2 columns

temp <- as_tibble(texastemperature) %>%
  mutate(date = mdy(date)) %>%      # date now stored as date instead of character
  filter(date >= as.Date("2021-02-01") & date <= as.Date("2021-02-28") )  # filter Feb.
temp

temp %>%
  ggplot() +
  geom_line(aes(date, temp_daily_avg))