R Project

The setup chunk of this report is typical of R Markdown documents, which are used to create dynamic reports that combine text, code, and the output of code chunks.

Here’s what this specific code does:

{r setup, include=FALSE}: This is called a code chunk header in R Markdown. It specifies the settings for a code chunk that follows. In this case:
    setup is the name of the code chunk. It's a user-defined label that you can use to refer to this code chunk elsewhere in the document.
    include=FALSE means the chunk still runs, but neither its code nor its output appears in the final output document. This is often used for chunks that contain setup or configuration code that is not meant to be visible in the document.

knitr::opts_chunk$set(echo = TRUE): This line sets a default option for the code chunks that follow in the document, not just for the setup chunk itself. Specifically, it sets the echo option to TRUE. In R Markdown and knitr, echo = TRUE means that the source code of each chunk is displayed in the final output document, along with its output (if any), unless a chunk overrides the option in its own header. This is useful for showing the code and its results in the generated report.

So, the code chunk labeled setup is configuring the behavior of code chunks that follow in the document, ensuring that the code within those chunks will be displayed in the output. It’s a common setup when you want to display code and its results in an R Markdown document.
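
As a minimal sketch of how this default behaves (assuming the knitr package is installed), the option can be set and then read back with knitr::opts_chunk$get(). The variable name echo_default is only illustrative.

```r
# Set the default for all later chunks: show the source code in the output.
knitr::opts_chunk$set(echo = TRUE)

# Read the current default back to confirm that it took effect.
echo_default <- knitr::opts_chunk$get("echo")
print(echo_default)
```

An individual chunk can still override this default in its own header, for example with echo=FALSE.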

The provided R code performs the following actions:

{r Load Data}: This is a code chunk header in an R Markdown document. It serves as a label for this code chunk, allowing you to refer to it elsewhere in the document.

polldata <- read.csv("https://raw.githubusercontent.com/jewelercart/Data606_2023/main/president_pollsdatasets.csv"): This line of code reads a CSV file from a URL and stores its contents in a variable called polldata. Here's a breakdown of what this line does:
    read.csv(): This is a function in R used to read CSV (Comma-Separated Values) files.
    "https://raw.githubusercontent.com/jewelercart/Data606_2023/main/president_pollsdatasets.csv": This is the URL of the CSV file you are reading. It appears to be a dataset related to presidential polls.
    polldata: This is the name of the variable where the data from the CSV file will be stored.

head(polldata): This line of code displays the first few rows of the polldata dataset using the head() function. This is often done to quickly inspect the data and get a sense of its structure and content.

So, in summary, the code reads a CSV file from a URL, stores it in the polldata variable, and then displays the first few rows of the dataset in the output. This is a common sequence of steps when working with data in R, particularly in the context of data analysis or data exploration.

Load data into R

polldata <- read.csv("https://raw.githubusercontent.com/jewelercart/Data606_2023/main/president_pollsdatasets.csv")
head(polldata)
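
To try the same steps without a network connection, read.csv() can also parse CSV text passed through its text argument. The sketch below uses a small made-up table (the rows and values are invented for illustration and are not taken from the poll dataset) and inspects it with head(), dim(), and str().

```r
# Hypothetical CSV content; these rows are invented for illustration only.
csv_text <- "poll_id,candidate_name,pct
1,Joe Biden,45.0
1,Donald Trump,43.5
2,Joe Biden,47.2"

# read.csv() accepts literal CSV text through its `text` argument.
toy_polls <- read.csv(text = csv_text)

head(toy_polls, n = 2)  # first two rows
dim(toy_polls)          # number of rows and columns
str(toy_polls)          # column names and types
```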

The next code chunk loads two R packages, tidyverse and ggplot2, so that their functions are available to the rest of the document.

Here’s what these lines do:

library(tidyverse): This line loads the tidyverse package. The tidyverse is a collection of R packages (e.g., dplyr, ggplot2, readr) that are commonly used for data manipulation, visualization, and analysis. By loading this package, you gain access to functions and tools provided by the tidyverse ecosystem.

library(ggplot2): This line loads the ggplot2 package specifically. ggplot2 is a popular package for creating data visualizations and plots in a flexible and customizable way. Because ggplot2 is one of the core tidyverse packages, library(tidyverse) has already attached it, so this second call is redundant but harmless.

In an R Markdown document, it’s common to place these library() calls in one of the first code chunks so that the necessary packages are available for all subsequent code and visualizations. Both packages are widely used in the R data analysis and visualization community.
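
As a small sketch (assuming the tidyverse is installed), the search path can be checked after attaching the tidyverse to see that ggplot2 and dplyr are already available. The variable names are illustrative.

```r
# Attach the tidyverse without printing the startup banner.
suppressPackageStartupMessages(library(tidyverse))

# search() lists the attached packages; core tidyverse packages appear in it.
ggplot2_attached <- "package:ggplot2" %in% search()
dplyr_attached   <- "package:dplyr" %in% search()

print(c(ggplot2 = ggplot2_attached, dplyr = dplyr_attached))
```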

Take subset of data

The packages used for subsetting and plotting are loaded first:

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggplot2)

The provided R code appears to perform the following actions:

pollsubset = select(polldata, poll_id, pollster_id, start_date, end_date, sample_size, office_type, party, answer, candidate_name, pct): This line of code uses the select() function from the dplyr package (which is part of the tidyverse ecosystem) to create a new data frame called pollsubset. The select() function is used to choose and keep specific columns from an existing data frame (polldata in this case). It lists the columns to include in the new data frame.
    polldata: This is the data frame created earlier by read.csv().
    The columns specified in the function (e.g., poll_id, pollster_id, start_date, etc.) are the columns that will be included in the pollsubset data frame.

head(pollsubset): This line of code displays the first few rows of the pollsubset data frame using the head() function. This is typically done to inspect the selected subset of data and see what it looks like.

In summary, the code selects specific columns from an existing data frame (polldata) and creates a new data frame called pollsubset containing only those selected columns. Then, it displays the first few rows of the pollsubset data frame to provide an initial view of the data. This is a common data manipulation step in R, often used to focus on specific columns of interest for analysis or visualization.

pollsubset = select(polldata, poll_id, pollster_id, start_date, end_date, sample_size, office_type, party, answer, candidate_name, pct)
head(pollsubset)
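
The same column selection can be tried on a small stand-in table. In this sketch, toy_polls and its values are hypothetical, and the notes column exists only to show that unlisted columns are dropped.

```r
suppressPackageStartupMessages(library(dplyr))

# Hypothetical stand-in for polldata; the values are invented.
toy_polls <- data.frame(
  poll_id        = c(1, 1, 2),
  party          = c("DEM", "REP", "DEM"),
  candidate_name = c("Joe Biden", "Donald Trump", "Joe Biden"),
  pct            = c(45.0, 43.5, 47.2),
  notes          = c("a", "b", "c")  # a column we do not need
)

# Keep only the listed columns, in the order given.
toy_subset <- select(toy_polls, poll_id, party, pct)
names(toy_subset)
```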

The provided R code appears to create a bar plot using the ggplot2 package to visualize data from the pollsubset data frame. Here’s what the code does step by step:

{r plot}: This is a code chunk header in an R Markdown document, and it labels this code chunk as "plot," allowing you to reference it elsewhere in the document.

ggplot(data = pollsubset, mapping = aes(x = party, fill = party)): This line initiates a ggplot object. It specifies the data source as the pollsubset data frame and defines the aesthetics (aes) for the plot:
    x = party: This sets the x-axis variable to be the "party" column from the pollsubset data frame. The values in this column will be used for the x-axis.
    fill = party: This sets the fill aesthetic, which is used to color the bars in the plot based on the "party" column. Each unique value in the "party" column will be represented by a differently colored bar.

geom_bar(): This function adds a bar geometry layer to the plot. It creates a bar chart where the height of each bar represents the count of observations for each unique value in the "party" column.

In summary, this R code generates a bar plot using ggplot2, with the x-axis representing different political parties (from the “party” column in the pollsubset data frame) and the bars colored based on the political party. The height of each bar indicates the count of observations for each party in the data. This code is used for data visualization, specifically for visualizing the distribution of parties in the dataset.

ggplot(data = pollsubset, mapping = aes(x = party, fill = party)) +
  geom_bar()
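
To see what geom_bar() computes, the sketch below builds the same kind of plot from a hypothetical four-row table and reads the bar heights back with layer_data() from ggplot2. The toy values are invented.

```r
library(ggplot2)

# Hypothetical data: three rows for one party, one row for another.
toy_polls <- data.frame(party = c("DEM", "REP", "DEM", "DEM"))

# geom_bar() counts rows per party; no y variable is supplied.
p <- ggplot(data = toy_polls, mapping = aes(x = party, fill = party)) +
  geom_bar()

# layer_data() returns the data computed for the first layer,
# including the `count` column that determines the bar heights.
bar_counts <- layer_data(p)$count
print(bar_counts)
```

The counts sum to the number of rows in the table, which confirms that each bar's height is a row count and not a value taken from the data.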

The provided R code uses the %>% operator (pipe operator) and the dplyr package to perform grouping and tallying operations on the pollsubset data frame. Here’s what the code does step by step:

{r }: This is a code chunk header in an R Markdown document, and it doesn't have a specific label or name associated with it.

pollsubset %>%: The %>% operator, often referred to as the pipe operator, is used to pass the pollsubset data frame to the next operation or function.

group_by(party): This is a dplyr function that groups the data based on the values in the "party" column of the pollsubset data frame. It essentially prepares the data for aggregation by creating groups for each unique value in the "party" column.

tally(): This is another dplyr function that is applied after grouping. It counts the number of observations (rows) within each group created by the group_by() function. In this case, it counts the number of rows for each political party, effectively tallying the number of occurrences of each party in the dataset.

So, the code takes the pollsubset data frame, groups it by the “party” column, and then calculates the count of observations (tallies) for each unique political party in the dataset. The result will be a table that shows the count of occurrences for each party.

pollsubset %>% group_by(party) %>% tally()
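
A minimal sketch of the same grouping and tallying on a hypothetical table follows. It also shows dplyr's count(), which is shorthand for group_by() followed by tally(). The toy values are invented.

```r
suppressPackageStartupMessages(library(dplyr))

# Hypothetical data: three rows for one party, one row for another.
toy_polls <- data.frame(party = c("DEM", "REP", "DEM", "DEM"))

# Group by party, then count the rows in each group.
tallied <- toy_polls %>%
  group_by(party) %>%
  tally()

# count() performs the grouping and the tally in one step.
counted <- count(toy_polls, party)

print(tallied)
print(counted)
```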

The provided R code does the following:

{r}: This is a code chunk header in an R Markdown document, and it doesn't have a specific label or name associated with it.

biden_trump_data <- filter(pollsubset, candidate_name == 'Joe Biden' | candidate_name =='Donald Trump'): This line of code uses the filter() function from the dplyr package to create a new data frame called biden_trump_data. The purpose of this line is to filter the pollsubset data frame and retain only rows where the "candidate_name" is either 'Joe Biden' or 'Donald Trump'. The | operator is used for the logical OR condition.

ggplot(data = biden_trump_data, aes(x= candidate_name , fill=candidate_name)): This line initiates a ggplot object using the biden_trump_data data frame as the data source and specifies aesthetics (aes) for the plot:
    x = candidate_name: This sets the x-axis variable to be the "candidate_name" column from the biden_trump_data data frame, which will represent the candidate names ('Joe Biden' and 'Donald Trump') on the x-axis.
    fill = candidate_name: This sets the fill aesthetic, which will be used to color the bars in the plot based on the "candidate_name" column. Each candidate's bars will be filled with a different color.

geom_bar(): This function adds a bar geometry layer to the plot. It creates a bar chart where the height of each bar represents the count of observations for each candidate (either 'Joe Biden' or 'Donald Trump') in the biden_trump_data data frame.

In summary, this R code generates a bar plot using ggplot2 that compares the number of rows for ‘Joe Biden’ and ‘Donald Trump’ in the biden_trump_data data frame. The x-axis represents the candidate names, and the bars are colored based on the candidate name. Because geom_bar() counts rows, the plot shows how often each candidate appears in the poll data; it does not show votes or the level of support recorded in the pct column.

biden_trump_data <- filter(pollsubset, candidate_name == 'Joe Biden' | candidate_name == 'Donald Trump')

ggplot(data = biden_trump_data, aes(x = candidate_name, fill = candidate_name)) +
  geom_bar()
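
As a sketch of the filtering step on a hypothetical table, the OR condition can also be written with %in%, which scales better when more names are added. The rows, including the "Other Candidate" entry, are invented for illustration.

```r
suppressPackageStartupMessages(library(dplyr))

# Hypothetical data; the values are invented.
toy_polls <- data.frame(
  candidate_name = c("Joe Biden", "Donald Trump", "Other Candidate", "Joe Biden"),
  pct            = c(45.0, 43.5, 6.0, 47.2)
)

# Keep rows that match either name, using the logical OR operator.
with_or <- filter(toy_polls,
                  candidate_name == "Joe Biden" | candidate_name == "Donald Trump")

# Equivalent filter using %in% with a vector of names.
with_in <- filter(toy_polls,
                  candidate_name %in% c("Joe Biden", "Donald Trump"))

print(nrow(with_or))
print(isTRUE(all.equal(with_or, with_in)))
```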

Conclusion

Data hosted in a repository such as GitHub can be read directly into R with the function read.csv(). This makes the output reproducible, because anyone can run the code without setting up a local data directory. The skill learned through this assignment will be useful in my future data science work.

Frederick Jones

2023-09-10