MATH2405 TP3, 2021

Setup

# Load in the Packages 

#library(readr) # Useful for importing data
# library(foreign) # Useful for importing SPSS, SAS, STATA etc. data files
#library(rvest) # Useful for scraping HTML data
#library(knitr) # Useful for creating nice tables

library(dplyr)    # For data manipulation

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(magrittr) # For pipes 

# Clear the workspace
rm(list = ls())

Locate Data

Locate an open source of data from the web. This can be tabular or spreadsheet data (i.e., .txt, .csv, .xls, .xlsx files), data sets from other statistical software (i.e., SPSS, SAS, Stata etc.), or HTML table data that you scrape.

Some sources for open data are provided below, but I encourage you to find others:

As a minimum, your data set should include:

one numeric variable.
one qualitative (categorical) variable.

There is no limit on the number of observations and number of variables. But keep in mind that when you have a very large data set, it will increase your reading time. A clear description of data and its source should be provided in this section.

Read/Import Data

Read/Import the data into R, then save it as a data frame. You can use Base R functions or the readr, xlsx, readxl, foreign, or rvest packages for this purpose. You must also provide the R code, with outputs (i.e. head() of the data set), that you used to import/read/scrape the data set. You can provide the R code with outputs using R chunks like this:

# This is an R chunk for importing the data. Provide your R codes here:

# Perform the import of the data with no special conversions applied (for example, no "stringsAsFactors = TRUE")

# Define your URL
dataUrl <- "https://raw.githubusercontent.com/fivethirtyeight/covid-19-polls/master/covid_concern_polls.csv"

# Read in the csv file, replacing any blank values with NA
dataImp <- read.csv(dataUrl, 
              header = TRUE,
              sep = ",", 
              na.strings = "")

# Verify that the data is stored in a Data Frame
class(dataImp)

## [1] "data.frame"

# Display the first set of rows 
head(dataImp, n = 10)

##    start_date   end_date                 pollster         sponsor sample_size
## 1  2020-01-27 2020-01-29          Morning Consult            <NA>        2202
## 2  2020-01-31 2020-02-02          Morning Consult            <NA>        2202
## 3  2020-02-02 2020-02-04                   YouGov       Economist        1500
## 4  2020-02-07 2020-02-09          Morning Consult            <NA>        2200
## 5  2020-02-07 2020-02-09                   YouGov Huffington Post        1000
## 6  2020-02-09 2020-02-11                   YouGov       Economist        1500
## 7  2020-02-13 2020-02-16                  AP-NORC            <NA>        1074
## 8  2020-02-13 2020-02-18 Kaiser Family Foundation            <NA>        1207
## 9  2020-02-13 2020-02-18 Kaiser Family Foundation            <NA>        1207
## 10 2020-02-16 2020-02-18                   YouGov       Economist        1500
##    population party          subject tracking
## 1           a   all  concern-economy    FALSE
## 2           a   all  concern-economy    FALSE
## 3           a   all concern-infected    FALSE
## 4           a   all  concern-economy    FALSE
## 5           a   all concern-infected    FALSE
## 6           a   all concern-infected    FALSE
## 7           a   all concern-infected    FALSE
## 8           a   all  concern-economy    FALSE
## 9           a   all concern-infected    FALSE
## 10          a   all concern-infected    FALSE
##                                                                                                                                                             text
## 1                                                                             How concerned are you that the coronavirus will impact the following? U.S. economy
## 2                                                                             How concerned are you that the coronavirus will impact the following? U.S. economy
## 3  Taking into consideration both your risk of contracting it and the seriousness of the illness, how worried are you personally about experiencing coronavirus?
## 4                                                                             How concerned are you that the coronavirus will impact the following? U.S. economy
## 5                                                                        How concerned are you that you or someone in your family will contract the coronavirus?
## 6  Taking into consideration both your risk of contracting it and the seriousness of the illness, how worried are you personally about experiencing coronavirus?
## 7                                                                   How worried are you about you or someone in your family being infected with the Coronavirus?
## 8                                                        How concerned, if at all, are you that the Coronavirus will have a negative impact on the U.S. economy?
## 9                                                       How concerned, if at all, are you that you or someone in your family will get sick from the Coronavirus?
## 10 Taking into consideration both your risk of contracting it and the seriousness of the illness, how worried are you personally about experiencing coronavirus?
##    very somewhat not_very not_at_all
## 1    19       33       23         11
## 2    26       32       25          7
## 3    13       26       43         18
## 4    23       32       24          9
## 5    11       24       33         20
## 6    11       28       39         22
## 7    22       23       37         19
## 8    22       35       28         15
## 9    22       21       33         23
## 10   10       28       42         19
##                                                                                                      url
## 1  https://morningconsult.com/wp-content/uploads/2020/02/200167_crosstabs_CORONAVIRUS_Adults_v2_JB-1.pdf
## 2  https://morningconsult.com/wp-content/uploads/2020/02/200191_crosstabs_CORONAVIRUS_Adults_v2_JB-1.pdf
## 3            https://d25d2506sfb94s.cloudfront.net/cumulus_uploads/document/73jqd6u5mv/econTabReport.pdf
## 4    https://morningconsult.com/wp-content/uploads/2020/02/200214_crosstabs_CORONAVIRUS_Adults_v4_JB.pdf
## 5                https://projects.fivethirtyeight.com/polls/20200318_National_HPYouGov_coronavirus_1.pdf
## 6            https://d25d2506sfb94s.cloudfront.net/cumulus_uploads/document/79zfxkws33/econTabReport.pdf
## 7                        http://www.apnorc.org/PDFs/Omnibus%202020/February/topline_omni_feb_full_v3.pdf
## 8                         http://files.kff.org/attachment/Topline-KFF-Health-Tracking-Poll-February-2020
## 9                         http://files.kff.org/attachment/Topline-KFF-Health-Tracking-Poll-February-2020
## 10           https://d25d2506sfb94s.cloudfront.net/cumulus_uploads/document/m3wzkd0n59/econTabReport.pdf

# Display the last set of rows 
tail(dataImp, n = 5)

##     start_date   end_date                           pollster
## 674 2021-04-06 2021-04-06                  Data for Progress
## 675 2021-04-09 2021-04-11                             LÃ©ger
## 676 2021-04-10 2021-04-13                             YouGov
## 677 2021-04-17 2021-04-20                             YouGov
## 678 2021-03-08 2021-03-30 Public Religion Research Institute
##                              sponsor sample_size population party
## 674                             <NA>        1244          a   all
## 675 Association for Canadian Studies        1002          a   all
## 676                        Economist        1500          a   all
## 677                        Economist        1500          a   all
## 678            Interfaith Youth Core        5625          a   all
##              subject tracking
## 674 concern-infected    FALSE
## 675 concern-infected    FALSE
## 676 concern-infected    FALSE
## 677 concern-infected    FALSE
## 678 concern-infected    FALSE
##                                                                                                                                                           text
## 674                                                                                  How worried are you personally about experiencing coronavirus (COVID-19)?
## 675                                                                                       Are you personally afraid of contracting the COVID-19 (Coronavirus)?
## 676 Taking into consideration both your risk of contracting it and the seriousness of the illness, how worried are you personally about experiencing COVID-19?
## 677 Taking into consideration both your risk of contracting it and the seriousness of the illness, how worried are you personally about experiencing COVID-19?
## 678                                     How worried, if at all, are you about each of the following? You or someone in your family will get sick with COVID-19
##      very somewhat not_very not_at_all
## 674 22.34    34.16    22.65      17.25
## 675 20.00    30.00    22.00      20.00
## 676 16.00    34.00    29.00      20.00
## 677 20.00    31.00    31.00      17.00
## 678 23.00    39.00    27.00      11.00
##                                                                                                                                       url
## 674                                        https://docs.google.com/spreadsheets/d/1DzzQ2QJL7UdG1Cgeu9EaNXEqQr2pLBOqeiS7E6FlZKM/edit#gid=0
## 675 https://2g2ckk18vixp3neolz4b6605-wpengine.netdna-ssl.com/wp-content/uploads/2021/04/Legers-North-American-Tracker-April-12th-2021.pdf
## 676                                                                              https://docs.cdn.yougov.com/wvjmyy0dlk/econTabReport.pdf
## 677                                                                              https://docs.cdn.yougov.com/e89wuts0a9/econTabReport.pdf
## 678      https://www.prri.org/wp-content/uploads/2021/04/Topline-IFYC-PRRI-Survey-on-Religion-and-COVID-19-Vaccine-Trust-4-22-release.pdf
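The pollster value "LÃ©ger" in the tail() output above looks like text whose UTF-8 bytes were interpreted as a single-byte encoding. The following is a minimal base R sketch of one possible repair. The vector `garbled` is toy data invented for illustration, and the approach has not been checked against every corrupted value in this file.

```r
# Toy values (invented): "L\u00c3\u00a9ger" is how "Léger" appears when its
# UTF-8 bytes are mistaken for Latin-1 characters.
garbled <- c("L\u00c3\u00a9ger", "YouGov")

# Step 1: map each character back to the single byte it was misread from.
repaired <- iconv(garbled, from = "UTF-8", to = "latin1")

# Step 2: declare that those bytes are, in fact, UTF-8.
Encoding(repaired) <- "UTF-8"

repaired
```

It may also be worth trying `fileEncoding = "UTF-8"` (or `encoding = "UTF-8"`) in the `read.csv()` call so that the text arrives intact; whether that is sufficient depends on the platform and locale.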

Explain everything that you do in this step using regular text outside the R chunks. You may use bulleted lists like this:

Steps taken:

Identify a suitable dataset for the assignment.
From the identified options, visually inspect the data directly, either online or by downloading it and viewing it in MS Excel.
Locate the various URLs required.
Begin coding the R Markdown file.

Data description

Provide a clear description of the data and its source (i.e. URL of the web site). Provide variable descriptions.

Website: https://projects.fivethirtyeight.com/coronavirus-polls/ (headed “How Americans View Biden’s Response To The Coronavirus Crisis”). The website presents a number of data visualisations of survey results on the American population's concerns about coronavirus.

There are 6 datasets available directly from this website which can be downloaded as a .zip file or individually.

These datasets can be located here: https://github.com/fivethirtyeight/covid-19-polls

The dataset I have chosen is “covid_concern_polls.csv”, which combines 2 different survey subjects (column “subject”):

concern-economy: How concerned the survey participant is about how coronavirus will impact the US economy.
concern-infected: How concerned the survey participant is about coronavirus impacting them directly.

Excluding the header row, there are 678 rows of data with 15 variables.

The column headings are relatively easy to interpret.

Variable Descriptions: (Sourced from the accompanying README.md from the downloaded zip file)

start_date: Start date of the poll.
end_date: End date of the poll.
pollster: Organisation that conducted the poll.
sponsor: Sponsor for the poll, not every poll has a sponsor.
sample_size: Size of polling sample.
population: ‘A’ for adults, ‘RV’ for registered voters, ‘LV’ for likely voters (stored in lower case in the file).
party: Which party the respondents belong to.
subject: Subject of poll question.
tracking: ‘TRUE’ if the poll is a tracking poll, meaning that the pollster is releasing data with overlapping samples.
text: Text of the poll question.
Poll answers: 4 options represented across 4 columns (“very”, “somewhat”, “not_very”, “not_at_all”).
url: Link to the poll.

The text of the survey questions (column “text”) is not always identical; however, the column “subject” gives the high-level intention of each question.

Dataset Source: https://raw.githubusercontent.com/fivethirtyeight/covid-19-polls/master/covid_concern_polls.csv

Inspect dataset and variables

Inspect the data frame and variables using R functions. You should:

check the dimensions of the data frame.
check the data types (i.e., character, numeric, integer, factor, and logical) of the variables in the data set. If variables are not in the correct data type, apply proper type conversions.
check the levels of factor variables, rename/rearrange them if required.
check the column names in the data frame, rename them if required.

Provide your R code with outputs and explain everything that you do in this step.

# This is a chunk where you inspect the types of variables, data structures, check the attributes in the data.

# Check the dimensions of the dataframe
dim(dataImp)

## [1] 678  15

# Check the data types of each column
str(dataImp)

## 'data.frame':    678 obs. of  15 variables:
##  $ start_date : chr  "2020-01-27" "2020-01-31" "2020-02-02" "2020-02-07" ...
##  $ end_date   : chr  "2020-01-29" "2020-02-02" "2020-02-04" "2020-02-09" ...
##  $ pollster   : chr  "Morning Consult" "Morning Consult" "YouGov" "Morning Consult" ...
##  $ sponsor    : chr  NA NA "Economist" NA ...
##  $ sample_size: int  2202 2202 1500 2200 1000 1500 1074 1207 1207 1500 ...
##  $ population : chr  "a" "a" "a" "a" ...
##  $ party      : chr  "all" "all" "all" "all" ...
##  $ subject    : chr  "concern-economy" "concern-economy" "concern-infected" "concern-economy" ...
##  $ tracking   : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ text       : chr  "How concerned are you that the coronavirus will impact the following? U.S. economy" "How concerned are you that the coronavirus will impact the following? U.S. economy" "Taking into consideration both your risk of contracting it and the seriousness of the illness, how worried are "| __truncated__ "How concerned are you that the coronavirus will impact the following? U.S. economy" ...
##  $ very       : num  19 26 13 23 11 11 22 22 22 10 ...
##  $ somewhat   : num  33 32 26 32 24 28 23 35 21 28 ...
##  $ not_very   : num  23 25 43 24 33 39 37 28 33 42 ...
##  $ not_at_all : num  11 7 18 9 20 22 19 15 23 19 ...
##  $ url        : chr  "https://morningconsult.com/wp-content/uploads/2020/02/200167_crosstabs_CORONAVIRUS_Adults_v2_JB-1.pdf" "https://morningconsult.com/wp-content/uploads/2020/02/200191_crosstabs_CORONAVIRUS_Adults_v2_JB-1.pdf" "https://d25d2506sfb94s.cloudfront.net/cumulus_uploads/document/73jqd6u5mv/econTabReport.pdf" "https://morningconsult.com/wp-content/uploads/2020/02/200214_crosstabs_CORONAVIRUS_Adults_v4_JB.pdf" ...

# Columns start_date and end_date are stored as character and need to be converted to proper Date variables
dataImp$start_date <- as.Date(dataImp$start_date)
dataImp$end_date <- as.Date(dataImp$end_date)

# Check if the conversion of the dates is applied correctly
str(dataImp)

## 'data.frame':    678 obs. of  15 variables:
##  $ start_date : Date, format: "2020-01-27" "2020-01-31" ...
##  $ end_date   : Date, format: "2020-01-29" "2020-02-02" ...
##  $ pollster   : chr  "Morning Consult" "Morning Consult" "YouGov" "Morning Consult" ...
##  $ sponsor    : chr  NA NA "Economist" NA ...
##  $ sample_size: int  2202 2202 1500 2200 1000 1500 1074 1207 1207 1500 ...
##  $ population : chr  "a" "a" "a" "a" ...
##  $ party      : chr  "all" "all" "all" "all" ...
##  $ subject    : chr  "concern-economy" "concern-economy" "concern-infected" "concern-economy" ...
##  $ tracking   : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ text       : chr  "How concerned are you that the coronavirus will impact the following? U.S. economy" "How concerned are you that the coronavirus will impact the following? U.S. economy" "Taking into consideration both your risk of contracting it and the seriousness of the illness, how worried are "| __truncated__ "How concerned are you that the coronavirus will impact the following? U.S. economy" ...
##  $ very       : num  19 26 13 23 11 11 22 22 22 10 ...
##  $ somewhat   : num  33 32 26 32 24 28 23 35 21 28 ...
##  $ not_very   : num  23 25 43 24 33 39 37 28 33 42 ...
##  $ not_at_all : num  11 7 18 9 20 22 19 15 23 19 ...
##  $ url        : chr  "https://morningconsult.com/wp-content/uploads/2020/02/200167_crosstabs_CORONAVIRUS_Adults_v2_JB-1.pdf" "https://morningconsult.com/wp-content/uploads/2020/02/200191_crosstabs_CORONAVIRUS_Adults_v2_JB-1.pdf" "https://d25d2506sfb94s.cloudfront.net/cumulus_uploads/document/73jqd6u5mv/econTabReport.pdf" "https://morningconsult.com/wp-content/uploads/2020/02/200214_crosstabs_CORONAVIRUS_Adults_v4_JB.pdf" ...

# Columns: pollster, sponsor, population, party, subject, text and url...can all be converted to factors
dataImp$pollster   <- as.factor(dataImp$pollster)
dataImp$sponsor    <- as.factor(dataImp$sponsor)
dataImp$population <- as.factor(dataImp$population)
dataImp$party      <- as.factor(dataImp$party)
dataImp$subject    <- as.factor(dataImp$subject)
dataImp$text       <- as.factor(dataImp$text)
dataImp$url        <- as.factor(dataImp$url)

# Check if the conversion of the column to factors is applied correctly
str(dataImp)

## 'data.frame':    678 obs. of  15 variables:
##  $ start_date : Date, format: "2020-01-27" "2020-01-31" ...
##  $ end_date   : Date, format: "2020-01-29" "2020-02-02" ...
##  $ pollster   : Factor w/ 41 levels "Ã\230ptimus",..: 24 24 41 24 41 41 3 18 18 41 ...
##  $ sponsor    : Factor w/ 36 levels "ABC News","American Enterprise Institute",..: NA NA 12 NA 17 12 NA NA NA 12 ...
##  $ sample_size: int  2202 2202 1500 2200 1000 1500 1074 1207 1207 1500 ...
##  $ population : Factor w/ 3 levels "a","lv","rv": 1 1 1 1 1 1 1 1 1 1 ...
##  $ party      : Factor w/ 1 level "all": 1 1 1 1 1 1 1 1 1 1 ...
##  $ subject    : Factor w/ 2 levels "concern-economy",..: 1 1 2 1 2 2 2 1 2 2 ...
##  $ tracking   : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ text       : Factor w/ 90 levels "And thinking now about the coronavirus outbreak, how worried are you that you or someone in your family will co"| __truncated__,..: 28 28 83 28 31 83 52 37 38 83 ...
##  $ very       : num  19 26 13 23 11 11 22 22 22 10 ...
##  $ somewhat   : num  33 32 26 32 24 28 23 35 21 28 ...
##  $ not_very   : num  23 25 43 24 33 39 37 28 33 42 ...
##  $ not_at_all : num  11 7 18 9 20 22 19 15 23 19 ...
##  $ url        : Factor w/ 511 levels "http://apnorc.org/PDFs/AP-NORC%20June%202020/Topline_covid.pdf",..: 270 271 87 272 354 88 8 4 4 89 ...

# Check the factor values

# Checking Factors for column 'pollster'
# levels(dataImp$pollster) # -----------------------Output is Excessive

# For pollster there are a few values which are corrupted by special characters
# # Example: "Ã˜ptimus", "LÃ©ger" and "Firehouse Strategies/Ã˜ptimus"
# #           After investigation, located replacement values
# Assigning a name by position renames that level in place, so the
# replacement names do not need to be appended to the levels first
levels(dataImp$pollster)[1] <- "Optimus"
levels(dataImp$pollster)[10] <- "Firehouse Strategies/Optimus"
levels(dataImp$pollster)[19] <- "Léger"

# Check the changed levels for column 'pollster'
# levels(dataImp$pollster) # -----------------------Output is Excessive

# Levels for Factor 'pollster' have corrected descriptions now

# Checking Factors for column 'sponsor'
# levels(dataImp$sponsor) # -----------------------Output is Excessive

# 2 levels effectively share the same value
# # 18 "Huffington Post"                                          
# # 19 "HuffPost" 

# Update level 19 to be the same as level 18
levels(dataImp$sponsor)[19] <- "Huffington Post"

# Check the changed levels for column 'sponsor'
# levels(dataImp$sponsor) # -----------------------Output is Excessive

# 'HuffPost' is effectively removed

# Checking Factors for column 'population'
levels(dataImp$population)

## [1] "a"  "lv" "rv"

# Values are quite meaningless, let's add the proper descriptions
# # 'A' for adults, 'RV' for registered voters, 'LV' for likely voters.

# Renaming by position is enough; the new names do not need to be appended first
levels(dataImp$population)[1] <- "Adults"
levels(dataImp$population)[2] <- "Likely Voters"
levels(dataImp$population)[3] <- "Registered Voters"

# Check the changed levels for column 'population'
levels(dataImp$population)

## [1] "Adults"            "Likely Voters"     "Registered Voters"

# Levels for Factor 'population' have more descriptive values now
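Renaming levels by position depends on the order in which the levels happen to be sorted. As a hedged alternative, base R also accepts a named list in `levels<-`, which maps old codes to new labels by name. The factor below is toy data invented for illustration, not the real column.

```r
# Toy factor (invented) using the same codes as the 'population' column
population <- factor(c("a", "rv", "a", "lv"))

# Each list name is the new label; each value is the old level it replaces
levels(population) <- list(
  "Adults"            = "a",
  "Likely Voters"     = "lv",
  "Registered Voters" = "rv"
)

levels(population)
table(population)
```

Because the mapping is by name, it keeps working if a new code appears in the data and shifts the sort order of the levels.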

# Checking Factors for column 'party'
levels(dataImp$party)

## [1] "all"

# Only 1 value no need to change

# Checking Factors for column 'subject'
levels(dataImp$subject)

## [1] "concern-economy"  "concern-infected"

# Only 2 values which are descriptive enough

# Checking Factors for column 'text' 
# levels(dataImp$text) # -----------------------Output is Excessive

# Element 18 is the same as 17 except the '-' is replaced by special characters, will set 18 to be the same as 17
levels(dataImp$text)[18] <- "How concerned are you about someone in your family becoming seriously ill from the coronavirus outbreak - very concerned, somewhat concerned, not too concerned, or not at all concerned?"

#levels(dataImp$text) # -----------------------Output is Excessive

# Element 40 is the same as 41 except there are two '-', will set 40 to be the same as 41 
levels(dataImp$text)[40] <- "How do you feel about the possibility that you or someone in your immediate family might catch the coronavirus - very worried, somewhat worried, not too worried, or not worried at all?"

#levels(dataImp$text) # -----------------------Output is Excessive

# Element 41 is the same as 40 except the '-' is replaced by special characters, will set 41 to be the same as 40
levels(dataImp$text)[41] <- "How do you feel about the possibility that you or someone in your immediate family might catch the coronavirus - very worried, somewhat worried, not too worried, or not worried at all?"

#levels(dataImp$text) # -----------------------Output is Excessive

# Element 41 is the same as 40 except the '-' is replaced by special characters and there is an extra ',', will set 41 to be the same as 40
levels(dataImp$text)[41] <- "How do you feel about the possibility that you or someone in your immediate family might catch the coronavirus - very worried, somewhat worried, not too worried, or not worried at all?"

#levels(dataImp$text) # -----------------------Output is Excessive

# Element 47 is the same as 48 except the '-' is replaced by special characters, will set 47 to be the same as 48
levels(dataImp$text)[47] <- "How worried are you about you or someone in your family being infected with the Coronavirus?"

#levels(dataImp$text) # -----------------------Output is Excessive

# Element 48 is the same as 47 except the '-' is replaced by special characters, will set 48 to be the same as 47
levels(dataImp$text)[48] <- "How worried are you about you or someone in your family being infected with the Coronavirus?"

#levels(dataImp$text) # -----------------------Output is Excessive

# Element 52 is the same as 51 except there is an extra "\n" at the end, will set 52 to be the same as 51
levels(dataImp$text)[52] <- "How worried are you that the coronavirus outbreak will have a negative economic effect on the United States?"

# Final result for cleaned up 'text' Factor values 
# levels(dataImp$text) # -----------------------Output is Excessive

# Even after the clean-up, many of the 'text' values are very similarly worded; if they were to be used in the analysis, further consolidation could be performed

# Checking Factors for column 'url' 
# levels(dataImp$url) # -----------------------Output is Excessive

# 511 out of 657 observations have a unique url, so it's unlikely any further analysis would be performed on this value

# Check the Column Headings

# For the columns with the responses (very, somewhat, not_very, not_at_all), it is not clear if this is a 
# count or a percentage response
# # To determine this, sum the four response values in each row
dataImp <- mutate(dataImp, Total = very + somewhat + not_very + not_at_all)  

# Determine the Minimum and Maximum values
range(dataImp$Total, na.rm=TRUE)

## [1]  80 101

# From this we see that the totals range between 80 and 101 (101 being an anomaly in itself, as a percentage should max out at 100; possibly a rounding effect)
# Considering that some of the sample_size values are in the 1000+ range, we can assume the value is a percentage and not a count
## Note: Another indication is that some of the values are not integers (Example: 26.8)

# Remove the created Column as no longer required
dataImp <- subset(dataImp, select=-c(Total))

# Therefore in naming the column for the responses we should clearly indicate that it's a Percentage

dataImp <-rename(dataImp, poll_question = text, poll_url = url, response_perc_very = very, response_perc_somewhat = somewhat, response_perc_not_very = not_very, response_perc_not_at_all = not_at_all )
                 
# Check that the columns have been renamed correctly
names(dataImp)

##  [1] "start_date"               "end_date"                
##  [3] "pollster"                 "sponsor"                 
##  [5] "sample_size"              "population"              
##  [7] "party"                    "subject"                 
##  [9] "tracking"                 "poll_question"           
## [11] "response_perc_very"       "response_perc_somewhat"  
## [13] "response_perc_not_very"   "response_perc_not_at_all"
## [15] "poll_url"

Tidy data

Check if the data conforms to the tidy data principles. If your data is untidy, reshape your data into a tidy format. If the data is in a tidy format, you will be expected to explain why the data is originally ‘tidy’.

# This is a chunk where you check if the data conforms to the tidy data principles and reshape your data into a tidy format.

# Data conforms to Tidy Data Principles
# 1: Each variable must have its own column - True
# 2: Each observation must have its own row - True
# 3: Each value must have its own cell      - True

# Data was sourced in this format. Other than the steps performed above to convert columns to particular data types
# (Dates and Factors) and to rename columns, there has been no need for any further tidying

Summary statistics

Provide summary statistics (mean, median, minimum, maximum, standard deviation) of numeric variables grouped by one of the qualitative (categorical) variable. For example, if your categorical variable is age groups and quantitative variable is income, provide summary statistics of income grouped by the age groups.

# This is a chunk where you provide summary statistics

# Generate the summary statistics based on categorical variable 'subject' and poll response 'response_perc_very'
dataImp%>%
group_by(Poll_Subject = subject)%>% 
summarise(Mean=mean(response_perc_very, na.rm = TRUE), 
          Median=median(response_perc_very, na.rm = TRUE),
          Minimum=min(response_perc_very, na.rm = TRUE), 
          Maximum=max(response_perc_very, na.rm = TRUE), 
          Standard_Deviation=sd(response_perc_very, na.rm = TRUE))

## # A tibble: 2 x 6
##   Poll_Subject      Mean Median Minimum Maximum Standard_Deviation
##   <fct>            <dbl>  <dbl>   <dbl>   <dbl>              <dbl>
## 1 concern-economy   52.3     52      19      73               9.62
## 2 concern-infected  30.8     30       8      62               9.87

# Summary Statistics generated
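As a rough cross-check that does not depend on dplyr, the same kind of grouped summary can be sketched with base R's `aggregate()`. The data frame `polls` below is toy data invented for illustration; only the column names mirror the original import.

```r
# Toy data (invented): two subjects, one missing response
polls <- data.frame(
  subject = c("concern-economy", "concern-economy",
              "concern-infected", "concern-infected", "concern-infected"),
  very    = c(19, 26, 13, 11, NA)
)

# The formula interface drops rows with NA before applying FUN
stats <- aggregate(
  very ~ subject,
  data = polls,
  FUN  = function(x) c(mean = mean(x), median = median(x),
                       min = min(x), max = max(x), sd = sd(x))
)

stats
```

The statistics come back as a matrix column (`stats$very`), one row per subject, which is less convenient than the tibble that `summarise()` returns but is enough to confirm the numbers.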

Create a list

Create a list that contains a numeric value for each response to the categorical variable. Typically, they are numbered from 1-n.

# This is a chunk where you create a list

# Create a List based on the column 'subject' factor values
subjectList <- list(as.numeric(dataImp$subject), dataImp$subject) 

# Display the List Structure
str(subjectList)

## List of 2
##  $ : num [1:678] 1 1 2 1 2 2 2 1 2 2 ...
##  $ : Factor w/ 2 levels "concern-economy",..: 1 1 2 1 2 2 2 1 2 2 ...

Join the list

Join this list to the data frame using a join of your choice. Remember that this has to keep the numeric variable, as well as matching to your categorical variable.

# This is a chunk where you join the list

# Create a dataframe from the newly created list 
subjectListDf<- as.data.frame(subjectList)

# Determine the column names of the dataframe
colnames(subjectListDf)

## [1] "c.1..1..2..1..2..2..2..1..2..2..2..1..1..1..2..1..1..2..2..1.."  
## [2] "structure.c.1L..1L..2L..1L..2L..2L..2L..1L..2L..2L..2L..1L..1L.."

# Rename the Columns
subjectListDf <-rename(subjectListDf, subject_levels = "c.1..1..2..1..2..2..2..1..2..2..2..1..1..1..2..1..1..2..2..1..")
subjectListDf <-rename(subjectListDf, subject = "structure.c.1L..1L..2L..1L..2L..2L..2L..1L..2L..2L..2L..1L..1L..")

# Check the column names are correctly updated
colnames(subjectListDf)

## [1] "subject_levels" "subject"
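The long auto-generated column names arise because the list elements are unnamed, so `as.data.frame()` has to derive names from the values themselves. A minimal sketch of an alternative, assuming it is acceptable to name the elements when the list is created (toy factor invented for illustration):

```r
# Toy factor (invented) standing in for dataImp$subject
subject <- factor(c("concern-economy", "concern-economy", "concern-infected"))

# Naming the list elements gives the data frame clean column names directly
subjectList   <- list(subject_levels = as.numeric(subject), subject = subject)
subjectListDf <- as.data.frame(subjectList)

colnames(subjectListDf)
```

With named elements the two `rename()` calls are no longer needed.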

# Now Remove the Duplicates otherwise you'll get thousands of Join Results (Around 250,000+)
subjectListDf <- subjectListDf[!duplicated(subjectListDf$subject_levels),]

# Show the content of the deduplicated dataframe
str(subjectListDf)

## 'data.frame':    2 obs. of  2 variables:
##  $ subject_levels: num  1 2
##  $ subject       : Factor w/ 2 levels "concern-economy",..: 1 2

# We now have 2 Unique factor values to work with

# Now join the Original dataframe to the 'subject' factor based dataframe 
newDataFrameFromList<- inner_join(dataImp,subjectListDf,by = "subject", copy=TRUE)

# Check the structure of the new Dataframe; the new column 'subject_levels' will be the rightmost column
str(newDataFrameFromList)

## 'data.frame':    678 obs. of  16 variables:
##  $ start_date              : Date, format: "2020-01-27" "2020-01-31" ...
##  $ end_date                : Date, format: "2020-01-29" "2020-02-02" ...
##  $ pollster                : Factor w/ 41 levels "Optimus","ABC",..: 24 24 41 24 41 41 3 18 18 41 ...
##  $ sponsor                 : Factor w/ 35 levels "ABC News","American Enterprise Institute",..: NA NA 12 NA 17 12 NA NA NA 12 ...
##  $ sample_size             : int  2202 2202 1500 2200 1000 1500 1074 1207 1207 1500 ...
##  $ population              : Factor w/ 3 levels "Adults","Likely Voters",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ party                   : Factor w/ 1 level "all": 1 1 1 1 1 1 1 1 1 1 ...
##  $ subject                 : Factor w/ 2 levels "concern-economy",..: 1 1 2 1 2 2 2 1 2 2 ...
##  $ tracking                : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ poll_question           : Factor w/ 83 levels "And thinking now about the coronavirus outbreak, how worried are you that you or someone in your family will co"| __truncated__,..: 27 27 76 27 30 76 47 36 37 76 ...
##  $ response_perc_very      : num  19 26 13 23 11 11 22 22 22 10 ...
##  $ response_perc_somewhat  : num  33 32 26 32 24 28 23 35 21 28 ...
##  $ response_perc_not_very  : num  23 25 43 24 33 39 37 28 33 42 ...
##  $ response_perc_not_at_all: num  11 7 18 9 20 22 19 15 23 19 ...
##  $ poll_url                : Factor w/ 511 levels "http://apnorc.org/PDFs/AP-NORC%20June%202020/Topline_covid.pdf",..: 270 271 87 272 354 88 8 4 4 89 ...
##  $ subject_levels          : num  1 1 2 1 2 2 2 1 2 2 ...
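The reason for de-duplicating the lookup before joining can be illustrated on a small scale: every row of the data matches every lookup row that has the same key, so duplicated keys multiply the result. The sketch below uses base R's `merge()` (an inner join by default) as a stand-in for `inner_join()`, with toy data invented for illustration.

```r
# Toy data (invented): three polls across two subjects
polls <- data.frame(
  subject = c("concern-economy", "concern-economy", "concern-infected"),
  very    = c(19, 26, 13)
)

# A lookup with one row per poll repeats each key
lookup_dup    <- data.frame(subject = polls$subject, subject_levels = c(1, 1, 2))
# Keeping one row per key gives a proper lookup table
lookup_unique <- unique(lookup_dup)

# Duplicated keys: 2 x 2 matches for economy plus 1 x 1 for infected
rows_dup    <- nrow(merge(polls, lookup_dup, by = "subject"))
# Unique keys: exactly one match per poll
rows_unique <- nrow(merge(polls, lookup_unique, by = "subject"))

c(duplicated = rows_dup, unique = rows_unique)
```

The same effect at the scale of 678 rows is what produces the very large join result mentioned above, and a joined row count equal to the original row count is a quick sanity check that the lookup keys are unique.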

Subsetting I

Subset the data frame using the first 10 observations (include all variables). Then convert it to a matrix. Check the structure of that matrix (i.e. check whether the matrix is character, numeric, integer, factor, or logical) and explain in a few words why you ended up with that structure.

# This is a chunk to subset your data and convert it to a matrix 

# We take the first 10 rows from our dataframe
dataImpSubSet01 <- dataImp[1:10,]

# Check the structure of the generated dataframe
str(dataImpSubSet01)

## 'data.frame':    10 obs. of  15 variables:
##  $ start_date              : Date, format: "2020-01-27" "2020-01-31" ...
##  $ end_date                : Date, format: "2020-01-29" "2020-02-02" ...
##  $ pollster                : Factor w/ 41 levels "Optimus","ABC",..: 24 24 41 24 41 41 3 18 18 41
##  $ sponsor                 : Factor w/ 35 levels "ABC News","American Enterprise Institute",..: NA NA 12 NA 17 12 NA NA NA 12
##  $ sample_size             : int  2202 2202 1500 2200 1000 1500 1074 1207 1207 1500
##  $ population              : Factor w/ 3 levels "Adults","Likely Voters",..: 1 1 1 1 1 1 1 1 1 1
##  $ party                   : Factor w/ 1 level "all": 1 1 1 1 1 1 1 1 1 1
##  $ subject                 : Factor w/ 2 levels "concern-economy",..: 1 1 2 1 2 2 2 1 2 2
##  $ tracking                : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ poll_question           : Factor w/ 83 levels "And thinking now about the coronavirus outbreak, how worried are you that you or someone in your family will co"| __truncated__,..: 27 27 76 27 30 76 47 36 37 76
##  $ response_perc_very      : num  19 26 13 23 11 11 22 22 22 10
##  $ response_perc_somewhat  : num  33 32 26 32 24 28 23 35 21 28
##  $ response_perc_not_very  : num  23 25 43 24 33 39 37 28 33 42
##  $ response_perc_not_at_all: num  11 7 18 9 20 22 19 15 23 19
##  $ poll_url                : Factor w/ 511 levels "http://apnorc.org/PDFs/AP-NORC%20June%202020/Topline_covid.pdf",..: 270 271 87 272 354 88 8 4 4 89

# Now convert the subset dataframe to a Matrix
dataImpMatrix01 <- matrix(dataImpSubSet01)

# Check the structure of the generated matrix
str(dataImpMatrix01)

## List of 15
##  $ : Date[1:10], format: "2020-01-27" "2020-01-31" ...
##  $ : Date[1:10], format: "2020-01-29" "2020-02-02" ...
##  $ : Factor w/ 41 levels "Optimus","ABC",..: 24 24 41 24 41 41 3 18 18 41
##  $ : Factor w/ 35 levels "ABC News","American Enterprise Institute",..: NA NA 12 NA 17 12 NA NA NA 12
##  $ : int [1:10] 2202 2202 1500 2200 1000 1500 1074 1207 1207 1500
##  $ : Factor w/ 3 levels "Adults","Likely Voters",..: 1 1 1 1 1 1 1 1 1 1
##  $ : Factor w/ 1 level "all": 1 1 1 1 1 1 1 1 1 1
##  $ : Factor w/ 2 levels "concern-economy",..: 1 1 2 1 2 2 2 1 2 2
##  $ : logi [1:10] FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ : Factor w/ 83 levels "And thinking now about the coronavirus outbreak, how worried are you that you or someone in your family will co"| __truncated__,..: 27 27 76 27 30 76 47 36 37 76
##  $ : num [1:10] 19 26 13 23 11 11 22 22 22 10
##  $ : num [1:10] 33 32 26 32 24 28 23 35 21 28
##  $ : num [1:10] 23 25 43 24 33 39 37 28 33 42
##  $ : num [1:10] 11 7 18 9 20 22 19 15 23 19
##  $ : Factor w/ 511 levels "http://apnorc.org/PDFs/AP-NORC%20June%202020/Topline_covid.pdf",..: 270 271 87 272 354 88 8 4 4 89
##  - attr(*, "dim")= int [1:2] 15 1

# Determine the class of the generated matrix
class(dataImpMatrix01)

## [1] "matrix" "array"

# class() returns both "matrix" and "array" because a matrix is a 2-dimensional array

# Looking at the individual elements of the matrix

class(dataImpMatrix01[1])

## [1] "list"

class(dataImpMatrix01[6])

## [1] "list"

class(dataImpMatrix01[9])

## [1] "list"

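# Each element is a list because matrix() treated the dataframe as a list of 15 columns.
# The result is a 15 x 1 list-matrix (see the "dim" attribute above), not a 10 x 15 matrix of values,
# and single-bracket indexing of a list always returns a list.
# as.matrix() would instead give a 10 x 15 character matrix, since a matrix can hold only one data type.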
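
The difference between matrix() and as.matrix() on a data frame can be seen on a tiny example. This is a sketch only; `df`, `listMatrix` and `charMatrix` are hypothetical names and the data is made up for illustration.

```r
# Hypothetical mixed-type data frame with 2 rows and 3 columns
df <- data.frame(
  id    = 1:2,
  label = c("a", "b"),
  flag  = c(TRUE, FALSE)
)

# matrix() treats the data frame as a list of columns: one cell per column
listMatrix <- matrix(df)
dim(listMatrix)      # one row per column of df, and a single column
typeof(listMatrix)   # the cells are list elements

# as.matrix() converts value by value and coerces mixed types to character
charMatrix <- as.matrix(df)
dim(charMatrix)      # same shape as df
typeof(charMatrix)   # every cell is now a character string

# Both results are matrices, and a matrix is a 2-dimensional array
is.matrix(listMatrix)
is.array(charMatrix)
```

Because a matrix can hold only one data type, as.matrix() falls back to character whenever the columns mix numbers, text and logicals.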

Subsetting II

Subset the data frame including only the first and the last variables in the data set, then save it as an R object file (.RData). Provide the R codes with outputs and explain everything that you do in this step.

# This is a chunk to subset your data and convert it to an R object file 
# We take the first and last rows of our dataframe

# Count the number of Rows in the Dataframe
numberRows <- nrow(dataImp)

# Check that the value matches the expected row count (678)
str(numberRows)

##  int 678

#numberRows <- as.numeric(numberRows)

# Extract Row 1 and View
dataImpSubSetFirst <- dataImp[1,]
str(dataImpSubSetFirst)

## 'data.frame':    1 obs. of  15 variables:
##  $ start_date              : Date, format: "2020-01-27"
##  $ end_date                : Date, format: "2020-01-29"
##  $ pollster                : Factor w/ 41 levels "Optimus","ABC",..: 24
##  $ sponsor                 : Factor w/ 35 levels "ABC News","American Enterprise Institute",..: NA
##  $ sample_size             : int 2202
##  $ population              : Factor w/ 3 levels "Adults","Likely Voters",..: 1
##  $ party                   : Factor w/ 1 level "all": 1
##  $ subject                 : Factor w/ 2 levels "concern-economy",..: 1
##  $ tracking                : logi FALSE
##  $ poll_question           : Factor w/ 83 levels "And thinking now about the coronavirus outbreak, how worried are you that you or someone in your family will co"| __truncated__,..: 27
##  $ response_perc_very      : num 19
##  $ response_perc_somewhat  : num 33
##  $ response_perc_not_very  : num 23
##  $ response_perc_not_at_all: num 11
##  $ poll_url                : Factor w/ 511 levels "http://apnorc.org/PDFs/AP-NORC%20June%202020/Topline_covid.pdf",..: 270

# Extract the last row
dataImpSubSetLast <- dataImp[numberRows,]
str(dataImpSubSetLast)

## 'data.frame':    1 obs. of  15 variables:
##  $ start_date              : Date, format: "2021-03-08"
##  $ end_date                : Date, format: "2021-03-30"
##  $ pollster                : Factor w/ 41 levels "Optimus","ABC",..: 28
##  $ sponsor                 : Factor w/ 35 levels "ABC News","American Enterprise Institute",..: 17
##  $ sample_size             : int 5625
##  $ population              : Factor w/ 3 levels "Adults","Likely Voters",..: 1
##  $ party                   : Factor w/ 1 level "all": 1
##  $ subject                 : Factor w/ 2 levels "concern-economy",..: 2
##  $ tracking                : logi FALSE
##  $ poll_question           : Factor w/ 83 levels "And thinking now about the coronavirus outbreak, how worried are you that you or someone in your family will co"| __truncated__,..: 65
##  $ response_perc_very      : num 23
##  $ response_perc_somewhat  : num 39
##  $ response_perc_not_very  : num 27
##  $ response_perc_not_at_all: num 11
##  $ poll_url                : Factor w/ 511 levels "http://apnorc.org/PDFs/AP-NORC%20June%202020/Topline_covid.pdf",..: 472

# Combine the two dataframes using rbind
dataSubsetCombined <- rbind(dataImpSubSetFirst, dataImpSubSetLast)
str(dataSubsetCombined)

## 'data.frame':    2 obs. of  15 variables:
##  $ start_date              : Date, format: "2020-01-27" "2021-03-08"
##  $ end_date                : Date, format: "2020-01-29" "2021-03-30"
##  $ pollster                : Factor w/ 41 levels "Optimus","ABC",..: 24 28
##  $ sponsor                 : Factor w/ 35 levels "ABC News","American Enterprise Institute",..: NA 17
##  $ sample_size             : int  2202 5625
##  $ population              : Factor w/ 3 levels "Adults","Likely Voters",..: 1 1
##  $ party                   : Factor w/ 1 level "all": 1 1
##  $ subject                 : Factor w/ 2 levels "concern-economy",..: 1 2
##  $ tracking                : logi  FALSE FALSE
##  $ poll_question           : Factor w/ 83 levels "And thinking now about the coronavirus outbreak, how worried are you that you or someone in your family will co"| __truncated__,..: 27 65
##  $ response_perc_very      : num  19 23
##  $ response_perc_somewhat  : num  33 39
##  $ response_perc_not_very  : num  23 27
##  $ response_perc_not_at_all: num  11 11
##  $ poll_url                : Factor w/ 511 levels "http://apnorc.org/PDFs/AP-NORC%20June%202020/Topline_covid.pdf",..: 270 472

# Set the Work Directory for the output of Step 10
setwd("C:/Users/I327371/Documents/R")

# Save the combined dataframe to the Output Directory
save(dataSubsetCombined, file = "output/DWAssign_01.RData")
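
The task wording refers to the first and last variables (columns), whereas the code above selects the first and last observations (rows). A column-based subset might look like the sketch below. The data frame `df`, its example.com URLs and the temporary output file are assumptions made for illustration, not the data or paths used in this report.

```r
# Hypothetical data frame standing in for the imported data
df <- data.frame(
  start_date  = as.Date(c("2020-01-27", "2020-01-31")),
  sample_size = c(2202L, 1500L),
  poll_url    = c("http://example.com/a", "http://example.com/b")
)

# Keep only the first and last variables (columns), with all rows
firstLastVars <- df[, c(1, ncol(df))]
names(firstLastVars)

# Save to a temporary .RData file
outFile <- tempfile(fileext = ".RData")
save(firstLastVars, file = outFile)

# Load it into a fresh environment to confirm the round trip
checkEnv <- new.env()
load(outFile, envir = checkEnv)
identical(checkEnv$firstLastVars, firstLastVars)
```

Using ncol(df) instead of a hard-coded column number keeps the subset correct if columns are added or removed later.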

Create a new Data Frame

Create a data frame with 2 variables. Your data frame has to contain one integer variable and one ordinal variable.

The ordinal variable has to be a factor and ordered properly. Make sure you name your variables.
Show the structure of your variables and the levels of the ordinal variable.
Create another numeric vector and use cbind() to add this vector to your data frame.
After this step you should have 3 variables in the data frame.
Check the attributes and the dimension of your new data frame.
Provide the R codes with outputs and explain everything that you do in this step.

# This is a chunk to create a new data frame with the given specifications

# Create the Data Frame (1 Numeric and 1 Ordinal (Factored))
dfStep11 <- data.frame(
   count = as.numeric(c (10, 20, 25, 37, 45)), 
   temp = as.factor(c("Very Low","Low","Medium","High","Very High")))
   
# Set the Order for the Ordinal
dfStep11$temp <- ordered(dfStep11$temp, levels = c("Very Low", "Low", "Medium", "High", "Very High"))

# Check the Structure of the new Data Frame
str(dfStep11)

## 'data.frame':    5 obs. of  2 variables:
##  $ count: num  10 20 25 37 45
##  $ temp : Ord.factor w/ 5 levels "Very Low"<"Low"<..: 1 2 3 4 5

# Now create a separate Vector with just a numeric
vecNumericStep11 <- as.numeric(c (20, 40, 50, 74, 90))

# Check the Structure of the new Vector
str(vecNumericStep11)

##  num [1:5] 20 40 50 74 90

# Combine the Data Frame and Vector
step11Combined <- cbind(dfStep11, vecNumericStep11)

# Check the Structure of the combined Data Frame and Vector
str(step11Combined)

## 'data.frame':    5 obs. of  3 variables:
##  $ count           : num  10 20 25 37 45
##  $ temp            : Ord.factor w/ 5 levels "Very Low"<"Low"<..: 1 2 3 4 5
##  $ vecNumericStep11: num  20 40 50 74 90
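
The task asks for an integer variable and for the attributes and dimensions to be checked, while the output above shows `count` stored as num. One possible variation is sketched below; `dfSketch` and the column name `doubled` are hypothetical and chosen only for this illustration.

```r
# Integer column (note the L suffix) plus an ordered factor built in one call
dfSketch <- data.frame(
  count = c(10L, 20L, 25L, 37L, 45L),
  temp  = factor(
    c("Very Low", "Low", "Medium", "High", "Very High"),
    levels  = c("Very Low", "Low", "Medium", "High", "Very High"),
    ordered = TRUE
  )
)

# Confirm the types and the level ordering
is.integer(dfSketch$count)
is.ordered(dfSketch$temp)
levels(dfSketch$temp)

# Add a numeric vector as a named third column
dfSketch <- cbind(dfSketch, doubled = c(20, 40, 50, 74, 90))

# Attributes (names, class, row.names) and dimensions (rows, columns)
attributes(dfSketch)
dim(dfSketch)
```

Naming the vector inside cbind() gives the new column a readable name instead of the name of the vector object.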

Create another Data Frame

Create another data frame with a common variable to the dataset created in step 11.

Join the data frame to the dataset above, and ensure that the dataset is joined properly.
Ensure the new categorical variable is carried over to the larger dataset. For example, a dataset to join could contain State, Abbreviation, Municipality, Prevailing Religion.
Provide the R codes with outputs and explain everything that you do in this step.

# This is a chunk to create another data frame with the given specifications

# Create a Data Frame that can be joined with the Data Frame from Step 11
dfStep12 <- data.frame(
   temp    = as.factor(c("Very Low","Low","Medium","High","Very High")),
   loc     = c("Hobart", "Sydney", "Melbourne", "Canberra", "Brisbane"),
   state   = c("TAS", "NSW", "VIC", "ACT", "QLD"),
   weather = c("Woeful", "OK-Ish", "Terrible", "Chilly and Hot", "Just Hot"))

# Set the Order for the Ordinal
dfStep12$temp <- ordered(dfStep12$temp, levels = c("Very Low", "Low", "Medium", "High", "Very High"))

# Check the Structure of the new Data Frame
str(dfStep12)

## 'data.frame':    5 obs. of  4 variables:
##  $ temp   : Ord.factor w/ 5 levels "Very Low"<"Low"<..: 1 2 3 4 5
##  $ loc    : chr  "Hobart" "Sydney" "Melbourne" "Canberra" ...
##  $ state  : chr  "TAS" "NSW" "VIC" "ACT" ...
##  $ weather: chr  "Woeful" "OK-Ish" "Terrible" "Chilly and Hot" ...

# Now Join the new Data Frame to that from Step 11
newDfStep12 <- inner_join(step11Combined, dfStep12, by = "temp", copy = TRUE)

# Check the Structure of the new Joined Data Frame
str(newDfStep12)

## 'data.frame':    5 obs. of  6 variables:
##  $ count           : num  10 20 25 37 45
##  $ temp            : Ord.factor w/ 5 levels "Very Low"<"Low"<..: 1 2 3 4 5
##  $ vecNumericStep11: num  20 40 50 74 90
##  $ loc             : chr  "Hobart" "Sydney" "Melbourne" "Canberra" ...
##  $ state           : chr  "TAS" "NSW" "VIC" "ACT" ...
##  $ weather         : chr  "Woeful" "OK-Ish" "Terrible" "Chilly and Hot" ...
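
One way to check that a join worked properly is to look for keys that fail to match. The sketch below uses made-up tables `left` and `right` (hypothetical names, not the data frames above) and assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical tables sharing the key column "temp"
left <- data.frame(
  temp  = c("Low", "Medium", "High"),
  count = c(20, 25, 37)
)
right <- data.frame(
  temp = c("Low", "High"),
  loc  = c("Sydney", "Canberra")
)

# Rows in `left` that have no matching key in `right`
unmatched <- anti_join(left, right, by = "temp")
unmatched$temp

# inner_join drops unmatched rows; left_join keeps them with NA in the new columns
nrow(inner_join(left, right, by = "temp"))
nrow(left_join(left, right, by = "temp"))
```

An empty anti_join() result means every key found a match, which is the case for the five `temp` levels joined above, since both data frames contain the same five values.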

IMPORTANT NOTE:

The report must be uploaded to the Assignment 1 section in Canvas as a PDF document with R codes and outputs showing. The easiest way to achieve this is to run all R chunks first, then Preview your notebook in HTML (by clicking Preview), then Open in Browser (Chrome), then Right Click on the report in Chrome, then Click Print and Select the Destination Option to Save as PDF. Upload this PDF report as one single file via the Assignment 1 page in Canvas.

DELETE the instructional text provided in the template. Failure to do this will INCREASE the SIMILARITY INDEX reported in TURNITIN. If you have any questions regarding the assignment instructions and the R template, please post them on the Canvas discussion.