Data 607 - Homework 1

Code and Comments

Load Libraries

If the libraries are not installed, install them first with install.packages("package_name").
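As a minimal sketch of that step, the check below installs only the packages that are actually missing. The helper name `missing_pkgs` is hypothetical and not part of the original homework, and the `interactive()` guard is an assumption added so that knitting the report never triggers a download.

```r
# Return the subset of `pkgs` that cannot be loaded (hypothetical helper)
missing_pkgs <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_pkgs(c("ggplot2", "stringr"))

# Only install in an interactive session, so a non-interactive run stays offline
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)
}
```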

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 4.0.2

library(stringr)

## Warning: package 'stringr' was built under R version 4.0.2

Data Frame Creation Based on File in GitHub

The data was uploaded to a personal GitHub repository. Here it is read directly from the raw file URL into a data frame. The ‘head’ and ‘str’ functions are then used to view key attributes of the data frame and confirm that it loaded correctly.

df_orig <- read.csv("https://raw.githubusercontent.com/cwestsmith/cuny-msds/master/datasets/covid_concern_toplines.csv", header = TRUE)
head(df_orig, n = 5L)

##            subject modeldate party very_estimate somewhat_estimate
## 1 concern-infected 8/19/2020   all      34.23500          33.45410
## 2  concern-economy 8/19/2020   all      56.55267          30.71620
## 3 concern-infected 8/18/2020   all      34.67753          33.41458
## 4  concern-economy 8/18/2020   all      56.55267          30.71620
## 5  concern-economy 8/17/2020   all      57.24885          29.80712
##   not_very_estimate not_at_all_estimate      timestamp
## 1         17.796073           12.010591 19-08-20 10:15
## 2          8.068015            3.021646 19-08-20 10:15
## 3         17.534226           11.746447 18-08-20 19:50
## 4          8.068015            3.021646 18-08-20 19:50
## 5          8.518961            2.888446 17-08-20 22:45

str(df_orig)

## 'data.frame':    374 obs. of  8 variables:
##  $ subject            : chr  "concern-infected" "concern-economy" "concern-infected" "concern-economy" ...
##  $ modeldate          : chr  "8/19/2020" "8/19/2020" "8/18/2020" "8/18/2020" ...
##  $ party              : chr  "all" "all" "all" "all" ...
##  $ very_estimate      : num  34.2 56.6 34.7 56.6 57.2 ...
##  $ somewhat_estimate  : num  33.5 30.7 33.4 30.7 29.8 ...
##  $ not_very_estimate  : num  17.8 8.07 17.53 8.07 8.52 ...
##  $ not_at_all_estimate: num  12.01 3.02 11.75 3.02 2.89 ...
##  $ timestamp          : chr  "19-08-20 10:15" "19-08-20 10:15" "18-08-20 19:50" "18-08-20 19:50" ...
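Beyond eyeballing the ‘head’ and ‘str’ output, the same checks can be written as assertions. The sketch below is illustrative only: `sample_df` is a stand-in built from the first two rows printed above (not the full 374-row file), and the names `expected_cols`, `missing_cols`, `all_numeric` and `row_totals` are assumptions introduced for the example.

```r
# Stand-in for df_orig: the first two rows shown in the head() output
sample_df <- data.frame(
  subject             = c("concern-infected", "concern-economy"),
  modeldate           = c("8/19/2020", "8/19/2020"),
  party               = c("all", "all"),
  very_estimate       = c(34.23500, 56.55267),
  somewhat_estimate   = c(33.45410, 30.71620),
  not_very_estimate   = c(17.796073, 8.068015),
  not_at_all_estimate = c(12.010591, 3.021646),
  timestamp           = c("19-08-20 10:15", "19-08-20 10:15"),
  stringsAsFactors    = FALSE
)

# Columns the rest of the analysis relies on
expected_cols <- c("subject", "modeldate", "very_estimate", "somewhat_estimate",
                   "not_very_estimate", "not_at_all_estimate")
missing_cols <- setdiff(expected_cols, names(sample_df))

# The four estimate columns should be numeric percentages
estimate_cols <- grep("_estimate$", names(sample_df), value = TRUE)
all_numeric <- all(vapply(sample_df[estimate_cols], is.numeric, logical(1)))
row_totals <- rowSums(sample_df[estimate_cols])

stopifnot(length(missing_cols) == 0, all_numeric, all(row_totals <= 100))
```

Running the same assertions against `df_orig` would stop the script early if a column were renamed or read in as text.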

Creation of New Data Frame Including Subset of Original Columns

A new data frame is created containing a subset of the columns in the original source. In addition, two new columns are added that sum the estimate columns into totals for “concerned” (very + somewhat) and “not concerned” (not very + not at all). I referred to columns by name rather than by number because I find names easier to read. With a much larger number of columns, I would have opted for numbers for practical reasons.

df_new1 <- df_orig[, c("subject","modeldate","very_estimate","somewhat_estimate","not_very_estimate","not_at_all_estimate")]
df_new1$total_concerned <- df_new1$very_estimate + df_new1$somewhat_estimate
df_new1$total_not_concerned <- df_new1$not_very_estimate + df_new1$not_at_all_estimate
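As a worked check of the two totals, the sketch below recomputes them for the first two rows printed by ‘head’ above. For the first row, 34.23500 + 33.45410 = 67.6891 concerned and 17.796073 + 12.010591 = 29.806664 not concerned. The data frame `est` is a stand-in for illustration, and `rowSums()` is used here as an alternative to `+`, not as the method used in the homework.

```r
# First two rows of the estimate columns, copied from the head() output
est <- data.frame(
  very_estimate       = c(34.23500, 56.55267),
  somewhat_estimate   = c(33.45410, 30.71620),
  not_very_estimate   = c(17.796073, 8.068015),
  not_at_all_estimate = c(12.010591, 3.021646)
)

# Same totals as the `+` version, written with rowSums() over named columns
est$total_concerned     <- rowSums(est[, c("very_estimate", "somewhat_estimate")])
est$total_not_concerned <- rowSums(est[, c("not_very_estimate", "not_at_all_estimate")])

print(est[, c("total_concerned", "total_not_concerned")])
```

`rowSums()` scales more easily if further columns are added to either total later.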

Rename Column

The ‘modeldate’ column name is replaced with ‘model_date’ for consistency with the other column names and improved readability. ‘colnames’ is used to verify the change.

colnames(df_new1)[colnames(df_new1) == "modeldate"] <- "model_date"
colnames(df_new1)

## [1] "subject"             "model_date"          "very_estimate"      
## [4] "somewhat_estimate"   "not_very_estimate"   "not_at_all_estimate"
## [7] "total_concerned"     "total_not_concerned"

Convert Character Column to Date Column

The ‘model_date’ values are in two different formats, using ‘-’ or ‘/’ as the separator depending on the row. For the ‘as.Date’ function to parse them with a single format string, the separators first need to be harmonized. The ‘class’ and ‘head’ functions are then used to verify that the column was converted properly.

df_new1$model_date <- str_replace_all(df_new1$model_date, '-', '/')
df_new1$model_date <- as.Date(df_new1$model_date, format="%m/%d/%Y")
class(df_new1$model_date)

## [1] "Date"

head(df_new1$model_date, n = 5L)

## [1] "2020-08-19" "2020-08-19" "2020-08-18" "2020-08-18" "2020-08-17"
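Checking only ‘class’ and the first few rows does not show whether every row parsed, because ‘as.Date’ returns NA for any value that does not match the format string. The sketch below illustrates this with hypothetical input values (`raw_dates` is invented for the example and is not taken from the data set) and uses base `gsub()` in place of `str_replace_all()` so that it runs without extra packages.

```r
# Hypothetical inputs: two separator styles plus one value in year-first order
raw_dates <- c("8/19/2020", "8-18-2020", "2020-08-17")

# Harmonize the separator, then parse with a single month/day/year format
harmonized <- gsub("-", "/", raw_dates, fixed = TRUE)
parsed <- as.Date(harmonized, format = "%m/%d/%Y")

# Values that did not match the format come back as NA
failed <- raw_dates[is.na(parsed)]
print(parsed)
print(failed)
```

Applied to the real column, `sum(is.na(df_new1$model_date))` would report how many rows failed to convert.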

Plot % of Americans Concerned and Not Concerned Over Time

ggplot(data = df_new1, aes(x = model_date, y = total_concerned, group = subject, color = subject)) +
  scale_x_date(limits = as.Date(c("2020-02-01", "2020-08-31"))) +
  ggtitle("% Americans Concerned Over Time") +
  ylab("% Concerned") +
  xlab("Date") +
  geom_line()

## Warning: Removed 144 row(s) containing missing values (geom_path).

ggplot(data = df_new1, aes(x = model_date, y = total_not_concerned, group = subject, color = subject)) +
  scale_x_date(limits = as.Date(c("2020-02-01", "2020-08-31"))) +
  ggtitle("% Americans Not Concerned Over Time") +
  ylab("% Not Concerned") +
  xlab("Date") +
  geom_line()

## Warning: Removed 144 row(s) containing missing values (geom_path).
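The warning above means that some rows were dropped before drawing. One plausible source, offered here as an assumption rather than a diagnosis, is rows whose ‘model_date’ is NA or falls outside the ‘scale_x_date’ limits, since values outside a scale's limits are treated as missing by default. The sketch below counts such rows in base R using hypothetical dates. The count it produces need not equal the number in the warning, because ‘geom_line’ handles missing values per group.

```r
# Hypothetical dates: one before the limits, one NA, two inside the limits
dates <- as.Date(c("2020-01-15", "2020-03-01", NA, "2020-08-19"))
limits <- as.Date(c("2020-02-01", "2020-08-31"))

# A row is a candidate for removal if its date is NA or outside the limits
dropped <- is.na(dates) | dates < limits[1] | dates > limits[2]
print(sum(dropped))
```

Running the same expression on `df_new1$model_date` would give a starting point for explaining the removed rows.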

cwestsmith

8/19/2020

Overview