For this assignment I used a dataset from FiveThirtyEight on how concerned Americans are about becoming infected with COVID-19 and about the pandemic's impact on the economy. Although only one dataset was used for this assignment, several were included on the website. The article and data can be found at this link.
If the libraries are not installed, install them first with the install.packages("package_name") function.
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.0.2
library(stringr)
## Warning: package 'stringr' was built under R version 4.0.2
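The install step mentioned above can be made conditional, so that packages are only installed when they are actually missing. The following is a minimal sketch rather than part of the original analysis; the helper name `missing_packages` is hypothetical, and the install call is left commented out so the chunk has no side effects when run.

```r
# Hypothetical helper: return the subset of `pkgs` that cannot be loaded.
# requireNamespace() returns FALSE (instead of raising an error) when a
# package is not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# The two packages this write-up loads with library()
to_install <- missing_packages(c("ggplot2", "stringr"))
print(to_install)

# Uncomment to install whatever is missing:
# if (length(to_install) > 0) install.packages(to_install)
```

Checking with requireNamespace() avoids attaching the packages early, so the library() calls above still behave as usual.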
The data was uploaded to a personal GitHub repository, from which it is read directly into a data frame. The ‘head’ and ‘str’ functions are then used to view key attributes of the data frame and to confirm that it loaded correctly.
df_orig <- read.csv("https://raw.githubusercontent.com/cwestsmith/cuny-msds/master/datasets/covid_concern_toplines.csv", header = TRUE)
head(df_orig, n = 5L)
## subject modeldate party very_estimate somewhat_estimate
## 1 concern-infected 8/19/2020 all 34.23500 33.45410
## 2 concern-economy 8/19/2020 all 56.55267 30.71620
## 3 concern-infected 8/18/2020 all 34.67753 33.41458
## 4 concern-economy 8/18/2020 all 56.55267 30.71620
## 5 concern-economy 8/17/2020 all 57.24885 29.80712
## not_very_estimate not_at_all_estimate timestamp
## 1 17.796073 12.010591 19-08-20 10:15
## 2 8.068015 3.021646 19-08-20 10:15
## 3 17.534226 11.746447 18-08-20 19:50
## 4 8.068015 3.021646 18-08-20 19:50
## 5 8.518961 2.888446 17-08-20 22:45
str(df_orig)
## 'data.frame': 374 obs. of 8 variables:
## $ subject : chr "concern-infected" "concern-economy" "concern-infected" "concern-economy" ...
## $ modeldate : chr "8/19/2020" "8/19/2020" "8/18/2020" "8/18/2020" ...
## $ party : chr "all" "all" "all" "all" ...
## $ very_estimate : num 34.2 56.6 34.7 56.6 57.2 ...
## $ somewhat_estimate : num 33.5 30.7 33.4 30.7 29.8 ...
## $ not_very_estimate : num 17.8 8.07 17.53 8.07 8.52 ...
## $ not_at_all_estimate: num 12.01 3.02 11.75 3.02 2.89 ...
## $ timestamp : chr "19-08-20 10:15" "19-08-20 10:15" "18-08-20 19:50" "18-08-20 19:50" ...
A new data frame is created containing a subset of the columns from the original source. In addition, two new columns are added that consolidate the four estimate columns into totals for “concerned” and “not concerned”. I referred to columns by name rather than by number because I find names easier to read. With a larger number of columns, though, I would have opted for numeric indexes for practical reasons.
df_new1 <- df_orig[, c("subject","modeldate","very_estimate","somewhat_estimate","not_very_estimate","not_at_all_estimate")]
df_new1$total_concerned <- df_new1$very_estimate + df_new1$somewhat_estimate
df_new1$total_not_concerned <- df_new1$not_very_estimate + df_new1$not_at_all_estimate
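To illustrate the name-versus-number trade-off described above, here is a small sketch on a toy data frame. The frame `toy` is hypothetical: it copies the eight-column layout shown by ‘str’, but its values are made up for illustration. It shows that selecting by position returns the same result as selecting by name, provided the column order matches.

```r
# Toy data frame with the same column layout as the source file
# (values are illustrative, not taken from the real data)
toy <- data.frame(
  subject             = c("concern-infected", "concern-economy"),
  modeldate           = c("8/19/2020", "8/19/2020"),
  party               = c("all", "all"),
  very_estimate       = c(34, 57),
  somewhat_estimate   = c(33, 31),
  not_very_estimate   = c(18, 8),
  not_at_all_estimate = c(12, 3),
  timestamp           = c("19-08-20 10:15", "19-08-20 10:15")
)

# Selection by name: verbose, but unaffected if columns are reordered
by_name <- toy[, c("subject", "modeldate", "very_estimate", "somewhat_estimate",
                   "not_very_estimate", "not_at_all_estimate")]

# Selection by position: shorter, but depends on the column order
by_index <- toy[, c(1, 2, 4:7)]

# Both approaches select the same six columns here
print(identical(by_name, by_index))
```

The positional form is more compact for wide tables, but it silently selects the wrong columns if the source file's layout changes.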
The ‘modeldate’ column name is replaced with ‘model_date’ for consistency with the other column names and improved readability. ‘colnames’ is used to verify the change.
colnames(df_new1)[colnames(df_new1) == "modeldate"] <- "model_date"
colnames(df_new1)
## [1] "subject" "model_date" "very_estimate"
## [4] "somewhat_estimate" "not_very_estimate" "not_at_all_estimate"
## [7] "total_concerned" "total_not_concerned"
The values in the ‘model_date’ column were in two different formats (using ‘-’ or ‘/’ as the separator, depending on the row). For the ‘as.Date’ function to work properly, they first needed to be harmonized. The ‘class’ and ‘head’ functions were used to verify that the column was converted properly.
df_new1$model_date <- str_replace_all(df_new1$model_date, '-', '/')
df_new1$model_date <- as.Date(df_new1$model_date, format="%m/%d/%Y")
class(df_new1$model_date)
## [1] "Date"
head(df_new1$model_date, n = 5L)
## [1] "2020-08-19" "2020-08-19" "2020-08-18" "2020-08-18" "2020-08-17"
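Because ‘head’ only shows the first few rows, a conversion problem further down the column can go unnoticed. The sketch below shows one way to check the whole column: ‘as.Date’ returns NA for any string that does not match the given format, so counting NA values reveals failed conversions. The sample strings and the helper name `parse_model_date` are assumptions for illustration only; in particular, the day-first, two-digit-year value is a hypothetical example of a string that would fail, not something confirmed to be in the file.

```r
# Hypothetical helper mirroring the two steps above:
# harmonize the separator, then parse as month/day/year
parse_model_date <- function(x) {
  x <- gsub("-", "/", x, fixed = TRUE)
  as.Date(x, format = "%m/%d/%Y")
}

# Illustrative inputs (assumed, not taken from the data set)
sample_dates <- c("8/19/2020", "8-18-2020", "19-08-20")
parsed <- parse_model_date(sample_dates)

# Strings that do not fit %m/%d/%Y come back as NA
print(parsed)
print(sum(is.na(parsed)))

# Show which original strings failed to parse
print(sample_dates[is.na(parsed)])
```

Running the same NA count on df_new1$model_date would show whether every row was converted.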
ggplot(data = df_new1, aes(x = model_date, y = total_concerned, group = subject, color = subject)) +
  scale_x_date(limits = as.Date(c("2020-02-01", "2020-08-31"))) +
  ggtitle("% Americans Concerned Over Time") +
  ylab("% Concerned") + xlab("Date") +
  geom_line()
## Warning: Removed 144 row(s) containing missing values (geom_path).
ggplot(data = df_new1, aes(x = model_date, y = total_not_concerned, group = subject, color = subject)) +
  scale_x_date(limits = as.Date(c("2020-02-01", "2020-08-31"))) +
  ggtitle("% Americans Not Concerned Over Time") +
  ylab("% Not Concerned") + xlab("Date") +
  geom_line()
## Warning: Removed 144 row(s) containing missing values (geom_path).
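The warning above means that some rows were left out of each plot. ggplot2 drops a row from a line when its x value is missing or falls outside the limits set in ‘scale_x_date’, so a quick count of such rows can help explain the number. The sketch below uses a small toy data frame with made-up values (the name `toy_plot` is hypothetical); I have not run it against the real data, so it only demonstrates the check itself.

```r
# Same axis limits as used in the plots
limits <- as.Date(c("2020-02-01", "2020-08-31"))

# Toy data: one date before the limits, one NA, two valid dates
toy_plot <- data.frame(
  model_date      = as.Date(c("2020-01-15", "2020-03-01", NA, "2020-08-19")),
  total_concerned = c(60, 65, 70, 68)
)

# A row would be dropped if its date is NA or lies outside the limits
dropped <- is.na(toy_plot$model_date) |
  toy_plot$model_date < limits[1] |
  toy_plot$model_date > limits[2]

print(sum(dropped))
print(toy_plot[dropped, ])
```

Applying the same logical test to df_new1 would show whether the removed rows come from unparsed dates or from dates outside the plotted range.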
Of those surveyed, considerably more Americans are concerned about the pandemic's effect on the economy than about being infected. Concern in both categories fell significantly in June but has been rising steadily since then. It would be interesting to dig deeper into the data and the specific events associated with spikes and dips to determine what relationship, if any, exists between global or national events (protests, stimulus payments, political speeches, etc.) and the survey data.