Load Data

data <- read_csv("covid_data_log_200922.csv")
head(data)

The data above was pulled from Kaggle, and represents various statistics around COVID in the USA. Each observation represents a particular county, and features include the total number of cases, total number of deaths, abnd other features related to gender counts, poverty rates, etc. The dataset also has two features ā€œRisk_Indexā€ and ā€œRisk_Catā€ which are presumably the response variables. In other words, a more advanced machine learning model could utilize this dataset to create a prediction for a given county as to whether or not their Covid related risks are higher.

For this assignment, I am interested in seeing if there is a linear relationship between: - a county’s poverty rate - and the number of reported COVID cases rate

My own intuition would lead me to believe that there is a positive relationship. Higher rates of poverty would suggest lower rates to adequate medical care, as well as lower rates of access to affordable housing. With this in mind, it is reasonable to assume that these county’s will have more people vulnerable to the virus, and more active in spreading it.

Clean and Transform Data

Checking our features / Cleaning and Sanity

The current data has log transformed values of: - total number of covid cases for a county within a given timeframe - total count of individuals classified as living in poverty in the county in 2019 - total population of the county in 2019

I want to compare poverty rates against COVID report rates, so we perform transformations

# checking key features for NA values
print(sum(is.na(data$Poverty)))
## [1] 0
print(sum(is.na(data$Cases)))
## [1] 0
print(sum(is.na(data$Population)))
## [1] 0
data$population_invlog <- exp(data$Population) #inverse log population
data$poverty_invlog <- exp(data$Poverty) #inverse log poverty
data$cases_invlog <- exp(data$Cases) #inverse log #cases
# creating poverty ratio and cases ratio
data <-data %>%
  mutate(
    poverty_ratio = poverty_invlog/population_invlog,
    cases_ratio = cases_invlog/population_invlog
  ) %>%
  select(cases_invlog, cases_ratio, poverty_invlog, poverty_ratio, population_invlog)
summary(data)
##   cases_invlog      cases_ratio        poverty_invlog    poverty_ratio     
##  Min.   :      1   Min.   : 0.000113   Min.   :    152   Min.   :   0.499  
##  1st Qu.:   1770   1st Qu.: 0.116527   1st Qu.:  10606   1st Qu.:   0.961  
##  Median :   7187   Median : 0.250616   Median :  24934   Median :   0.978  
##  Mean   :  75945   Mean   : 0.442308   Mean   : 101682   Mean   :   2.076  
##  3rd Qu.:  28644   3rd Qu.: 0.521544   3rd Qu.:  65540   3rd Qu.:   0.988  
##  Max.   :8364024   Max.   :11.851648   Max.   :9951338   Max.   :3487.105  
##  population_invlog 
##  Min.   :      86  
##  1st Qu.:   10902  
##  Median :   25726  
##  Mean   :  104468  
##  3rd Qu.:   68073  
##  Max.   :10039107

First, there should be no poverty_ratios greater than 1. We will remove those.

data <- data %>%
  filter(
    poverty_ratio < 1
  )
data <- data %>%
  filter(
    cases_ratio < 1
  )

There may still be some outliers, we can create a function to identify by z_score

identify_outliers <- function(column){
  column_z = (column - mean(column)) / sd(column)
  return(column_z)
}
# Identify outliers function produces a an array of associated z-scores for a column
# we then convert that array into a boolean, TRUE for anything below 3

data<- data[abs(identify_outliers(data$poverty_ratio)) < 3,]
data<- data[abs(identify_outliers(data$cases_ratio)) < 3,]

Visualize Relationship

data %>%
  ggplot(aes(x=poverty_ratio, y=cases_ratio)) +
  geom_point() +
  geom_smooth(method="lm")
## `geom_smooth()` using formula 'y ~ x'

It doesn’t appear that there is a strong association with poverty_rate and cases_rate based on the aboe data. There is a very slight negative trend between poverty_ratio and cases_ratio. There is also an skew in poverty_ratio, where a large proportion of values being above 95%.

Building an initial Linear Model

linear_model <- lm(cases_ratio ~ poverty_ratio, data=data)
summary(linear_model)
## 
## Call:
## lm(formula = cases_ratio ~ poverty_ratio, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.32771 -0.18429 -0.06944  0.13303  0.70894 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     0.6375     0.1649   3.865 0.000114 ***
## poverty_ratio  -0.3501     0.1702  -2.057 0.039829 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2352 on 2614 degrees of freedom
## Multiple R-squared:  0.001615,   Adjusted R-squared:  0.001233 
## F-statistic: 4.229 on 1 and 2614 DF,  p-value: 0.03983

Looking at the above, we can see that poverty_rate alone is not an adequate predictor of cases_rate for a given county.

Plotting the residuals

ggplot(data = linear_model, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  xlab("Fitted values") +
  ylab("Residuals")

There is more work to be done here…