Load Data
data <- read_csv("covid_data_log_200922.csv")head(data)The data above was pulled from Kaggle, and represents various statistics around COVID in the USA. Each observation represents a particular county, and features include the total number of cases, total number of deaths, abnd other features related to gender counts, poverty rates, etc. The dataset also has two features āRisk_Indexā and āRisk_Catā which are presumably the response variables. In other words, a more advanced machine learning model could utilize this dataset to create a prediction for a given county as to whether or not their Covid related risks are higher.
For this assignment, I am interested in seeing if there is a linear relationship between: - a countyās poverty rate - and the number of reported COVID cases rate
My own intuition would lead me to believe that there is a positive relationship. Higher rates of poverty would suggest lower rates to adequate medical care, as well as lower rates of access to affordable housing. With this in mind, it is reasonable to assume that these countyās will have more people vulnerable to the virus, and more active in spreading it.
Clean and Transform Data
Checking our features / Cleaning and Sanity
The current data has log transformed values of: - total number of covid cases for a county within a given timeframe - total count of individuals classified as living in poverty in the county in 2019 - total population of the county in 2019
I want to compare poverty rates against COVID report rates, so we perform transformations
# checking key features for NA values
print(sum(is.na(data$Poverty)))## [1] 0
print(sum(is.na(data$Cases)))## [1] 0
print(sum(is.na(data$Population)))## [1] 0
data$population_invlog <- exp(data$Population) #inverse log population
data$poverty_invlog <- exp(data$Poverty) #inverse log poverty
data$cases_invlog <- exp(data$Cases) #inverse log #cases# creating poverty ratio and cases ratio
data <-data %>%
mutate(
poverty_ratio = poverty_invlog/population_invlog,
cases_ratio = cases_invlog/population_invlog
) %>%
select(cases_invlog, cases_ratio, poverty_invlog, poverty_ratio, population_invlog)summary(data)## cases_invlog cases_ratio poverty_invlog poverty_ratio
## Min. : 1 Min. : 0.000113 Min. : 152 Min. : 0.499
## 1st Qu.: 1770 1st Qu.: 0.116527 1st Qu.: 10606 1st Qu.: 0.961
## Median : 7187 Median : 0.250616 Median : 24934 Median : 0.978
## Mean : 75945 Mean : 0.442308 Mean : 101682 Mean : 2.076
## 3rd Qu.: 28644 3rd Qu.: 0.521544 3rd Qu.: 65540 3rd Qu.: 0.988
## Max. :8364024 Max. :11.851648 Max. :9951338 Max. :3487.105
## population_invlog
## Min. : 86
## 1st Qu.: 10902
## Median : 25726
## Mean : 104468
## 3rd Qu.: 68073
## Max. :10039107
First, there should be no poverty_ratios greater than 1. We will remove those.
data <- data %>%
filter(
poverty_ratio < 1
)data <- data %>%
filter(
cases_ratio < 1
)There may still be some outliers, we can create a function to identify by z_score
identify_outliers <- function(column){
column_z = (column - mean(column)) / sd(column)
return(column_z)
}# Identify outliers function produces a an array of associated z-scores for a column
# we then convert that array into a boolean, TRUE for anything below 3
data<- data[abs(identify_outliers(data$poverty_ratio)) < 3,]
data<- data[abs(identify_outliers(data$cases_ratio)) < 3,]Visualize Relationship
data %>%
ggplot(aes(x=poverty_ratio, y=cases_ratio)) +
geom_point() +
geom_smooth(method="lm")## `geom_smooth()` using formula 'y ~ x'
It doesnāt appear that there is a strong association with poverty_rate and cases_rate based on the aboe data. There is a very slight negative trend between poverty_ratio and cases_ratio. There is also an skew in poverty_ratio, where a large proportion of values being above 95%.
Building an initial Linear Model
linear_model <- lm(cases_ratio ~ poverty_ratio, data=data)
summary(linear_model)##
## Call:
## lm(formula = cases_ratio ~ poverty_ratio, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.32771 -0.18429 -0.06944 0.13303 0.70894
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.6375 0.1649 3.865 0.000114 ***
## poverty_ratio -0.3501 0.1702 -2.057 0.039829 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2352 on 2614 degrees of freedom
## Multiple R-squared: 0.001615, Adjusted R-squared: 0.001233
## F-statistic: 4.229 on 1 and 2614 DF, p-value: 0.03983
Looking at the above, we can see that poverty_rate alone is not an adequate predictor of cases_rate for a given county.
Plotting the residuals
ggplot(data = linear_model, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed") +
xlab("Fitted values") +
ylab("Residuals")There is more work to be done hereā¦