Project Two

library(tidyverse)

## -- Attaching packages -------------------------------------------------------------------- tidyverse 1.2.1 --

## v ggplot2 3.2.1     v purrr   0.3.2
## v tibble  2.1.3     v dplyr   0.8.3
## v tidyr   1.0.0     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0

## -- Conflicts ----------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(dplyr)
library(ggplot2)
library(plotly)

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

I chose to work with the Maternal Mortality dataset.

setwd("C:/Users/munis/Documents/Comm in Data Science")
mortality <- read_csv("maternalmortality.csv")

## Parsed with column specification:
## cols(
##   State = col_character(),
##   MMR = col_double(),
##   Prenatal = col_double(),
##   Csection = col_double(),
##   Underserved = col_double(),
##   Uninsured = col_double(),
##   Population_18 = col_double()
## )

I used dplyr to select the variables I would work with from the data set.

mortality <- mortality %>% 
  select(MMR, Csection, Underserved, Uninsured, Population_18)
head(mortality)

## # A tibble: 6 x 5
##     MMR Csection Underserved Uninsured Population_18
##   <dbl>    <dbl>       <dbl>     <dbl>         <dbl>
## 1   9.6     33.8          55      18.1       4887871
## 2   5       22.6          50      19.8        737438
## 3   7.2     26.2          51      22.3       7171646
## 4  14.6     34.8          34      23.3       3013825
## 5  11.3     32.1          49      20.9      39557045
## 6  11       25.8          42      18         5695564

I graph the percentage of C-sections against the maternal mortality rate (MMR) for each state, and I add a regression line to show the trend.

ggplot(data = mortality, aes(x = Csection, y = MMR)) + 
  geom_point() +
  geom_smooth(method = "lm") +
  ggtitle("C-Section Percentage vs. Maternal Mortality Rate")

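To check how strongly the two variables are related, I fit a linear model and look at the adjusted R-squared in its summary. Note that the model below regresses Csection on MMR, the reverse of the plot's axes; for a one-predictor model this does not change R-squared. The adjusted R-squared is small (about 0.086), although the slope is significant at the 5% level (p = 0.021). Nonetheless, I continue on.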

line1 <- lm(Csection ~ MMR, data = mortality)
summary(line1)

## 
## Call:
## lm(formula = Csection ~ MMR, data = mortality)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.8555 -2.1984 -0.0618  2.5403  7.4323 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 28.26924    1.11055  25.455   <2e-16 ***
## MMR          0.22995    0.09651   2.383   0.0211 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.667 on 49 degrees of freedom
## Multiple R-squared:  0.1038, Adjusted R-squared:  0.08555 
## F-statistic: 5.678 on 1 and 49 DF,  p-value: 0.02111
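As a quick check on the point about model direction, the sketch below fits the regression both ways and compares R-squared with the squared Pearson correlation. It uses only the six rows printed by head(mortality) above, stored in a hypothetical data frame called sample_rows, so the numbers will not match the full 51-row output.

```r
# Six rows copied from the head(mortality) output; sample_rows is a
# hypothetical name used only for this illustration.
sample_rows <- data.frame(
  MMR      = c(9.6, 5, 7.2, 14.6, 11.3, 11),
  Csection = c(33.8, 22.6, 26.2, 34.8, 32.1, 25.8)
)

# Fit the simple regression in both directions.
fit_mmr_on_csection <- lm(MMR ~ Csection, data = sample_rows)
fit_csection_on_mmr <- lm(Csection ~ MMR, data = sample_rows)

r2_forward <- summary(fit_mmr_on_csection)$r.squared
r2_reverse <- summary(fit_csection_on_mmr)$r.squared

# Pearson correlation between the two variables.
r <- cor(sample_rows$MMR, sample_rows$Csection)

# With one predictor, R-squared is the squared correlation,
# so it is the same whichever variable is the response.
same_both_ways <- isTRUE(all.equal(r2_forward, r2_reverse))
equals_r_squared <- isTRUE(all.equal(r2_forward, r^2))
```

Applied to the full data, the multiple R-squared of 0.1038 reported above corresponds to a correlation of roughly 0.32 in absolute value. The slope and intercept do change with direction, so a model meant to match the plot would use MMR as the response.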

Now, I graph the percentage of births in medically underserved areas against each state's maternal mortality rate. There doesn't seem to be much correlation.

maternal_mortality <- ggplot(data = mortality, aes(x = Underserved, y = MMR)) + 
  geom_point() +
  geom_smooth(method = "lm") +
  ggtitle("Underserved Percentage vs. Maternal Mortality Rate")
maternal_mortality

Indeed, there was essentially no linear relationship: R-squared is below 0.01 and the slope is not significant (p = 0.544).

line2 <- lm(Underserved ~ MMR, data = mortality)
summary(line2)

## 
## Call:
## lm(formula = Underserved ~ MMR, data = mortality)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -21.222  -4.803   1.838   6.723  16.617 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  42.1787     2.4562  17.172   <2e-16 ***
## MMR           0.1305     0.2134   0.611    0.544    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.109 on 49 degrees of freedom
## Multiple R-squared:  0.007566,   Adjusted R-squared:  -0.01269 
## F-statistic: 0.3736 on 1 and 49 DF,  p-value: 0.5439

I repeated the same steps with the percentage of women who were uninsured.

ggplot(data = mortality, aes(x = Uninsured, y = MMR)) + 
  geom_point() +
  geom_smooth(method = "lm") +
  ggtitle("Uninsured Percentage vs. Maternal Mortality Rate")

Again, the relationship was weak: R-squared is about 0.07 and the slope is not significant at the 5% level (p = 0.067).

line3 <- lm(Uninsured ~ MMR, data = mortality)
summary(line3)

## 
## Call:
## lm(formula = Uninsured ~ MMR, data = mortality)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.5322  -3.8249  -0.4974   3.3754  11.7157 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  14.1394     1.3870  10.194 1.06e-13 ***
## MMR           0.2262     0.1205   1.876   0.0666 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.579 on 49 degrees of freedom
## Multiple R-squared:  0.06703,    Adjusted R-squared:  0.04799 
## F-statistic:  3.52 on 1 and 49 DF,  p-value: 0.06658
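The three simple regressions above repeat the same steps, so they could be collected into one small table. The sketch below is one possible way to do that, not the method used in this report. It assumes MMR as the response (the models above use it as the predictor), and it runs on the six rows printed by head(mortality), stored in a hypothetical sample_rows data frame, so the values differ from the full-data output.

```r
# Six rows copied from the head(mortality) output (illustration only).
sample_rows <- data.frame(
  MMR         = c(9.6, 5, 7.2, 14.6, 11.3, 11),
  Csection    = c(33.8, 22.6, 26.2, 34.8, 32.1, 25.8),
  Underserved = c(55, 50, 51, 34, 49, 42),
  Uninsured   = c(18.1, 19.8, 22.3, 23.3, 20.9, 18)
)

predictors <- c("Csection", "Underserved", "Uninsured")

# Fit MMR ~ <predictor> for each predictor and keep the key numbers.
fit_one <- function(p) {
  fit <- lm(reformulate(p, response = "MMR"), data = sample_rows)
  s <- summary(fit)
  data.frame(
    predictor = p,
    slope     = unname(coef(fit)[2]),
    r_squared = s$r.squared,
    p_value   = s$coefficients[2, "Pr(>|t|)"]
  )
}

r2_table <- do.call(rbind, lapply(predictors, fit_one))
r2_table
```

Swapping sample_rows for the full mortality data frame would give one row per predictor, which makes the three fits easier to compare side by side.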

Next, I return to the percentage of C-sections and the maternal mortality rate, this time making the size of each point reflect the state's population: the bigger the population, the larger the point. I added a regression line and used plotly to make the graph interactive, so hovering over a point shows its actual values. Because the y-axis is limited to 0 to 22, the one state with a rate above 22 is dropped from the plot and from the regression line, which is why the warning below appears.

mortality_graph <- ggplot(data = mortality, aes(x = Csection, y = MMR, size = Population_18)) + 
  geom_point(alpha = 0.5, color = "blue") +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  ylim(0, 22) +
  ggtitle("C-Section Percentage vs. Maternal Mortality Rate") +
  theme_light(base_size = 11) +
  theme(legend.position = "right")
mortality_graph <- ggplotly(mortality_graph)

## Warning: Removed 1 rows containing non-finite values (stat_smooth).

mortality_graph
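The warning above arises because ylim() discards out-of-range values before the regression line is computed, whereas coord_cartesian() only zooms the view and leaves the data intact. The sketch below illustrates the difference on made-up toy data (not values from maternalmortality.csv); the two lm() fits mirror what the smoother sees in each case.

```r
library(ggplot2)

# Hypothetical toy data, NOT taken from the mortality data set:
# five ordinary points plus one value above the axis limit of 22.
toy <- data.frame(x = c(1, 2, 3, 4, 5, 6),
                  y = c(2, 4, 5, 7, 9, 40))

# ylim() turns the out-of-range point into NA, so it is dropped
# from both the points and the fitted line.
p_clipped <- ggplot(toy, aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  ylim(0, 22)

# coord_cartesian() only zooms: the line is still fitted to all six points.
p_zoomed <- ggplot(toy, aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  coord_cartesian(ylim = c(0, 22))

# The equivalent fits, computed directly.
fit_clipped <- lm(y ~ x, data = subset(toy, y >= 0 & y <= 22))
fit_zoomed  <- lm(y ~ x, data = toy)

slope_clipped <- unname(coef(fit_clipped)["x"])
slope_zoomed  <- unname(coef(fit_zoomed)["x"])
```

Which behavior is preferable depends on the goal: clipping hides the outlier's pull on the line, while zooming keeps the line consistent with the full-data regression.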

I created the scatterplot matrix below to see which variables were correlated with each other, but none of them are strongly correlated. The highest correlation coefficient was only 0.346, which indicates a weak correlation.

library(GGally)

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

## 
## Attaching package: 'GGally'

## The following object is masked from 'package:dplyr':
## 
##     nasa

ggpairs(mortality, columns = 1:5)
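The largest coefficient in a plot like this can also be found numerically with cor(). The sketch below shows one way to do it, again on the six rows printed by head(mortality) (stored in a hypothetical sample_rows data frame), so the result will not reproduce the 0.346 quoted above.

```r
# Six rows copied from the head(mortality) output (illustration only).
sample_rows <- data.frame(
  MMR           = c(9.6, 5, 7.2, 14.6, 11.3, 11),
  Csection      = c(33.8, 22.6, 26.2, 34.8, 32.1, 25.8),
  Underserved   = c(55, 50, 51, 34, 49, 42),
  Uninsured     = c(18.1, 19.8, 22.3, 23.3, 20.9, 18),
  Population_18 = c(4887871, 737438, 7171646, 3013825, 39557045, 5695564)
)

# Pearson correlation matrix for all five variables.
cor_mat <- cor(sample_rows)

# Blank out the diagonal (each variable with itself is always 1).
off_diag <- cor_mat
diag(off_diag) <- NA

# Locate the strongest remaining correlation, in absolute value.
strongest_value <- max(abs(off_diag), na.rm = TRUE)
strongest <- which(abs(off_diag) == strongest_value, arr.ind = TRUE)[1, ]
pair <- c(rownames(cor_mat)[strongest["row"]],
          colnames(cor_mat)[strongest["col"]])

round(cor_mat, 3)
pair
```

Running the same steps on the full mortality data frame would give the matrix behind the ggpairs() panels.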

The diagnostic plots from plot(fit2), run at the end of this report, give us a lot of information. The first shows that the relationship is roughly linear, but with a major outlier: point 51 sits far above the other residuals. This point is Washington D.C., which has a much higher maternal mortality rate than the model predicts, possibly reflecting poor conditions there. Another extreme point is number 10, which is Georgia.

The second plot I looked at shows whether the residuals are spread evenly along the x-axis. Ideally, the line would be horizontal. Although the line isn't perfectly horizontal, it is flat enough to suggest that the spread is roughly constant and that a linear fit is reasonable.

The fourth and final plot shows the influential points, meaning observations that can shift the regression line. Points 51 (Washington D.C.), 31 (New Mexico), and 43 (Texas) stand out the most.

fit2 <- lm(MMR ~ Csection + Underserved + Uninsured + Population_18, data = mortality)
summary(fit2)

## 
## Call:
## lm(formula = MMR ~ Csection + Underserved + Uninsured + Population_18, 
##     data = mortality)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -7.978 -2.659 -0.833  1.458 24.104 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   -9.722e+00  7.993e+00  -1.216   0.2301  
## Csection       4.617e-01  2.076e-01   2.224   0.0311 *
## Underserved    5.995e-02  1.007e-01   0.596   0.5543  
## Uninsured      2.187e-01  1.778e-01   1.230   0.2249  
## Population_18 -6.455e-08  1.070e-07  -0.603   0.5492  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.141 on 46 degrees of freedom
## Multiple R-squared:  0.1578, Adjusted R-squared:  0.08452 
## F-statistic: 2.154 on 4 and 46 DF,  p-value: 0.08922

plot(fit2)
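Influential points can also be listed numerically instead of read off the plot. The sketch below uses Cook's distance with a 4/n cutoff, which is a common rule of thumb and an assumption here, not something this report relies on. It fits a one-predictor model to the six rows printed by head(mortality) (hypothetical sample_rows data frame) purely for illustration.

```r
# Six rows copied from the head(mortality) output (illustration only).
sample_rows <- data.frame(
  MMR      = c(9.6, 5, 7.2, 14.6, 11.3, 11),
  Csection = c(33.8, 22.6, 26.2, 34.8, 32.1, 25.8)
)

# Simplified stand-in for fit2, which needs more rows than this sample has.
fit <- lm(MMR ~ Csection, data = sample_rows)

# Cook's distance measures how much each row moves the fitted line.
cooks <- cooks.distance(fit)

# Rule-of-thumb cutoff (assumption): flag rows with distance above 4 / n.
threshold <- 4 / nrow(sample_rows)
flagged <- which(cooks > threshold)

# One row per observation: influence, standardized residual, leverage.
diagnostics <- data.frame(
  row       = seq_along(cooks),
  cooks_d   = round(cooks, 3),
  std_resid = round(rstandard(fit), 2),
  leverage  = round(hatvalues(fit), 2)
)
diagnostics
```

With the full data, replacing fit with fit2 would give row numbers that can be compared with the labeled points in the plots. Calling par(mfrow = c(2, 2)) before plot(fit2) shows all four diagnostic plots on one page.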