Introduction

Humans are products of genetics, but also of their environments. This realization raises the question of how factors in areas such as work and health affect a person’s ultimate outcome: life expectancy. By exploring variables within these two areas, we can make observations about their effects on human lifespan. To answer this question, I have formulated a separate hypothesis for each area, health and work.

Commuting zones with less access to healthcare in terms of insurance and hospital visits, and higher populations of unhealthy people, such as smokers and obese people, have decreased average life expectancies.

Commuting zones with lower labor force participation rates, declining economies, and higher populations working in manufacturing have decreased average life expectancies.

These hypotheses carry weight, and it is important to use several tools to test them. In this project, I begin with simple linear regression to analyze and separate the variables I use from Opportunity Insights (OI). OI is a non-profit dedicated to studying and finding solutions to economic inequality within America through data exploration. The OI website provides hundreds of datasets that “allow you to analyze social mobility and a variety of other outcomes from life expectancy to patent rates by neighborhood, college, parental income level, and racial background” (Opportunity Insights, 2018). To explore these questions, I used two datasets from OI that group life expectancy averages by commuting zone. To learn more about this approach, see the Data section.

To begin to use this data to answer these questions, I performed simple linear regressions to visualize and understand more basic relationships between various variables and life expectancy, which would help me later on when creating a multiple regression model.

Simple Regression Exploration

Before beginning to create a multiple regression model, it is important to understand the relationships that exist between the predictor and response variables within linear models, and to do this, we can use simple linear regression.

Health Variables

Obesity

To begin exploring the data, I decided to first explore the effect of rates of obesity within commuting zones on life expectancy within those CZ’s. The OI data groups both life expectancy and obesity rates by income quartile, so I chose to explore each of the four income quartiles through a scatterplot and linear regression output in R. Figures 1.1 through 1.4 show the scatterplots and regression outputs of life expectancy plotted on the fractional average rate of obesity within each respective income quartile.
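As a minimal sketch of this step, the code below fits one simple regression per income quartile. The column naming (`le_q1_both`, `bmi_obese_q1`, and so on) follows the regression calls shown in the Appendix; the helper `fit_quartiles` and the simulated data frame are assumptions made only so the example runs on its own, and they are not the Opportunity Insights data.

```r
# Hypothetical helper: regress life expectancy on one predictor per quartile.
# Column naming (le_q<k>_both, <prefix>_q<k>) mirrors the Appendix output.
fit_quartiles <- function(data, predictor_prefix) {
  fits <- lapply(1:4, function(k) {
    response  <- paste0("le_q", k, "_both")
    predictor <- paste0(predictor_prefix, "_q", k)
    lm(reformulate(predictor, response = response), data = data)
  })
  names(fits) <- paste0("Q", 1:4)
  fits
}

# Simulated stand-in data (NOT the Opportunity Insights data).
set.seed(1)
n <- 200
cols <- list()
for (k in 1:4) {
  x <- runif(n, 0.1, 0.4)
  cols[[paste0("bmi_obese_q", k)]] <- x
  cols[[paste0("le_q", k, "_both")]] <- 78 + 2 * k - 3 * x + rnorm(n, sd = 0.3)
}
demo <- as.data.frame(cols)

fits <- fit_quartiles(demo, "bmi_obese")

# One row per quartile: intercept and slope
coef_table <- t(sapply(fits, coef))
colnames(coef_table) <- c("intercept", "slope")
print(round(coef_table, 3))
```

With the real data, `summary(fits$Q1)` would give the same kind of output as Figure 1.1.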

By exploring each quartile separately, we can observe how rates of obesity affect each income tier differently. The regression outputs for each quartile are as follows (where X is the fractional rate of obesity):

** Quartile 1: (LE) = 79.9237 - 2.3808X **
** Quartile 2: (LE) = 83.3790 - 3.2225X **
** Quartile 3: (LE) = 85.0321 - 1.9639X **
** Quartile 4: (LE) = 86.9885 - 3.1784X **

Here, it is interesting to observe the Y-intercept of each income quartile: as income level goes up, so does the intercept. The Y-intercept is the expected value of Y when X is zero, so the expected life expectancy at an obesity rate of zero increases with income. The coefficients are all negative, which tells us that as the fractional average of the obese population within a commuting zone increases, life expectancy decreases, no matter the income quartile. The coefficients are also interesting because there is no clear pattern between income quartile and coefficient. Keep in mind that the dataset uses the fractional average of obesity within a commuting zone.

To interpret the coefficient, we can use an example. In the South Boston commuting zone, the fractional rate of obesity within the first quartile is 0.285. Let’s substitute this value into the linear regression equation for the first income quartile:

79.9237 - 2.3808(0.285) = 79.245

This value tells us that within the first income quartile, if a commuting zone has an average obesity rate of 0.285, the predicted life expectancy of that commuting zone is 79.245 years. Now, let’s use this same value of 0.285 in the regression equation for the fourth income quartile:

86.9885 - 3.1784(0.285) = 86.0826

The difference between these two outputs is about 6.84 years, telling us that even when the rate of obesity in two income quartiles is the same, people in the fourth income quartile have a longer predicted life expectancy.
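The substitution above can be reproduced in R. This is a small sketch that uses only the intercepts and slopes reported for the four obesity regressions; the names `obesity_fits` and `predict_le` are introduced here for illustration.

```r
# Coefficients copied from the four obesity regressions reported above.
obesity_fits <- data.frame(
  quartile  = paste0("Q", 1:4),
  intercept = c(79.9237, 83.3790, 85.0321, 86.9885),
  slope     = c(-2.3808, -3.2225, -1.9639, -3.1784)
)

# Illustrative helper: predicted life expectancy at obesity rate x.
predict_le <- function(intercept, slope, x) intercept + slope * x

# South Boston's first-quartile obesity rate, applied to every quartile
obesity_fits$predicted <- predict_le(obesity_fits$intercept,
                                     obesity_fits$slope, 0.285)
print(obesity_fits)

# Gap between the fourth and first quartile at the same obesity rate
gap_q4_q1 <- obesity_fits$predicted[4] - obesity_fits$predicted[1]
print(round(gap_q4_q1, 2))
```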

These results are statistically significant (p = 0.0001238, p = 2.593e-08, p = 2.07e-05, p = 5.669e-09, in order of appearance), so we can reasonably conclude that there is an association between a higher fractional average of obese people and a lower life expectancy.

Smoking

Seeing as the smoking data was also separated by income quartile, I decided to complete a linear regression exploration of this variable as well. Figures 1.5 through 1.8 show the scatterplots and regression outputs of life expectancy plotted on the fractional rate of current smokers within each respective income quartile.

By exploring each quartile separately, we can observe how smoking rates affect each income tier differently. The regression outputs for each quartile are as follows (where X is the fractional rate of current smokers):

** Quartile 1: (LE) = 80.689 - 5.352X **
** Quartile 2: (LE) = 83.1997 - 3.1559X **
** Quartile 3: (LE) = 84.8161 - 1.7895X **
** Quartile 4: (LE) = 86.7688 - 4.1519X **

Again, we see that as the income quartile increases, so does the Y-intercept, which is equal to the expected life expectancy when the rate of smokers is equal to zero. However, no matter the income quartile, predicted life expectancy decreases as the rate of smoking increases. It is interesting to note the different coefficients, as there is no clear relationship between their values and income quartile, as there is with the Y-intercept. The first income quartile’s coefficient of -5.352 is the largest in magnitude, which tells us that, in this dataset, smoking has a greater effect on life expectancy within the first income quartile than within any other quartile.

To again interpret the coefficients, we can use an example. In the South Boston commuting zone, the current percentage of smokers within the first quartile is 0.252. Let’s substitute this value into the linear regression equation for the first income quartile:

80.689 - 5.352(0.252) = 79.3403

The difference between the Y-intercept (the predicted life expectancy when the rate of smokers is zero) and the predicted value when the percentage of smokers is 25.2% is 1.3487 years. This tells us that as the percentage of smokers within a commuting zone increases, predicted life expectancy decreases, which is consistent with the conclusion that smoking decreases life expectancy.

Because the p-values are statistically significant (p = 2.238e-15, p = 3.022e-05, p = 0.002072, p = 8.884e-08), we can draw conclusions from these models. One possible explanation for why smoking is more strongly associated with life expectancy in the first income quartile than in any other quartile is the emphasis tobacco companies place on marketing within lower income communities.

Percentage of Uninsured People

The next health related variable I chose to explore was the percentage of uninsured people within a commuting zone. I chose this specific variable because the decision to have health insurance is not one that a person is always free to make. For example, a person can decide to start smoking, which is detrimental to one’s health. A person cannot always choose to have medical insurance, especially if they are unemployed and not receiving benefits from a job, or if they simply cannot afford the rising costs of health insurance. Health insurance is indicative of health in that it makes essential health services such as emergency room visits, laboratory tests, substance-abuse treatment, and prescription drugs accessible to those who cannot afford these services without assistance.

Although the health insurance variable is not grouped by income quartile like the prior two, I elected to again plot it against the life expectancy estimates of the four income quartiles to observe differences between them, and how being uninsured affects different levels of income differently. These scatterplots and regression outputs are shown in Figures 1.9 to 1.12, and the regression outputs are as follows (where X is the percentage of uninsured people within a commuting zone):

** Quartile 1: (LE) = 79.6307 - 2.3431X **
** Quartile 2: (LE) = 83.8513 - 7.4084X **
** Quartile 3: (LE) = 86.1536 - 9.0602X **
** Quartile 4: (LE) = 87.7835 - 8.4030X **

The trend of the Y-intercept increasing as income increases is again present. However, there is a change in the trend of the coefficients. Here, the coefficient for Q1 is the smallest in magnitude, while Q3 has the largest, meaning that being uninsured has a greater effect on life expectancies within the 3rd and 4th quartiles, as they have the largest coefficients of the four.

To interpret the coefficient, we can use an example. In the South Boston commuting zone, the percentage of uninsured people was 0.0501, or 5.01%. Let’s substitute this value into the linear regression equation for the first income quartile as well as for the fourth quartile:

Quartile 1: 79.6307 - 2.3431(0.0501) = 79.5133 Difference from intercept: 0.1174

Quartile 4: 87.7835 - 8.4030(0.0501) = 87.3625 Difference from intercept: 0.4210

Despite the percentage of uninsured people being the same, a 5% rate of uninsured people was associated with a larger drop in the life expectancies of people within the fourth income quartile. Here, the p-values are statistically significant (p = 0.01022, p < 2.2e-16, p < 2.2e-16, p < 2.2e-16), meaning we can draw conclusions from the results. One possible explanation is that more people within the 3rd and 4th income quartiles are able to afford health insurance, so the majority of them have it, and this may be why the coefficients are larger for these quartiles.

Work/Labor Variables

Fraction of People Working In Manufacturing

Working in an industry such as manufacturing can be dangerous, as one is surrounded by and working on heavy machinery and moving parts. This can cause workplace injuries, which can leave one unable to work for months at a time. Not being employed can mean the difference between being able to pay for a doctor’s visit and not, and for that reason, among others, I chose to explore this variable within each income quartile. The scatterplots and regression outputs are shown in Figures 1.13 to 1.16, and the linear regression outputs are as follows (where X is equal to the fraction of people in a commuting zone working in manufacturing):

** Quartile 1: (LE) = 79.5143 - 1.9435X **
** Quartile 2: (LE) = 82.48242 + 0.18055X **
** Quartile 3: (LE) = 84.42843 + 0.54356X **
** Quartile 4: (LE) = 86.4027 - 0.8812X **

Within these four models, the only statistically significant result comes from the first income quartile (p = 0.0009991). Because of this, I will only continue to analyze data from the first income quartile. This raises the question of why only the Q1 result is significant. People working in manufacturing or blue collar industries are often paid less than those with office jobs, so one possible explanation is that Q1 has the highest percentage of people working in manufacturing compared to other quartiles.

Figure 1.13 shows that as the fraction of people working in manufacturing increases, life expectancy decreases. To interpret the negative coefficient, we can again use the value from the South Boston commuting zone to see the impact of working in manufacturing on life expectancy. In the South Boston commuting zone, 0.1203 or 12.03% of the population works in manufacturing. Substituting this value into the Q1 regression output, we see:

79.5143 - 1.9435(0.1203) = 79.2805

This value tells us that when 12.03% of a commuting zone’s population works in manufacturing, the predicted life expectancy in the first income quartile is 79.2805 years, which is 0.2338 years below the intercept of 79.5143. Overall, the trend shows that as the average percentage of people working in manufacturing increases, life expectancy decreases.

Labor Force Participation Rate

The next variable I chose to explore with linear regression is the labor force participation rate. This variable was not separated by income quartile, but I again chose to regress each income quartile’s life expectancy variable on it. The labor force participation rate is the percentage of the civilian, non-institutionalized population aged 16 and older that is working or actively looking for work. I chose to explore this variable because labor force participation is indicative of many things, such as the amount of labor resources available within a given market. The scatterplots and regression outputs are shown in Figures 1.17 to 1.20, and the regression outputs are as follows (where X is the labor force participation rate in a commuting zone, expressed as a fraction):

** Quartile 1: (LE) = 77.9314 + 2.0669X **
** Quartile 2: (LE) = 79.4858 + 4.9025X **
** Quartile 3: (LE) = 81.0273 + 5.6511X **
** Quartile 4: (LE) = 81.8176 + 7.2043X **

Here, the trend of the Y-intercept, or predicted life expectancy when the labor force participation rate is zero, increasing alongside income quartile is again apparent. The coefficient also increases as income quartile increases. All four coefficients are positive, showing an overall trend of life expectancy increasing as the fraction of people in the labor force increases in a commuting zone. Because the results are statistically significant (p = 0.008658, p = 1.553e-11, p = 8.98e-16, p < 2.2e-16), we can draw conclusions from these models.

To make sense of the coefficients, we can use the South Boston commuting zone as an example for two income quartiles, in this case the first and the fourth. In the South Boston commuting zone, the labor force participation rate is 0.665.

Quartile 1: 77.9314 + 2.0669(0.665) = 79.3059 Difference from intercept = + 1.3745

Quartile 4: 81.8176 + 7.2043(0.665) = 86.6085 Difference from intercept = + 4.7909

Within the same commuting zone, we can see that the same labor force participation rate corresponds to a greater increase in predicted life expectancy in the fourth income quartile than in the first, with the difference between the two increases being 3.4164 years. This tells us that the labor force participation rate has a greater impact within the fourth quartile than within any other quartile.

Fraction of People with a Commute Less than 15 Minutes

The next variable I chose to explore was the fraction of people within a commuting zone with a commute time to work of less than fifteen minutes. I chose to explore this variable’s relation to life expectancy because longer commutes have been linked to higher rates of stress and mental health issues. If one’s commute consists of driving, they are on the road longer, meaning they have an increased chance of being involved in a car accident, which can also affect life expectancy. Again, I chose to explore this variable’s effect on the life expectancy of different income quartiles. The scatterplots and regression outputs are shown in Figures 1.21 to 1.24, where X is the fraction of people with a commute time of less than 15 minutes in a commuting zone:

** Quartile 1: (LE) = 78.7829 + 1.0295X **
** Quartile 2: (LE) = 81.3830 + 2.7391X **
** Quartile 3: (LE) = 83.3716 + 2.7752X **
** Quartile 4: (LE) = 85.8372 + 1.0346X **

The same trend of the Y-intercept increasing as the quartile increases is again present. The Y-intercept represents predicted life expectancy when the fraction of people within a commuting zone with commutes shorter than fifteen minutes is zero. However, the coefficients seem to have no direct relationship to income quartile, as there is no pattern. All coefficients are positive, which shows that as the fraction of people with a commute shorter than 15 minutes increases, life expectancy increases as well.

Because our results are statistically significant (p = 0.01281, p = 6.394e-13, p = 6.823e-14, p = 0.01024), we can draw conclusions from these models. It is interesting to observe that the income quartiles with the greatest p-values, Q1 and Q4, have the smallest coefficients. Since X measures short commutes, as in the heading of this section, these results are in line with my original hypothesis that as a population has greater commuting times, their overall life expectancy will decrease.

Percent Change in Labor Force 1980-2000

The last work related variable I chose to explore with simple linear regression was the percentage change in the labor force from 1980 to 2000. I chose this variable because it can indicate the growth or decay of a commuting zone’s economy through how many people are employed in that commuting zone’s economy. Figures 1.25 through 1.28 show scatterplots and regression outputs for the percentage change in labor force plotted on life expectancy by income quartile, where X is equal to the percentage change in labor force in a commuting zone.

** Quartile 1: (LE) = 78.91462 + 1.04686X **
** Quartile 2: (LE) = 82.39775 + 0.40574X **
** Quartile 3: (LE) = 84.47788 + 0.13098X **
** Quartile 4: (LE) = 86.11023 + 0.54831X **

Again, the trend of the Y-intercept increasing as income quartile increases is present. The Y-intercept represents the mean life expectancy predicted when the percentage change in labor force in a commuting zone is zero. Across income quartiles, there seems to be no trend in the coefficients. The coefficients are all positive, which indicates that as the percentage change in labor force increases, life expectancy increases. Because the results from the third income quartile are not significant (p = 0.3909), I am only going to explore the results from Q1, Q2, and Q4, which are statistically significant (p = 1.797e-10, p = 0.009416, p = 0.0007074).
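The significance screen described here can be written as a short check. This sketch assumes a significance level of 0.05, which the text does not state, and assumes the three significant p-values belong to Q1, Q2, and Q4 in that order.

```r
# p-values as reported for the labor force change regressions.
# Assumption: the three significant values map to Q1, Q2, and Q4, in order.
p_values <- c(Q1 = 1.797e-10, Q2 = 0.009416, Q3 = 0.3909, Q4 = 0.0007074)
alpha <- 0.05  # assumed significance level; not stated in the text

significant <- p_values < alpha
print(significant)

# Quartiles whose slope is not distinguishable from zero at this level
print(names(p_values)[!significant])
```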

To better interpret what these regression equations mean, I am going to compare life expectancy in the first income quartile in two different commuting zones, a zone where there has been negative labor force participation change, and a zone where there has been positive labor force participation change. I am going to use the Welch commuting zone in West Virginia with a negative percentage change of -0.3743832, and the Spartanburg commuting zone in South Carolina with a positive percentage change of 0.2851822. Inserting these values into the Q1 regression equation gives us:

Welch: 78.91462 + 1.04686(-0.3743832) = 78.5226932
Difference from intercept: - 0.3919268
Spartanburg: 78.91462 + 1.04686(0.2851822) = 79.21316583
Difference from intercept: + 0.29854583
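The two substitutions can be checked in R. This sketch uses only the Q1 intercept and slope and the two percentage changes quoted above; the variable names are introduced for illustration.

```r
# Q1 regression of life expectancy on percentage change in labor force
q1_intercept <- 78.91462
q1_slope     <- 1.04686

# Percentage change in labor force, 1980-2000, for the two example zones
labor_change <- c(Welch = -0.3743832, Spartanburg = 0.2851822)

predicted <- q1_intercept + q1_slope * labor_change
shift     <- predicted - q1_intercept  # change relative to a zone with zero change

print(round(cbind(predicted, shift), 4))
```

The sign of `shift` follows the sign of the labor force change because the Q1 slope is positive.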

These results show that a negative change in the labor force, or a decline in the number of participants in the labor force in a commuting zone, is associated with a decrease in life expectancy in the first income quartile. A positive change in the labor force, or an increase in the number of participants in the labor force in a commuting zone, is associated with an increase in the average life expectancy in the first income quartile. Because the results are statistically significant, we can draw conclusions from them. An increase in the labor force in a commuting zone can indicate a growing economy, such as more families moving to a commuting zone. More people working means more people adding to household income and wealth, which can increase their access to healthcare, education, and food.

Multiple Regression Exploration

The exploratory data analysis consistently showed evidence of statistically significant associations between the work and health variables and life expectancy in the first income quartile, which spans the 1st to 25th income percentiles. To further explore these connections that may affect life expectancy in terms of health and work, I am going to create a multiple regression model that uses other explanatory variables, as well as the ones shown prior, to more accurately predict life expectancies within the first income quartile.

Multiple Regression Health Exploration

Data shows that greater expenditures on healthcare result in longer life expectancies, making it clear that health and healthcare play vital roles in living a long life. While we have explored how these variables individually affect life expectancy, I decided to create a multiple regression model to combine these variables of different types. After analyzing the data through simple linear regression models and correlation matrices, I decided to create a multiple regression model that uses the fraction of current smokers in Q1, the fraction of the obese population within Q1, and the percentage of uninsured people in 2010 to predict life expectancy in the first income quartile. I chose these variables not because of strong r-values, but because of their ability to represent different aspects of healthcare. For example, it is a personal decision to smoke, as one is aware of the health consequences. However, it is not always a personal decision to be uninsured. One may have lost a job, and therefore health benefits, or simply be unable to afford health insurance. Exploring both types of variables allows for greater connections between life expectancy and healthcare to be drawn.

Correlation Matrix

To investigate other variables within the dataset that have not been yet explored but may have strong associations with life expectancy, I created a correlation matrix in R to find other variables that strongly affect life expectancy. Looking at the results, the variables that hold the strongest associations with life expectancy, positive or negative, are (see Figure 2.1):

-0.317 Fraction of smokers in the first income quartile
-0.459 Average Medicare expenditure per enrollee within a commuting zone
-0.233 30-day mortality rate for heart attack patients

Out of the ten health-related variables in the table, there is only one positive r-value: 0.0786, the average 30-day mortality rate for heart failure patients. This is a surprising find, suggesting that as the 30-day mortality rate for heart failure patients increases, life expectancy increases, although an r-value this close to zero indicates only a very weak association. These values tell us that within our dataset of health variables, the strongest correlations with life expectancy are negative: average Medicare expenditure per enrollee in a commuting zone, the fraction of smokers, and the 30-day mortality rate for heart attack patients. This tells us that as each of these variables increases (disregarding the 30-day mortality rate for heart failure patients), life expectancy tends to decrease.

Collectively looking at the data, the strongest correlations between predictor variables exist between:

0.3958 Percentage of uninsured people in 2010 and average Medicare expenditure per enrollee
1. As the percentage of uninsured people in 2010 increases, the average Medicare expenditure per enrollee increases
0.7873 30-day hospital mortality rate index and the 30-day mortality rate for heart attacks
1. As the 30-day hospital mortality rate index increases, the 30-day mortality rate for heart attacks increases
0.2698 30-day hospital mortality rate for pneumonia and the percentage of uninsured people in 2010
1. As the 30-day hospital mortality rate for pneumonia increases, the percentage of uninsured people in 2010 increases
0.3037 Fraction of obese people in the first income quartile and average Medicare expenditure per enrollee
1. As the fraction of obese people in the first income quartile increases, average Medicare expenditure per enrollee increases

These r-values are not in relation to the response variable, but rather, in relation to other predictor variables. However, most of these values are low, which is beneficial for the multiple regression model. Stronger associations between explanatory variables can indicate multicollinearity, which can mean a less accurate model. There are a few cases of multicollinearity within the table, which we should investigate further.

The strongest r-value in the table is 0.7873. This value indicates that as the 30-day hospital mortality rate index increases, the 30-day mortality rate for heart attacks increases as well. This is a case of multicollinearity, as the two variables are strongly correlated. The 30-day hospital mortality rate accounts for the other 30-day mortality rate values within the table for heart attacks, heart failure, and pneumonia. To rid the model of cases of multicollinearity, the final model will only use the overall 30-day mortality rate index, as it accounts for these three variables.
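The screening for multicollinearity described above can be sketched as follows. The data here are simulated stand-ins, not the project data; the helper `flag_collinear` and the cutoff of 0.7 are assumptions chosen for illustration.

```r
# Simulated predictors (NOT the project data): x2 is built to track x1 closely.
set.seed(42)
n  <- 300
x1 <- rnorm(n)
x2 <- 0.9 * x1 + rnorm(n, sd = 0.4)
x3 <- rnorm(n)
predictors <- data.frame(x1, x2, x3)

r <- cor(predictors)
print(round(r, 3))

# Hypothetical helper: list predictor pairs whose |r| exceeds a cutoff.
# Only the upper triangle is scanned so each pair appears once.
flag_collinear <- function(r, cutoff = 0.7) {
  idx <- which(abs(r) > cutoff & upper.tri(r), arr.ind = TRUE)
  data.frame(
    var1 = rownames(r)[idx[, "row"]],
    var2 = colnames(r)[idx[, "col"]],
    r    = r[idx]
  )
}

flagged <- flag_collinear(r)
print(flagged)
```

A flagged pair is a candidate for keeping only one of the two variables, as was done here with the overall 30-day mortality rate index.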

Multiple Regression Model

To create a multiple regression model, I chose variables that showed no multicollinearity within the correlation matrix and that showed strong correlations with life expectancy. Creating a multiple regression model comes with trial and error, so I used backward stepwise regression to arrive at the most accurate model. The outputs of these models are in Figures 2.2, 2.3, and 2.4.

**Model 1** (LE) = 83.45 - 3.524(Fraction of Smokers) + 0.4922(Fraction of Obese People) + 2.69(Percentage Uninsured) - 0.0003658(Avg Medicare $ Per Enrollee) - 0.229(30-Day Hospital Mortality Rate Index) - 0.5114(Percent of Medicare Enrollees with at least 1 Primary Care Visit)

**Model 2** (LE) = 83.09 - 3.568(Fraction of Smokers) + 0.4259(Fraction of Obese People) + 2.727(Percentage Uninsured) - 0.0003682(Avg Medicare $ Per Enrollee) - 0.2303(30-Day Hospital Mortality Rate Index)

**Model 3** (LE) = 83.16 - 3.56(Fraction of Smokers) + 2.659(Percentage Uninsured) - 0.0003607(Avg Medicare $ Per Enrollee) - 0.2236(30-day Hospital Mortality Rate Index)

In order to create these models, I utilized the backward elimination method. This approach first creates a model that uses all explanatory variables, then eliminates variables one at a time when their p-value is greater than the alpha value. Once I reached Model 3, I stopped eliminating, as all p-values were statistically significant and the model had the highest adjusted R-squared value of all models. Here, the R-squared value of 0.2873 means that Model 3 explains 28.73% of the variation in life expectancy. The model itself is statistically significant (p < 2.2e-16), meaning that there is a significant association between our final four predictor variables and life expectancy.
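A minimal sketch of backward elimination by p-value is shown below. The helper `backward_eliminate`, the alpha of 0.05, and the simulated data are all assumptions for illustration; this is not the exact code used to produce Models 1 through 3.

```r
# Hypothetical helper: refit after dropping the predictor with the largest
# p-value, until every remaining p-value is at or below alpha.
backward_eliminate <- function(data, response, alpha = 0.05) {
  predictors <- setdiff(names(data), response)
  repeat {
    fit   <- lm(reformulate(predictors, response = response), data = data)
    coefs <- summary(fit)$coefficients
    # p-values for the predictors only (row 1 is the intercept)
    p <- setNames(coefs[-1, "Pr(>|t|)"], rownames(coefs)[-1])
    # Stop when all predictors are significant or only one is left
    if (max(p) <= alpha || length(predictors) == 1) return(fit)
    predictors <- setdiff(predictors, names(which.max(p)))
  }
}

# Simulated data (NOT the project data): y depends on a and b, not on noise.
set.seed(7)
n   <- 200
sim <- data.frame(a = rnorm(n), b = rnorm(n), noise = rnorm(n))
sim$y <- 80 - 2 * sim$a + 1.5 * sim$b + rnorm(n)

final_fit <- backward_eliminate(sim, "y")
print(summary(final_fit)$coefficients)
```

Comparing `summary(fit)$adj.r.squared` across the intermediate fits would mirror the adjusted R-squared comparison made between the three models.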

What kind of relationship does this model show between the variables? To analyze these relationships, we can look at the Y-intercept and coefficients. The Y-intercept of 83.16 tells us that when the values of all of the explanatory variables are equal to zero, the predicted life expectancy is 83.16 years. To better interpret the equation overall, it is helpful to use an example commuting zone, so we are going to use the South Boston commuting zone data in the final regression model:

83.16 - 3.56(0.2524504) + 2.659(0.05015712) - 0.0003607(9840.8985) - 0.2236(-1.130316) = 79.098

Based on our final multiple regression model, the predicted life expectancy of a person in the first income quartile living in the South Boston commuting zone is 79.1 years.
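The same prediction can be computed in R as a sum of coefficient-by-value products. This sketch uses the Model 3 coefficients and the South Boston values quoted above; the short variable names are introduced only for this example.

```r
# Model 3 coefficients as reported above
model3 <- c(intercept       = 83.16,
            smokers         = -3.56,
            uninsured       = 2.659,
            medicare        = -0.0003607,
            mortality_index = -0.2236)

# South Boston values; the intercept is paired with a constant 1
south_boston <- c(intercept       = 1,
                  smokers         = 0.2524504,
                  uninsured       = 0.05015712,
                  medicare        = 9840.8985,
                  mortality_index = -1.130316)

# Match by name so the terms cannot be paired up in the wrong order
predicted <- sum(model3 * south_boston[names(model3)])
print(round(predicted, 3))
```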

It is important to remember the units in which the data was collected when looking at the coefficients. For example, the Average Medicare Expenditure per Enrollee coefficient is quite small compared to the others. This is because this variable is recorded in whole dollars rather than as a fraction. For example, in South Boston, the expenditure per enrollee is 9840.8985, while the fraction of smokers is 0.2525. Because the variable being predicted is life expectancy, each coefficient has to scale its variable to a two digit number of years. When reading the coefficient for the fraction of smokers, keep in mind that the value of -3.56 does not mean that 3.56 years are subtracted from predicted life expectancy when the fraction of smokers increases by 0.01. It is the change associated with an increase of 1 in the fraction, that is, from no smokers to everyone smoking. It is also important to keep in mind that this is a multiple regression output, not a simple linear regression output, which calls for a different interpretation of the coefficients themselves. Using the fraction of smokers as an example, when all other variables in the model are held constant, an increase of 0.01 in the fraction of smokers is associated with a decrease of 3.56(0.01) = 0.0356 years in predicted life expectancy.
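To make the role of units concrete, the sketch below rescales two Model 3 coefficients to more readable step sizes. The step sizes (one percentage point of smokers, 1000 dollars of Medicare expenditure) are choices made for illustration.

```r
# Model 3 coefficients as reported above
coef_smokers  <- -3.56       # per 1.0 change in the fraction of smokers
coef_medicare <- -0.0003607  # per 1 dollar of Medicare expenditure per enrollee

# Change in predicted life expectancy (years), other variables held constant
effect_smokers_1pt   <- coef_smokers * 0.01   # one percentage point more smokers
effect_medicare_1000 <- coef_medicare * 1000  # 1000 dollars more per enrollee

print(effect_smokers_1pt)
print(effect_medicare_1000)
```

On these scales the Medicare coefficient no longer looks small next to the smoking coefficient, which is the point about units made above.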

Looking at the model as a whole, we observe that the residual standard error is 0.9408 on 590 degrees of freedom. This means that the model’s predictions of average life expectancy in the first income quartile are off by about 0.9408 years on average. The importance of this value comes from its ability to compare the accuracy of different regression models. For example, when comparing the residual standard error of Model 3 with Model 1, we see that Model 1 has a residual standard error of 0.9416 years. This is slightly larger than that of Model 3, telling us that Model 3 is a slightly more accurate model, and therefore, more effective.
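The residual standard error can be recomputed from a fitted model's residuals. The sketch below uses R's built-in `mtcars` data as a stand-in, since the project data file is not included here; the model formula is arbitrary.

```r
# Stand-in model on R's built-in mtcars data (NOT the life expectancy data).
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Residual standard error: sqrt(sum of squared residuals / residual df)
rss <- sum(resid(fit)^2)
rse <- sqrt(rss / df.residual(fit))

print(rse)
print(summary(fit)$sigma)  # the same quantity as reported by summary()
```

The residual degrees of freedom are the number of observations minus the number of estimated coefficients, which is how 595 commuting zones and five coefficients give the 590 reported for Model 3.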

Residuals

Analyzing residuals is a way to assess whether the estimates and predictions made by the model are biased.

Figure 2.5 shows a histogram of the residuals from Model 3. The histogram shows that the residuals are centered around zero and are approximately normally distributed, with no strong skew or extreme outliers.

The first graph in Figure 2.6 shows a Residuals vs Fitted plot. The data appears to be equally distributed below and above the zero line, but appears to be banded near the center. The red line shows a bit of a curve, but mostly stays near the zero line. The residuals appear to grow as the fitted values move away from the mean, at around 79.5, and to shrink towards zero as the fitted values move towards the mean.

The second graph in Figure 2.6 shows a Normal Quantile-Quantile plot. The residuals are centered around the line, with a small amount of right skew. There is a left-leaning outlier at the beginning of the plot, but otherwise, this shows that the data follows conditions for normality.
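The three diagnostic plots discussed in this section can be produced with base R. This sketch again uses the built-in `mtcars` data as a stand-in for the project data, so the shapes will differ from Figures 2.5 and 2.6.

```r
# Stand-in model on R's built-in mtcars data (NOT the life expectancy data).
fit <- lm(mpg ~ wt + hp, data = mtcars)
res <- resid(fit)

# Histogram of residuals: should be roughly symmetric around zero
hist(res, main = "Histogram of residuals", xlab = "Residual")

# Residuals vs Fitted: look for curvature or a funnel shape
plot(fit, which = 1)

# Normal Q-Q: points far from the line suggest skew or outliers
plot(fit, which = 2)
```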

Appendix

# import libraries and data
## NEW DATASET
library(ggplot2)
library(Hmisc)

## Loading required package: lattice

## Loading required package: survival

## Loading required package: Formula

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:base':
## 
##     format.pval, units

library(corrplot)

## corrplot 0.92 loaded

library(ggpubr)
library(leaps)
library(olsrr)

## 
## Attaching package: 'olsrr'

## The following object is masked from 'package:datasets':
## 
##     rivers

final = read.csv("~/desktop/lastSet.csv")

Figure 1.1

## `geom_smooth()` using formula 'y ~ x'

## 
## Call:
## lm(formula = final$le_q1_both ~ final$bmi_obese_q1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.3075 -0.7857 -0.0622  0.7178  3.9039 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         79.9237     0.1909 418.578  < 2e-16 ***
## final$bmi_obese_q1  -2.3808     0.6161  -3.864 0.000124 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.102 on 593 degrees of freedom
## Multiple R-squared:  0.02456,    Adjusted R-squared:  0.02292 
## F-statistic: 14.93 on 1 and 593 DF,  p-value: 0.0001238

Figure 1.2

## `geom_smooth()` using formula 'y ~ x'

## 
## Call:
## lm(formula = final$le_q2_both ~ final$bmi_obese_q2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.3177 -0.6744 -0.0320  0.6411  3.6904 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         83.3790     0.1594 523.073  < 2e-16 ***
## final$bmi_obese_q2  -3.2225     0.5711  -5.643 2.59e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.02 on 593 degrees of freedom
## Multiple R-squared:  0.05096,    Adjusted R-squared:  0.04936 
## F-statistic: 31.84 on 1 and 593 DF,  p-value: 2.593e-08

Figure 1.3

## `geom_smooth()` using formula 'y ~ x'

## 
## Call:
## lm(formula = final$le_q3_both ~ final$bmi_obese_q3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.9935 -0.6226 -0.0844  0.6754  3.2126 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         85.0321     0.1275 667.155  < 2e-16 ***
## final$bmi_obese_q3  -1.9639     0.4576  -4.292 2.07e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.005 on 593 degrees of freedom
## Multiple R-squared:  0.03013,    Adjusted R-squared:  0.02849 
## F-statistic: 18.42 on 1 and 593 DF,  p-value: 2.07e-05

Figure 1.4

## `geom_smooth()` using formula 'y ~ x'

## 
## Call:
## lm(formula = final$le_q4_both ~ final$bmi_obese_q4)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.2980 -0.5588  0.0420  0.6323  3.9649 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         86.9885     0.1301 668.815  < 2e-16 ***
## final$bmi_obese_q4  -3.1784     0.5375  -5.913 5.67e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.056 on 593 degrees of freedom
## Multiple R-squared:  0.05568,    Adjusted R-squared:  0.05409 
## F-statistic: 34.97 on 1 and 593 DF,  p-value: 5.669e-09

Figure 1.5

## `geom_smooth()` using formula 'y ~ x'

## 
## Call:
## lm(formula = final$le_q1_both ~ final$cur_smoke_q1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.0725 -0.7282 -0.0009  0.6676  3.8164 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          80.689      0.187 431.459  < 2e-16 ***
## final$cur_smoke_q1   -5.352      0.657  -8.146 2.24e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.058 on 593 degrees of freedom
## Multiple R-squared:  0.1006, Adjusted R-squared:  0.09912 
## F-statistic: 66.35 on 1 and 593 DF,  p-value: 2.238e-15

Figure 1.6

## `geom_smooth()` using formula 'y ~ x'

## 
## Call:
## lm(formula = final$le_q2_both ~ final$cur_smoke_q2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.2546 -0.6472 -0.0169  0.6617  3.7749 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         83.1997     0.1692 491.760  < 2e-16 ***
## final$cur_smoke_q2  -3.1559     0.7506  -4.204 3.02e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.032 on 593 degrees of freedom
## Multiple R-squared:  0.02895,    Adjusted R-squared:  0.02731 
## F-statistic: 17.68 on 1 and 593 DF,  p-value: 3.022e-05

Figure 1.7

## `geom_smooth()` using formula 'y ~ x'

## 
## Call:
## lm(formula = final$le_q3_both ~ final$cur_smoke_q3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.0392 -0.6268 -0.0738  0.6190  3.1364 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         84.8161     0.1060 800.274  < 2e-16 ***
## final$cur_smoke_q3  -1.7895     0.5785  -3.093  0.00207 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.012 on 593 degrees of freedom
## Multiple R-squared:  0.01588,    Adjusted R-squared:  0.01422 
## F-statistic: 9.569 on 1 and 593 DF,  p-value: 0.002072

Figure 1.8

## `geom_smooth()` using formula 'y ~ x'

## 
## Call:
## lm(formula = final$le_q4_both ~ final$cur_smoke_q4)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.7793 -0.5714  0.0611  0.6603  4.2660 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         86.7688     0.1030 842.505  < 2e-16 ***
## final$cur_smoke_q4  -4.1519     0.7667  -5.416 8.88e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.061 on 593 degrees of freedom
## Multiple R-squared:  0.04713,    Adjusted R-squared:  0.04552 
## F-statistic: 29.33 on 1 and 593 DF,  p-value: 8.884e-08

Figure 1.9

## `geom_smooth()` using formula 'y ~ x'

## 
## Call:
## lm(formula = final$le_q1_both ~ final$puninsured2010)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.3627 -0.7913 -0.0691  0.7154  4.3291 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           79.6307     0.1707 466.528   <2e-16 ***
## final$puninsured2010  -2.3431     0.9094  -2.577   0.0102 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.109 on 593 degrees of freedom
## Multiple R-squared:  0.01107,    Adjusted R-squared:  0.009404 
## F-statistic: 6.639 on 1 and 593 DF,  p-value: 0.01022

Figure 1.10

## `geom_smooth()` using formula 'y ~ x'

## 
## Call:
## lm(formula = final$le_q2_both ~ final$puninsured2010)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.0798 -0.6926 -0.1000  0.6203  3.4446 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           83.8513     0.1506 556.670   <2e-16 ***
## final$puninsured2010  -7.4084     0.8025  -9.232   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9788 on 593 degrees of freedom
## Multiple R-squared:  0.1257, Adjusted R-squared:  0.1242 
## F-statistic: 85.22 on 1 and 593 DF,  p-value: < 2.2e-16

Figure 1.11

## `geom_smooth()` using formula 'y ~ x'

## 
## Call:
## lm(formula = final$le_q3_both ~ final$puninsured2010)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.6541 -0.6009 -0.0622  0.5867  3.1488 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           86.1536     0.1406   612.8   <2e-16 ***
## final$puninsured2010  -9.0602     0.7490   -12.1   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9135 on 593 degrees of freedom
## Multiple R-squared:  0.1979, Adjusted R-squared:  0.1966 
## F-statistic: 146.3 on 1 and 593 DF,  p-value: < 2.2e-16

Figure 1.12

## `geom_smooth()` using formula 'y ~ x'

## 
## Call:
## lm(formula = final$le_q4_both ~ final$puninsured2010)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8566 -0.5299  0.0114  0.5555  4.0240 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           87.7835     0.1542  569.33   <2e-16 ***
## final$puninsured2010  -8.4030     0.8214  -10.23   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.002 on 593 degrees of freedom
## Multiple R-squared:   0.15,  Adjusted R-squared:  0.1486 
## F-statistic: 104.6 on 1 and 593 DF,  p-value: < 2.2e-16

Figure 1.13

## `geom_smooth()` using formula 'y ~ x'

## 
## Call:
## lm(formula = final$le_q1_both ~ final$cs_elf_ind_man)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.7457 -0.7579 -0.0151  0.7064  3.9458 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           79.5143     0.1034 768.778  < 2e-16 ***
## final$cs_elf_ind_man  -1.9435     0.5876  -3.307 0.000999 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.105 on 593 degrees of freedom
## Multiple R-squared:  0.01811,    Adjusted R-squared:  0.01646 
## F-statistic: 10.94 on 1 and 593 DF,  p-value: 0.0009991

Figure 1.14

## `geom_smooth()` using formula 'y ~ x'

## 
## Call:
## lm(formula = final$le_q2_both ~ final$cs_elf_ind_man)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5459 -0.7037 -0.0371  0.6585  3.9001 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          82.48242    0.09795 842.047   <2e-16 ***
## final$cs_elf_ind_man  0.18055    0.55654   0.324    0.746    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.047 on 593 degrees of freedom
## Multiple R-squared:  0.0001775,  Adjusted R-squared:  -0.001509 
## F-statistic: 0.1052 on 1 and 593 DF,  p-value: 0.7457

Figure 1.15

## `geom_smooth()` using formula 'y ~ x'

## 
## Call:
## lm(formula = final$le_q3_both ~ final$cs_elf_ind_man)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.0168 -0.6572 -0.0651  0.6578  3.2674 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          84.42843    0.09538 885.181   <2e-16 ***
## final$cs_elf_ind_man  0.54356    0.54191   1.003    0.316    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.019 on 593 degrees of freedom
## Multiple R-squared:  0.001694,   Adjusted R-squared:  1.024e-05 
## F-statistic: 1.006 on 1 and 593 DF,  p-value: 0.3163

Figure 1.16

## `geom_smooth()` using formula 'y ~ x'

## 
## Call:
## lm(formula = final$le_q4_both ~ final$cs_elf_ind_man)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.7449 -0.5924  0.0411  0.6854  3.9779 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           86.4027     0.1015 851.240   <2e-16 ***
## final$cs_elf_ind_man  -0.8812     0.5767  -1.528    0.127    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.085 on 593 degrees of freedom
## Multiple R-squared:  0.003922,   Adjusted R-squared:  0.002242 
## F-statistic: 2.335 on 1 and 593 DF,  p-value: 0.1271

Figure 1.17

## `geom_smooth()` using formula 'y ~ x'

## 
## Call:
## lm(formula = final$le_q1_both ~ final$cs_labforce)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.3833 -0.7817 -0.0513  0.7104  3.8663 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        77.9314     0.4863 160.241  < 2e-16 ***
## final$cs_labforce   2.0669     0.7847   2.634  0.00866 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.109 on 593 degrees of freedom
## Multiple R-squared:  0.01156,    Adjusted R-squared:  0.009898 
## F-statistic: 6.938 on 1 and 593 DF,  p-value: 0.008658

Figure 1.18

## `geom_smooth()` using formula 'y ~ x'

## 
## Call:
## lm(formula = final$le_q2_both ~ final$cs_labforce)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8832 -0.6556 -0.0454  0.6484  3.6667 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        79.4858     0.4418 179.902  < 2e-16 ***
## final$cs_labforce   4.9025     0.7129   6.877 1.55e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.007 on 593 degrees of freedom
## Multiple R-squared:  0.07386,    Adjusted R-squared:  0.0723 
## F-statistic:  47.3 on 1 and 593 DF,  p-value: 1.553e-11

Figure 1.19

## `geom_smooth()` using formula 'y ~ x'

## 
## Call:
## lm(formula = final$le_q3_both ~ final$cs_labforce)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5615 -0.6200 -0.0168  0.6492  2.8993 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        81.0273     0.4236 191.269  < 2e-16 ***
## final$cs_labforce   5.6511     0.6835   8.268 8.98e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9659 on 593 degrees of freedom
## Multiple R-squared:  0.1034, Adjusted R-squared:  0.1018 
## F-statistic: 68.36 on 1 and 593 DF,  p-value: 8.98e-16

Figure 1.20

## `geom_smooth()` using formula 'y ~ x'

## 
## Call:
## lm(formula = final$le_q4_both ~ final$cs_labforce)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5042 -0.5586  0.0028  0.6090  5.3908 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        81.8176     0.4399  185.97   <2e-16 ***
## final$cs_labforce   7.2043     0.7098   10.15   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.003 on 593 degrees of freedom
## Multiple R-squared:  0.148,  Adjusted R-squared:  0.1466 
## F-statistic:   103 on 1 and 593 DF,  p-value: < 2.2e-16

Figure 1.21

## `geom_smooth()` using formula 'y ~ x'

## 
## Call:
## lm(formula = final$le_q1_both ~ final$frac_traveltime_lt15)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8172 -0.7733 -0.0577  0.7453  4.1939 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 78.7829     0.1758 448.137   <2e-16 ***
## final$frac_traveltime_lt15   1.0295     0.4124   2.496   0.0128 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.11 on 593 degrees of freedom
## Multiple R-squared:  0.0104, Adjusted R-squared:  0.008732 
## F-statistic: 6.232 on 1 and 593 DF,  p-value: 0.01281

Figure 1.22

## `geom_smooth()` using formula 'y ~ x'

## 
## Call:
## lm(formula = final$le_q2_both ~ final$frac_traveltime_lt15)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.1040 -0.5849  0.0165  0.6396  3.2101 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 81.3830     0.1588 512.564  < 2e-16 ***
## final$frac_traveltime_lt15   2.7391     0.3724   7.355 6.39e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.002 on 593 degrees of freedom
## Multiple R-squared:  0.08359,    Adjusted R-squared:  0.08205 
## F-statistic: 54.09 on 1 and 593 DF,  p-value: 6.394e-13

Figure 1.23

## `geom_smooth()` using formula 'y ~ x'

## 
## Call:
## lm(formula = final$le_q3_both ~ final$frac_traveltime_lt15)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.8464 -0.5889 -0.0008  0.6496  2.9954 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 83.3716     0.1541 540.856  < 2e-16 ***
## final$frac_traveltime_lt15   2.7752     0.3616   7.675 6.82e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9729 on 593 degrees of freedom
## Multiple R-squared:  0.09037,    Adjusted R-squared:  0.08883 
## F-statistic: 58.91 on 1 and 593 DF,  p-value: 6.823e-14

Figure 1.24

## `geom_smooth()` using formula 'y ~ x'

## 
## Call:
## lm(formula = final$le_q4_both ~ final$frac_traveltime_lt15)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.7787 -0.5930  0.0876  0.7076  4.1172 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 85.8372     0.1712 501.286   <2e-16 ***
## final$frac_traveltime_lt15   1.0346     0.4017   2.576   0.0102 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.081 on 593 degrees of freedom
## Multiple R-squared:  0.01107,    Adjusted R-squared:  0.009397 
## F-statistic: 6.635 on 1 and 593 DF,  p-value: 0.01024

Figure 1.25

## `geom_smooth()` using formula 'y ~ x'

## 
## Call:
## lm(formula = final$le_q1_both ~ final$lf_d_2000_1980)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2036 -0.7436 -0.1100  0.6508  3.3316 
## 
## Coefficients:
##                      Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)          78.91462    0.06307 1251.241  < 2e-16 ***
## final$lf_d_2000_1980  1.04686    0.16127    6.492  1.8e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.078 on 593 degrees of freedom
## Multiple R-squared:  0.06635,    Adjusted R-squared:  0.06477 
## F-statistic: 42.14 on 1 and 593 DF,  p-value: 1.797e-10

Figure 1.26

## `geom_smooth()` using formula 'y ~ x'

## 
## Call:
## lm(formula = final$le_q2_both ~ final$lf_d_2000_1980)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.3779 -0.6870 -0.0533  0.6550  3.9904 
## 
## Coefficients:
##                      Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)          82.39775    0.06091 1352.732  < 2e-16 ***
## final$lf_d_2000_1980  0.40574    0.15575    2.605  0.00942 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.041 on 593 degrees of freedom
## Multiple R-squared:  0.01131,    Adjusted R-squared:  0.009647 
## F-statistic: 6.786 on 1 and 593 DF,  p-value: 0.009416

Figure 1.27

## `geom_smooth()` using formula 'y ~ x'

## 
## Call:
## lm(formula = final$le_q3_both ~ final$lf_d_2000_1980)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.1641 -0.6435 -0.0569  0.6591  3.2558 
## 
## Coefficients:
##                      Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)          84.47788    0.05966 1416.044   <2e-16 ***
## final$lf_d_2000_1980  0.13098    0.15254    0.859    0.391    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.019 on 593 degrees of freedom
## Multiple R-squared:  0.001242,   Adjusted R-squared:  -0.0004425 
## F-statistic: 0.7373 on 1 and 593 DF,  p-value: 0.3909

Figure 1.28

## `geom_smooth()` using formula 'y ~ x'

## 
## Call:
## lm(formula = final$le_q4_both ~ final$lf_d_2000_1980)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.9783 -0.5713 -0.0077  0.6938  4.0578 
## 
## Coefficients:
##                      Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)          86.11023    0.06298 1367.159  < 2e-16 ***
## final$lf_d_2000_1980  0.54831    0.16105    3.405 0.000707 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.076 on 593 degrees of freedom
## Multiple R-squared:  0.01917,    Adjusted R-squared:  0.01752 
## F-statistic: 11.59 on 1 and 593 DF,  p-value: 0.0007074

Figure 2.1

##                          le_q1_both cur_smoke_q1 bmi_obese_q1 puninsured2010
## le_q1_both               1.00000000  -0.31722614  -0.15672168   -0.105220919
## cur_smoke_q1            -0.31722614   1.00000000   0.09334105   -0.033256781
## bmi_obese_q1            -0.15672168   0.09334105   1.00000000    0.081624526
## puninsured2010          -0.10522092  -0.03325678   0.08162453    1.000000000
## mort_30day_hosp_z       -0.17842438  -0.01965585   0.19193549    0.296073941
## primcarevis_10          -0.14294904   0.15613247   0.23262032    0.004962686
## adjmortmeas_amiall30day -0.23354605   0.02288609   0.23348424    0.299229203
## reimb_penroll_adj10     -0.45946478   0.23855328   0.30372024    0.395783327
## adjmortmeas_pnall30day  -0.20703622   0.02743093   0.24202417    0.269819905
## adjmortmeas_chfall30day  0.07861943  -0.11364151  -0.08627204    0.070133719
##                         mort_30day_hosp_z primcarevis_10
## le_q1_both                    -0.17842438   -0.142949039
## cur_smoke_q1                  -0.01965585    0.156132474
## bmi_obese_q1                   0.19193549    0.232620319
## puninsured2010                 0.29607394    0.004962686
## mort_30day_hosp_z              1.00000000    0.064954040
## primcarevis_10                 0.06495404    1.000000000
## adjmortmeas_amiall30day        0.78733012    0.069136537
## reimb_penroll_adj10            0.06539899    0.181809340
## adjmortmeas_pnall30day         0.76503413    0.107289716
## adjmortmeas_chfall30day        0.69470222   -0.048158934
##                         adjmortmeas_amiall30day reimb_penroll_adj10
## le_q1_both                          -0.23354605         -0.45946478
## cur_smoke_q1                         0.02288609          0.23855328
## bmi_obese_q1                         0.23348424          0.30372024
## puninsured2010                       0.29922920          0.39578333
## mort_30day_hosp_z                    0.78733012          0.06539899
## primcarevis_10                       0.06913654          0.18180934
## adjmortmeas_amiall30day              1.00000000          0.23113923
## reimb_penroll_adj10                  0.23113923          1.00000000
## adjmortmeas_pnall30day               0.37405466          0.17740663
## adjmortmeas_chfall30day              0.35523656         -0.33478321
##                         adjmortmeas_pnall30day adjmortmeas_chfall30day
## le_q1_both                         -0.20703622              0.07861943
## cur_smoke_q1                        0.02743093             -0.11364151
## bmi_obese_q1                        0.24202417             -0.08627204
## puninsured2010                      0.26981990              0.07013372
## mort_30day_hosp_z                   0.76503413              0.69470222
## primcarevis_10                      0.10728972             -0.04815893
## adjmortmeas_amiall30day             0.37405466              0.35523656
## reimb_penroll_adj10                 0.17740663             -0.33478321
## adjmortmeas_pnall30day              1.00000000              0.30486138
## adjmortmeas_chfall30day             0.30486138              1.00000000
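In the matrix above, mort_30day_hosp_z is strongly correlated with the three adjmortmeas measures (about 0.79, 0.77, and 0.69). A matrix like this can be scanned for such pairs programmatically. The sketch below is an assumption-laden illustration: the data are simulated, only three of the column names are reused, and the cutoff of 0.7 is arbitrary.

```r
# Simulated stand-in with one strongly correlated pair; values are hypothetical.
set.seed(5)
n <- 595
z <- rnorm(n)
sim <- data.frame(
  mort_30day_hosp_z       = z,
  adjmortmeas_amiall30day = 0.8 * z + rnorm(n, sd = 0.6),
  cur_smoke_q1            = runif(n, 0.1, 0.45)
)

# Pearson correlation matrix, rounded for display.
cm <- cor(sim)
round(cm, 2)

# List each pair (upper triangle only) whose absolute correlation
# exceeds the chosen cutoff.
high <- which(abs(cm) > 0.7 & upper.tri(cm), arr.ind = TRUE)
data.frame(
  var1 = rownames(cm)[high[, "row"]],
  var2 = colnames(cm)[high[, "col"]],
  r    = cm[high]
)
```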

Figure 2.2 - Health MR Model 1

## Multiple Regression Model 1
hlth1 = lm(le_q1_both ~ cur_smoke_q1 + bmi_obese_q1 + puninsured2010 + reimb_penroll_adj10 + mort_30day_hosp_z + primcarevis_10, data = final)
summary(hlth1)

## 
## Call:
## lm(formula = le_q1_both ~ cur_smoke_q1 + bmi_obese_q1 + puninsured2010 + 
##     reimb_penroll_adj10 + mort_30day_hosp_z + primcarevis_10, 
##     data = final)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5706 -0.6231  0.0322  0.5838  3.1122 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          8.345e+01  6.256e-01 133.405  < 2e-16 ***
## cur_smoke_q1        -3.524e+00  6.122e-01  -5.756 1.39e-08 ***
## bmi_obese_q1         4.922e-01  5.739e-01   0.858  0.39142    
## puninsured2010       2.690e+00  8.946e-01   3.007  0.00275 ** 
## reimb_penroll_adj10 -3.658e-04  3.366e-05 -10.865  < 2e-16 ***
## mort_30day_hosp_z   -2.290e-01  4.399e-02  -5.206 2.67e-07 ***
## primcarevis_10      -5.114e-01  7.706e-01  -0.664  0.50720    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9416 on 588 degrees of freedom
## Multiple R-squared:  0.2933, Adjusted R-squared:  0.2861 
## F-statistic: 40.68 on 6 and 588 DF,  p-value: < 2.2e-16

Figure 2.3 - Health MR Model 2

## Multiple Regression Model 2
hlth2 = lm(le_q1_both ~ cur_smoke_q1 + bmi_obese_q1 + puninsured2010 + reimb_penroll_adj10 + mort_30day_hosp_z, data = final)
summary(hlth2)

## 
## Call:
## lm(formula = le_q1_both ~ cur_smoke_q1 + bmi_obese_q1 + puninsured2010 + 
##     reimb_penroll_adj10 + mort_30day_hosp_z, data = final)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.6091 -0.6354  0.0285  0.5806  3.1081 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          8.309e+01  2.997e-01 277.249  < 2e-16 ***
## cur_smoke_q1        -3.568e+00  6.083e-01  -5.865 7.51e-09 ***
## bmi_obese_q1         4.259e-01  5.649e-01   0.754  0.45116    
## puninsured2010       2.727e+00  8.924e-01   3.056  0.00235 ** 
## reimb_penroll_adj10 -3.682e-04  3.345e-05 -11.009  < 2e-16 ***
## mort_30day_hosp_z   -2.303e-01  4.393e-02  -5.242 2.21e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9411 on 589 degrees of freedom
## Multiple R-squared:  0.2928, Adjusted R-squared:  0.2868 
## F-statistic: 48.77 on 5 and 589 DF,  p-value: < 2.2e-16

Figure 2.4 - Health MR Model 3

## Multiple Regression Model 3
hlthmr3 = lm(le_q1_both ~ cur_smoke_q1 + puninsured2010 + reimb_penroll_adj10 + mort_30day_hosp_z, data = final)
summary(hlthmr3)

## 
## Call:
## lm(formula = le_q1_both ~ cur_smoke_q1 + puninsured2010 + reimb_penroll_adj10 + 
##     mort_30day_hosp_z, data = final)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.7311 -0.6328  0.0257  0.6009  3.0756 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          8.316e+01  2.867e-01 290.072  < 2e-16 ***
## cur_smoke_q1        -3.560e+00  6.080e-01  -5.856 7.89e-09 ***
## puninsured2010       2.659e+00  8.876e-01   2.996  0.00285 ** 
## reimb_penroll_adj10 -3.607e-04  3.192e-05 -11.302  < 2e-16 ***
## mort_30day_hosp_z   -2.236e-01  4.300e-02  -5.200 2.76e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9408 on 590 degrees of freedom
## Multiple R-squared:  0.2921, Adjusted R-squared:  0.2873 
## F-statistic: 60.86 on 4 and 590 DF,  p-value: < 2.2e-16
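Models 1 to 3 are nested: each drops one non-significant predictor from the previous model, and the adjusted R-squared rises slightly from 0.2861 to 0.2868 to 0.2873. A nested comparison of this kind can be made explicit with a partial F-test. The sketch below uses simulated data in which bmi_obese_q1 has no true effect; the data frame `sim` and the coefficients used to generate it are hypothetical.

```r
# Simulated stand-in for the OI data; values are hypothetical.
set.seed(4)
n <- 595
sim <- data.frame(
  cur_smoke_q1   = runif(n, 0.1, 0.45),
  bmi_obese_q1   = runif(n, 0.15, 0.45),
  puninsured2010 = runif(n, 0.05, 0.35)
)
# bmi_obese_q1 has no true effect in this simulation.
sim$le_q1_both <- 81 - 4 * sim$cur_smoke_q1 + 2 * sim$puninsured2010 + rnorm(n)

full    <- lm(le_q1_both ~ cur_smoke_q1 + bmi_obese_q1 + puninsured2010,
              data = sim)
reduced <- update(full, . ~ . - bmi_obese_q1)

# Partial F-test: does dropping bmi_obese_q1 significantly worsen the fit?
cmp <- anova(reduced, full)
cmp

# Adjusted R-squared penalizes extra predictors, so it can rise
# when a weak predictor is removed.
c(full    = summary(full)$adj.r.squared,
  reduced = summary(reduced)$adj.r.squared)
```

When only one predictor is dropped, the p-value of the partial F-test equals the p-value of that predictor's t-test in the larger model, so the two approaches agree.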

LE Mini Project 2