R Project Final

Problem Statement

Using a health insurance data set, my objective is to explore whether age or education level plays a significant role in whether a person has health insurance. The data set records each individual's age, highest education attained, and health insurance status, along with several other attributes.

Data Exploration

library(ggplot2)
library(readr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

setwd("C:\\Users\\mikha\\OneDrive\\Documents")

data <- read.csv("HealthInsurance.csv")

summary(data)

##        X           health               age           limit          
##  Min.   :   1   Length:8802        Min.   :18.00   Length:8802       
##  1st Qu.:2201   Class :character   1st Qu.:30.00   Class :character  
##  Median :4402   Mode  :character   Median :39.00   Mode  :character  
##  Mean   :4402                      Mean   :38.94                     
##  3rd Qu.:6602                      3rd Qu.:48.00                     
##  Max.   :8802                      Max.   :62.00                     
##     gender           insurance           married            selfemp         
##  Length:8802        Length:8802        Length:8802        Length:8802       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##      family          region           ethnicity          education        
##  Min.   : 1.000   Length:8802        Length:8802        Length:8802       
##  1st Qu.: 2.000   Class :character   Class :character   Class :character  
##  Median : 3.000   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 3.094                                                           
##  3rd Qu.: 4.000                                                           
##  Max.   :14.000

head(data, 10)

mean_age <- mean(data$age)

median_age <- median(data$age)

mean_family <- mean(data$family)

median_family <- median(data$family)

quant25_age <- quantile(data$age, probs = .25)

quant75_age <- quantile(data$age, probs = .75)

quant25_family <- quantile(data$family, probs = .25)

quant75_family <- quantile(data$family, probs = .75)

mean_age

## [1] 38.93683

median_age

## [1] 39

mean_family

## [1] 3.093501

median_family

## [1] 3

quant25_age

## 25% 
##  30

quant75_age

## 75% 
##  48

quant25_family

## 25% 
##   2

quant75_family

## 75% 
##   4

Here I looked at the two numeric variables of interest, age and family size. For both variables the mean and median are nearly identical (38.94 vs. 39 for age, 3.09 vs. 3 for family size), which suggests that their distributions are roughly symmetric, although the maximum family size of 14 points to a right tail.
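Comparing the mean and median is only a rough check of symmetry. As a minimal sketch, the helper below (a hypothetical function, `describe_symmetry`, not part of the original analysis) also computes a moment-based sample skewness, which is close to 0 for a symmetric sample and positive for a right-skewed one. The vectors are toy values for illustration, not values from HealthInsurance.csv.

```r
# Hypothetical helper: compare mean and median and compute a
# moment-based sample skewness for a numeric vector.
describe_symmetry <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  s <- sqrt(mean((x - m)^2))            # population-style standard deviation
  c(mean = m,
    median = median(x),
    skewness = mean((x - m)^3) / s^3)   # about 0 for a symmetric sample
}

# Toy vectors for demonstration only.
symmetric_toy <- c(1, 2, 3, 4, 5)
right_skewed_toy <- c(1, 1, 2, 2, 3, 14)

describe_symmetry(symmetric_toy)
describe_symmetry(right_skewed_toy)
```

With the real data, the same check would be `describe_symmetry(data$age)` and `describe_symmetry(data$family)`.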

Data Manipulation

Here I used the dplyr library to rename several columns whose names were vague. I also created subsets for age and education to simplify the data and make it easier to graph.

data_new <- data %>%
  mutate(Insurance = insurance, Self_Employed = selfemp, Age = age,
         Family_size = family, Ethnicity = ethnicity,
         Highest_Education_Attained = education, Region_in_US = region,
         Married = married, Healthy = health) %>%
  select(-insurance, -selfemp, -age, -family, -ethnicity, -education,
         -region, -married, -health)


age_subset <- select(data_new, Age, Insurance)
head(age_subset, 10)

education_subset <- select(data_new, Highest_Education_Attained, Insurance)
head(education_subset, 10)
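The renaming step above copies each column with `mutate()` and then drops the originals with `select()`. A possible alternative is dplyr's `rename()`, which changes the names in place and keeps the column order. The sketch below uses a small made-up data frame (`toy`) with only three of the columns, purely for illustration.

```r
library(dplyr)

# Toy data frame with a few of the original column names; values are made up.
toy <- data.frame(
  insurance = c("yes", "no"),
  age = c(31, 45),
  education = c("highschool", "bachelor")
)

# rename(new_name = old_name) renames in place and preserves column order.
toy_renamed <- toy %>%
  rename(
    Insurance = insurance,
    Age = age,
    Highest_Education_Attained = education
  )

names(toy_renamed)
```

One difference to keep in mind: with `mutate()` followed by `select()`, the renamed columns end up at the end of the data frame, while `rename()` leaves them where they were.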

Data Visualization

Below I used bar charts to represent my data. The first chart shows the count of individuals at each education level, and the remaining charts show the share of insured and uninsured individuals within each education level and age group.

ggplot(education_subset, aes(x = Highest_Education_Attained, fill = Insurance)) +
  geom_bar(position = "stack") +
  labs(title = "Education Distribution by Insurance",
       x = "Highest Education Attained", y = "Count", fill = "Insurance")

ggplot(education_subset, aes(x = Highest_Education_Attained, fill = Insurance)) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::percent) +  # show proportions as percentages
  labs(title = "Education Level vs Insurance (Percentage)",
       x = "Highest Education Attained", y = "Percentage", fill = "Insurance")

# Bin ages into groups; cut() uses right-closed intervals, e.g. (20, 30] is "21-30"
data_new$Age_Distribution <- cut(data_new$Age,
                                 breaks = c(0, 20, 30, 40, 50, 60, Inf),
                                 labels = c("0-20", "21-30", "31-40", "41-50", "51-60", "61+"))

ggplot(data_new, aes(x = Age_Distribution, fill = Insurance)) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::percent) +  # show proportions as percentages
  labs(title = "Age vs Insurance (Percentage)",
       x = "Age Distribution", y = "Percentage", fill = "Insurance")
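The percentage charts can also be backed by a table of proportions. The sketch below shows how `cut()` assigns ages to the groups used above and how `prop.table()` gives the share of insured and uninsured individuals within each group. The data frame `toy` is made up for illustration and does not come from HealthInsurance.csv.

```r
# Toy data for illustration only; values are made up.
toy <- data.frame(
  Age = c(18, 20, 21, 35, 47, 55, 62),
  Insurance = c("no", "yes", "no", "yes", "yes", "yes", "yes")
)

# cut() uses right-closed intervals by default, so 20 falls in "0-20"
# and 21 falls in "21-30".
toy$Age_Distribution <- cut(
  toy$Age,
  breaks = c(0, 20, 30, 40, 50, 60, Inf),
  labels = c("0-20", "21-30", "31-40", "41-50", "51-60", "61+")
)

# Row-wise proportions: share of "no" and "yes" within each age group.
counts <- table(toy$Age_Distribution, toy$Insurance)
props <- prop.table(counts, margin = 1)
round(props, 2)
```

On the real data, the equivalent call would be `prop.table(table(data_new$Age_Distribution, data_new$Insurance), margin = 1)`.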

I could not read the data file from GitHub because the file is too large, even though the code itself is correct. The commented-out code below shows how I attempted it.

## github_url <- "https://raw.githubusercontent.com/MAB592/R-Final-Project/main/HealthInsurance.csv"
## git_data <- read.csv(github_url)
## head(git_data, 10)

Conclusion

Based on the data, both age and education appear to be positively associated with having insurance. The stacked bar chart shows that high school graduates form the largest education group in this dataset, and the age summary places the middle half of individuals between 30 and 48.

To gain deeper insight, I also plotted the data as percentages. The education chart shows that individuals with higher levels of education are more likely to have insurance coverage than those with lower levels of education. The age chart shows a steady increase in insurance coverage across age groups until the 40s, where the trend appears to level off.

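Overall, the plots suggest that insurance coverage rises with both age and education level. Because this conclusion rests on visual comparison, a formal test would be needed to confirm that the differences are statistically significant.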
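The problem statement asks whether age or education plays a significant role, and one way to examine that is a chi-squared test of independence together with a logistic regression. The sketch below is only an illustration under stated assumptions: it runs on synthetic data generated with made-up coefficients, not on HealthInsurance.csv, and the names `sim` and `fit` are hypothetical.

```r
set.seed(1)  # synthetic data only; these numbers are not from HealthInsurance.csv

n <- 500
sim <- data.frame(
  Age = sample(18:62, n, replace = TRUE),
  Education = factor(sample(c("highschool", "bachelor"), n, replace = TRUE))
)

# Assumed (made-up) model used only to generate labels for the demo.
p <- plogis(-1 + 0.03 * sim$Age + 0.5 * (sim$Education == "bachelor"))
sim$Insurance <- factor(ifelse(runif(n) < p, "yes", "no"), levels = c("no", "yes"))

# Chi-squared test of independence between education and insurance status.
chisq.test(table(sim$Education, sim$Insurance))

# Logistic regression: models the log-odds of being insured ("yes")
# as a function of age and education together.
fit <- glm(Insurance ~ Age + Education, data = sim, family = binomial)
summary(fit)$coefficients
```

On the real data, the analogous model would be `glm(factor(Insurance) ~ Age + Highest_Education_Attained, data = data_new, family = binomial)`, assuming `Insurance` takes two values.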