Using this health insurance data set, my objective is to explore whether age or education level plays a significant role in whether a person has health insurance. The data set provides each individual's age, highest education attained, and health insurance status, along with several other demographic variables.
library(ggplot2)
library(readr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
setwd("C:\\Users\\mikha\\OneDrive\\Documents")
data <- read.csv("HealthInsurance.csv")
summary(data)
##        X           health               age           limit
##  Min.   :   1   Length:8802        Min.   :18.00   Length:8802
##  1st Qu.:2201   Class :character   1st Qu.:30.00   Class :character
##  Median :4402   Mode  :character   Median :39.00   Mode  :character
##  Mean   :4402                      Mean   :38.94
##  3rd Qu.:6602                      3rd Qu.:48.00
##  Max.   :8802                      Max.   :62.00
##     gender           insurance           married            selfemp
##  Length:8802        Length:8802        Length:8802        Length:8802
##  Class :character   Class :character   Class :character   Class :character
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character
##
##
##
##      family          region           ethnicity          education
##  Min.   : 1.000   Length:8802        Length:8802        Length:8802
##  1st Qu.: 2.000   Class :character   Class :character   Class :character
##  Median : 3.000   Mode  :character   Mode  :character   Mode  :character
##  Mean   : 3.094
##  3rd Qu.: 4.000
##  Max.   :14.000
head (data, 10)
mean_age <- mean(data$age)
median_age <- median(data$age)
mean_family <- mean(data$family)
median_family <- median(data$family)
quant25_age <- quantile(data$age, probs = .25)
quant75_age <- quantile(data$age, probs = .75)
quant25_family <- quantile(data$family, probs = .25)
quant75_family <- quantile(data$family, probs = .75)
mean_age
## [1] 38.93683
median_age
## [1] 39
mean_family
## [1] 3.093501
median_family
## [1] 3
quant25_age
## 25%
## 30
quant75_age
## 75%
## 48
quant25_family
## 25%
## 2
quant75_family
## 75%
## 4
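Here I looked at the only two numeric variables of interest, age and family size (X is just a row index). For each variable the mean and median are almost identical (38.94 vs. 39 for age, 3.09 vs. 3 for family size), and the quartiles sit evenly around the median. This suggests that both distributions are roughly symmetric around their centers, although the maximum family size of 14 indicates that family size has a longer right tail.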
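As a complement to comparing the mean and median, symmetry can also be checked from the quartiles alone. The sketch below uses the quartile (Bowley) skewness, which is 0 when the quartiles sit evenly around the median and positive when the upper half is more spread out. The function name `quartile_skew` and both toy vectors are hypothetical and are not taken from HealthInsurance.csv.

```r
# Quartile (Bowley) skewness: near 0 when the quartiles sit symmetrically
# around the median, positive when the upper half is more spread out.
# `quartile_skew` is a hypothetical helper name used for illustration.
quartile_skew <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.50, 0.75), names = FALSE)
  (q[3] + q[1] - 2 * q[2]) / (q[3] - q[1])
}

# Hypothetical toy vectors, not taken from HealthInsurance.csv
symmetric_example <- c(1, 2, 3, 4, 5)
skewed_example <- c(1, 2, 2, 3, 14)

quartile_skew(symmetric_example)
quartile_skew(skewed_example)
```

Applied to the quartiles reported above (30, 39, 48 for age and 2, 3, 4 for family size), the same formula gives 0 in both cases. Because it only looks at the middle half of the data, it cannot detect a long tail beyond the quartiles.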
Here I used the dplyr library to rename some columns, since several of the original names were vague. I also created subsets containing only age or education alongside insurance status, to make the data easier to graph.
data_new <- data %>%
  mutate(
    Insurance = insurance, Self_Employed = selfemp, Age = age,
    Family_size = family, Ethnicity = ethnicity,
    Highest_Education_Attained = education, Region_in_US = region,
    Married = married, Healthy = health
  ) %>%
  select(-insurance, -selfemp, -age, -family, -ethnicity, -education,
         -region, -married, -health)
age_subset <- select(data_new, 'Age','Insurance')
head (age_subset,10)
education_subset <- select(data_new, 'Highest_Education_Attained','Insurance')
head (education_subset,10)
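The renaming step above copies each column with `mutate()` and then drops the original with `select()`. A possible alternative is dplyr's `rename()`, which changes the names in place and keeps the original column order. The sketch below is only an illustration on a hypothetical three-row data frame (`toy`); its values are invented and do not come from HealthInsurance.csv.

```r
library(dplyr)

# Hypothetical three-row stand-in for the HealthInsurance data;
# the values are made up for illustration only.
toy <- data.frame(
  insurance = c("yes", "no", "yes"),
  age = c(25, 40, 61),
  education = c("highschool", "bachelor", "none")
)

# rename(new_name = old_name) changes names in place and keeps column order
toy_renamed <- toy %>%
  rename(
    Insurance = insurance,
    Age = age,
    Highest_Education_Attained = education
  )

names(toy_renamed)
```

Unlike the `mutate()` plus `select()` approach, this leaves the columns in their original positions, so the two approaches produce the same data but possibly in a different column order.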
Below I used bar graphs to represent my data, with age binned into ranges. The graphs show how insurance status is distributed across education levels and age groups in my data set.
# Counts of insured and uninsured people at each education level
ggplot(education_subset, aes(x = Highest_Education_Attained, fill = Insurance)) +
  geom_bar(position = "stack") +
  labs(title = "Education Distribution by Insurance", x = "Highest Education Attained", y = "Count", fill = "Insurance")
# Same data as shares within each education level (bars scaled to 100%)
ggplot(education_subset, aes(x = Highest_Education_Attained, fill = Insurance)) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::percent) +
  labs(title = "Education Level vs Insurance (Percentage)", x = "Highest Education Attained", y = "Percentage", fill = "Insurance")
# Bin ages into ranges; intervals are right-closed, so age 20 falls in "0-20"
data_new$Age_Distribution <- cut(data_new$Age, breaks = c(0, 20, 30, 40, 50, 60, Inf), labels = c("0-20", "21-30", "31-40", "41-50", "51-60", "61+"))
ggplot(data_new, aes(x = Age_Distribution, fill = Insurance)) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::percent) +
  labs(title = "Age vs Insurance (Percentage)", x = "Age Distribution", y = "Percentage", fill = "Insurance")
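To make the age binning above easier to follow, here is a small sketch of how `cut()` assigns ages to the labeled ranges. By default the intervals are closed on the right, so an age exactly on a break (such as 20 or 30) goes into the lower range. The vector `toy_ages` is hypothetical and chosen only to hit the boundaries.

```r
# Hypothetical ages chosen to land on and around the break points
toy_ages <- c(18, 20, 21, 30, 31, 62)

# Same breaks and labels as in the analysis; intervals are (0,20], (20,30], ...
toy_bins <- cut(
  toy_ages,
  breaks = c(0, 20, 30, 40, 50, 60, Inf),
  labels = c("0-20", "21-30", "31-40", "41-50", "51-60", "61+")
)

# Show each age next to the range it was assigned to
data.frame(age = toy_ages, range = toy_bins)
```

Since ages in this data set are whole numbers, the right-closed intervals match the labels: 20 is placed in "0-20" and 21 in "21-30".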
The data file is too large to read directly from GitHub, which is why I could not connect to it there even though the code itself is correct. I have included that code below, commented out, as proof of my work.
## The file is too large to read from GitHub, even though the code is correct
## github_url <- "https://raw.githubusercontent.com/MAB592/R-Final-Project/main/HealthInsurance.csv"
## git_data <- read.csv(github_url)
## head(git_data, 10)
Based on the provided data, both age and education appear to be positively associated with insurance coverage. The stacked bar graph shows that a high school degree is the most common education level in this data set, and the summary statistics show that ages are centered in the 30-40 range (the median age is 39).
To gain deeper insight, I also plotted the data as percentages. The education graph shows that individuals with higher levels of education are more likely to have insurance coverage than those with lower levels of education. The age graph shows a steady increase in insurance coverage across age ranges until the 40s, where the trend seems to level off.
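The percentages shown in the graphs can also be computed as numbers, which makes the comparison between groups easier to state precisely. The sketch below uses base R's `table()` and `prop.table()` on a hypothetical six-row data frame (`toy`); the values are invented and are not results from HealthInsurance.csv.

```r
# Hypothetical six-row example; the values are made up for illustration
toy <- data.frame(
  Highest_Education_Attained = c("highschool", "highschool", "highschool",
                                 "highschool", "bachelor", "bachelor"),
  Insurance = c("yes", "no", "yes", "no", "yes", "yes")
)

# Cross-tabulate education level (rows) against insurance status (columns)
counts <- table(toy$Highest_Education_Attained, toy$Insurance)

# margin = 1 divides each row by its row total, giving the share of
# insured and uninsured people within each education level
row_share <- prop.table(counts, margin = 1)

round(100 * row_share, 1)
```

Running the same two calls on `education_subset` or on the `Age_Distribution` column would give the exact shares behind the filled bar graphs.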
Overall, the data suggest that insurance coverage increases with both age and education level.