This is my demonstration of having met the STA 631 course objectives.
Executive Summary
This report provides a comprehensive analysis of a statistical modeling project, focusing on the application of probability theory, generalized linear models (GLMs), model selection, programming software (R), and effective communication of results to a general audience, in fulfillment of the STA 631 course objectives. The project aims to develop a predictive model for a specific data context, using linear regression as the primary statistical method. Key findings and insights from the analysis are presented, along with recommendations for further exploration and refinement.
Introduction
Statistical modeling is a powerful tool for analyzing and interpreting complex data sets, providing insights into relationships between variables and making predictions based on observed patterns. Probability theory forms the foundation of statistical modeling, allowing researchers to quantify uncertainty and make probabilistic inferences about parameters of interest. In this project, I explored the application of statistical modeling techniques to a real-world data set, with a focus on linear regression as a means of predicting a continuous outcome variable.
OBJECTIVE 1: Probability as a Foundation of Statistical Modeling
Probability theory plays a fundamental role in statistical modeling, underpinning the estimation of model parameters and the assessment of uncertainty. In our analysis, we leverage probability distributions to model the variability in the data and make probabilistic statements about the parameters of our linear regression model. Specifically, we assume that the errors follow a normal distribution with a mean of zero and constant variance, which is a common assumption in linear regression modeling.
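In symbols, the assumed model is y = β0 + β1x1 + … + βpxp + ε, where the error term ε is taken to be normally distributed with mean zero and constant variance σ².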
Maximum likelihood estimation (MLE) is a key concept in statistical modeling, involving the optimization of the likelihood function to estimate the parameters of the model. In our project, we employ the lm() function in R to fit the linear regression model to the data; under the normal-error assumption, the least-squares estimates it returns coincide with the maximum likelihood estimates of the regression coefficients. By maximizing the likelihood function, we obtain parameter estimates that best explain the observed data, allowing us to make predictions and draw inferences about the relationship between the predictor variables and the outcome variable.
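As a small illustration of this equivalence, here is a sketch on simulated data (not the project data) comparing the coefficients returned by lm() with those obtained by directly maximizing the normal log-likelihood with optim():
# Simulate a small data set purely for illustration
set.seed(631)
x <- rnorm(100)
y <- 2 + 3 * x + rnorm(100)
# Least-squares fit via lm()
ols_fit <- lm(y ~ x)
# Direct maximization of the normal log-likelihood over (intercept, slope, log sigma)
neg_loglik <- function(par) {
  mu <- par[1] + par[2] * x
  -sum(dnorm(y, mean = mu, sd = exp(par[3]), log = TRUE))
}
mle_fit <- optim(c(0, 0, 0), neg_loglik)
coef(ols_fit)    # least-squares estimates
mle_fit$par[1:2] # maximum likelihood estimates: the same up to optimizer tolerance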
OBJECTIVE 2: Application of Generalized Linear Models (GLMs)
Generalized linear models (GLMs) provide a flexible framework for modeling a wide range of data types and response distributions. In our project, we focus on fitting a linear regression model, which is a type of GLM suitable for continuous outcome variables. However, GLMs encompass a variety of modeling techniques, including logistic regression for binary outcomes and Poisson regression for count data.
The choice of the appropriate GLM depends on the nature of the data and the research question of interest. In our analysis, we carefully consider the characteristics of the outcome variable and select the linear regression model as the most appropriate approach for our data context. By understanding the principles of GLMs, we are able to effectively model the relationship between the predictor variables and the continuous outcome variable, providing valuable insights into the underlying processes.
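To illustrate how a single interface covers these cases, here is a brief sketch on simulated data (not the project data); the family argument of glm() selects the response distribution and link function:
set.seed(631)
d <- data.frame(x = rnorm(200))
d$y_cont  <- 1 + 2 * d$x + rnorm(200)             # continuous outcome
d$y_bin   <- rbinom(200, 1, plogis(-0.5 + d$x))   # binary outcome
d$y_count <- rpois(200, exp(0.2 + 0.5 * d$x))     # count outcome
fit_gaussian <- glm(y_cont ~ x, family = gaussian(), data = d)  # equivalent to lm()
fit_logistic <- glm(y_bin ~ x, family = binomial(), data = d)   # logistic regression
fit_poisson  <- glm(y_count ~ x, family = poisson(), data = d)  # Poisson regression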
OBJECTIVE 3: Model Selection
Model selection is a critical step in the statistical modeling process, involving the comparison of different candidate models to identify the most appropriate one based on criteria such as goodness of fit, complexity, and predictive performance. While our analysis focuses on fitting a single linear regression model, model selection remains an important consideration for future exploration.
In practice, model selection may involve fitting alternative models with different sets of predictor variables, exploring nonlinear relationships, or considering alternative distributional assumptions. By systematically comparing the performance of different models, researchers can identify the best-fitting model for their data context and draw more robust conclusions from the analysis.
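As a hedged sketch of what such a comparison could look like for this project, the snippet below assumes the med_hos_census data frame and the predictors introduced later in this report, and compares nested candidate models by AIC, BIC, and adjusted R-squared:
m1 <- lm(Avg_Tot_Pymt_Amt ~ Avg_Mdcr_Pymt_Amt, data = med_hos_census)
m2 <- lm(Avg_Tot_Pymt_Amt ~ Avg_Mdcr_Pymt_Amt + B16010_017E, data = med_hos_census)
m3 <- lm(Avg_Tot_Pymt_Amt ~ Avg_Mdcr_Pymt_Amt + B16010_017E + B16010_052M, data = med_hos_census)
# Lower AIC/BIC and higher adjusted R-squared indicate a better fit-complexity trade-off
AIC(m1, m2, m3)
BIC(m1, m2, m3)
sapply(list(m1 = m1, m2 = m2, m3 = m3), function(m) summary(m)$adj.r.squared)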
OBJECTIVE 4: Use of Programming Software (R)
Programming software such as R provides a powerful platform for conducting statistical analysis and building predictive models. In these projects, I leverage the capabilities of R to fit and assess the linear regression model, utilizing functions and packages specifically designed for statistical modeling.
The lm() function in R is used to fit the linear regression model to the data, providing estimates of the regression coefficients and other relevant statistics. Additionally, we utilize various R packages for data manipulation, visualization, and model diagnostics, enhancing the efficiency and reproducibility of our analysis.
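For concreteness, a generic sketch of this workflow using the built-in mtcars data set (for illustration only) looks as follows:
fit <- lm(mpg ~ wt + hp, data = mtcars)
summary(fit)          # coefficient estimates, standard errors, R-squared
confint(fit)          # confidence intervals for the regression coefficients
plot(fit, which = 1)  # residuals versus fitted values, a basic diagnostic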
OBJECTIVE 5: Communication of Results to a General Audience
Effective communication of statistical results is essential for ensuring that the findings are accessible and actionable to a general audience. In this project, I employ clear and concise language to describe the key findings and insights from the analysis, avoiding technical jargon and providing intuitive explanations of complex concepts.
Visualizations such as scatter plots, regression diagnostics, and summary tables are used to illustrate the relationships between variables and summarize the main results of the analysis. Additionally, I provide interpretations of the regression coefficients and discuss the practical implications of the findings in the context of the research question.
1. Introduction: The aim of this project was to analyze Medicare hospital payment data and develop a predictive model to estimate the total payment amount based on various factors. The dataset used for analysis contains information on hospital payments, Medicare payments, and demographic variables.
library(tidymodels)
library(tidyverse)
library(dplyr)
library(lmtest)
library(car)
medicare_provider <- read.csv("Medicare_Inpatient_Hospital_by_Provider_and_Service_2018_data.csv")
hosp_gen_info <- read.csv("Hospital General Information.csv")
census <- read.csv("ACSDT5Y2017.B16010-Data.csv")
2. Data Preprocessing:
The dataset was first cleaned to handle missing values and outliers. Relevant variables were selected for analysis, including average Medicare payment amount, demographic variables, and hospital payment information. Correlation analysis was performed to identify highly correlated variables, and redundant variables were dropped to avoid multicollinearity.
# Modify the census data set: drop the first (header) row and strip the geography prefix from GEO_ID so that it holds the five-digit ZIP code used later for joining
census <- census[-1,]
census$GEO_ID <- substr(census$GEO_ID, 10, nchar(census$GEO_ID))
# Modifying Hospital General Information
hosp_gen_info$ZIP.Code <- as.character(hosp_gen_info$ZIP.Code)
hosp_gen_info$ZIP.Code <- ifelse(nchar(hosp_gen_info$ZIP.Code) == 5, hosp_gen_info$ZIP.Code,
                                 ifelse(nchar(hosp_gen_info$ZIP.Code) == 4,
                                        paste0("0", hosp_gen_info$ZIP.Code),
                                        paste0("00", hosp_gen_info$ZIP.Code)))
#Here we modify the ZIP.Code column in the hosp_gen_info data frame by ensuring that all values are five characters long with leading zeroes if necessary.
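An equivalent, more compact alternative (a sketch): stringr::str_pad(), loaded with the tidyverse, left-pads the shorter codes with zeros in a single call, and is idempotent if run after the ifelse() version above.
hosp_gen_info$ZIP.Code <- str_pad(hosp_gen_info$ZIP.Code, width = 5, side = "left", pad = "0")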
# Modifying medicare_provider Data
medicare_provider <- medicare_provider %>% mutate(Rndrng_Prvdr_CCN = as.character(Rndrng_Prvdr_CCN))
hosp_gen_info <- hosp_gen_info %>% mutate(Provider.ID = as.character(Provider.ID))
# Here we convert the Rndrng_Prvdr_CCN and Provider.ID columns in the medicare_provider and hosp_gen_info data frames, respectively, from numeric to character so that the join keys in the two tables have matching types.
med_hos_census <- medicare_provider %>%
select(Rndrng_Prvdr_CCN, Rndrng_Prvdr_State_Abrvtn, Rndrng_Prvdr_State_FIPS, Rndrng_Prvdr_Zip5, Rndrng_Prvdr_RUCA, DRG_Cd, Tot_Dschrgs,
Avg_Submtd_Cvrd_Chrg, Avg_Tot_Pymt_Amt, Avg_Mdcr_Pymt_Amt) %>%
inner_join(hosp_gen_info, by=c("Rndrng_Prvdr_CCN" = "Provider.ID")) %>%
select(Rndrng_Prvdr_State_Abrvtn, Rndrng_Prvdr_State_FIPS, Rndrng_Prvdr_Zip5, Rndrng_Prvdr_RUCA, DRG_Cd, Tot_Dschrgs,
Avg_Submtd_Cvrd_Chrg, Avg_Tot_Pymt_Amt, Avg_Mdcr_Pymt_Amt, ZIP.Code, State, Hospital.Ownership, Hospital.overall.rating) %>%
inner_join(census, by=c("ZIP.Code" = "GEO_ID")) %>%
select(-NAME)
#Here we merge data from multiple data frames to create a new data frame called med_hos_census that contains information about Medicare providers, hospitals, and census data based on their zip codes.
sort(table(med_hos_census$Hospital.Ownership), decreasing = TRUE)
##  Voluntary non-profit - Private                93321
##  Proprietary                                   27809
##  Voluntary non-profit - Church                 17382
##  Voluntary non-profit - Other                  17026
##  Government - Hospital District or Authority   11765
##  Government - Local                              5397
##  Government - State                              2954
##  Physician                                        878
##  Government - Federal                             521
##  Tribal                                            19
# Tabulate the hospital ownership types in med_hos_census, counting how often each type occurs and sorting the counts in descending order.
#Converting the Hospital.Ownership column in the med_hos_census data frame to a factor and then to a numeric variable
med_hos_census <- med_hos_census %>% mutate(Hospital.Ownership = factor(Hospital.Ownership)) %>%
mutate(Hospital.Ownership = as.numeric(Hospital.Ownership))
med_hos_census <- med_hos_census %>%
select(-ZIP.Code, -Rndrng_Prvdr_State_Abrvtn)
# Removing the ZIP.Code and Rndrng_Prvdr_State_Abrvtn columns from the med_hos_census data frame
# Drop rows where the overall hospital rating is "Not Available"
med_hos_census <- med_hos_census %>%
  filter(!str_detect(Hospital.overall.rating, "Not Available"))
med_hos_census <- med_hos_census %>%
distinct()
# Convert columns 10 to 117 (the hospital rating and the census variables) to numeric
i <- 10:117
med_hos_census[, i] <- apply(med_hos_census[, i], 2, as.numeric)
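The same conversion can be written in tidyverse style; the line below is an equivalent sketch using dplyr::across() and is idempotent if run after the base-R version above.
med_hos_census <- med_hos_census %>% mutate(across(10:117, as.numeric))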
str(med_hos_census)
## 'data.frame': 176095 obs. of 117 variables:
## $ Rndrng_Prvdr_State_FIPS: int 1 1 1 1 1 1 1 1 1 1 ...
## $ Rndrng_Prvdr_Zip5 : int 36301 36301 36301 36301 36301 36301 36301 36301 36301 36301 ...
## $ Rndrng_Prvdr_RUCA : num 1 1 1 1 1 1 1 1 1 1 ...
## $ DRG_Cd : int 3 23 25 38 39 57 64 65 66 69 ...
## $ Tot_Dschrgs : int 13 33 26 11 64 30 115 107 17 53 ...
## $ Avg_Submtd_Cvrd_Chrg : num 368434 148677 118718 74449 46628 ...
## $ Avg_Tot_Pymt_Amt : num 81541 29062 22442 9546 6468 ...
## $ Avg_Mdcr_Pymt_Amt : num 80435 27997 19592 7562 5073 ...
## $ State : chr "AL" "AL" "AL" "AL" ...
## $ Hospital.Ownership : num 2 2 2 2 2 2 2 2 2 2 ...
## $ Hospital.overall.rating: num 3 3 3 3 3 3 3 3 3 3 ...
## $ B16010_001E : num 25008 25008 25008 25008 25008 ...
## $ B16010_001M : num 426 426 426 426 426 426 426 426 426 426 ...
## $ B16010_002E : num 3926 3926 3926 3926 3926 ...
## $ B16010_002M : num 237 237 237 237 237 237 237 237 237 237 ...
## $ B16010_003E : num 1587 1587 1587 1587 1587 ...
## $ B16010_003M : num 186 186 186 186 186 186 186 186 186 186 ...
## $ B16010_004E : num 1447 1447 1447 1447 1447 ...
## $ B16010_004M : num 171 171 171 171 171 171 171 171 171 171 ...
## $ B16010_005E : num 123 123 123 123 123 123 123 123 123 123 ...
## $ B16010_005M : num 70 70 70 70 70 70 70 70 70 70 ...
## $ B16010_006E : num 0 0 0 0 0 0 0 0 0 0 ...
## $ B16010_006M : num 24 24 24 24 24 24 24 24 24 24 ...
## $ B16010_007E : num 15 15 15 15 15 15 15 15 15 15 ...
## $ B16010_007M : num 13 13 13 13 13 13 13 13 13 13 ...
## $ B16010_008E : num 2 2 2 2 2 2 2 2 2 2 ...
## $ B16010_008M : num 3 3 3 3 3 3 3 3 3 3 ...
## $ B16010_009E : num 2339 2339 2339 2339 2339 ...
## $ B16010_009M : num 173 173 173 173 173 173 173 173 173 173 ...
## $ B16010_010E : num 2218 2218 2218 2218 2218 ...
## $ B16010_010M : num 173 173 173 173 173 173 173 173 173 173 ...
## $ B16010_011E : num 92 92 92 92 92 92 92 92 92 92 ...
## $ B16010_011M : num 39 39 39 39 39 39 39 39 39 39 ...
## $ B16010_012E : num 5 5 5 5 5 5 5 5 5 5 ...
## $ B16010_012M : num 6 6 6 6 6 6 6 6 6 6 ...
## $ B16010_013E : num 22 22 22 22 22 22 22 22 22 22 ...
## $ B16010_013M : num 21 21 21 21 21 21 21 21 21 21 ...
## $ B16010_014E : num 2 2 2 2 2 2 2 2 2 2 ...
## $ B16010_014M : num 3 3 3 3 3 3 3 3 3 3 ...
## $ B16010_015E : num 8788 8788 8788 8788 8788 ...
## $ B16010_015M : num 391 391 391 391 391 391 391 391 391 391 ...
## $ B16010_016E : num 4916 4916 4916 4916 4916 ...
## $ B16010_016M : num 298 298 298 298 298 298 298 298 298 298 ...
## $ B16010_017E : num 4712 4712 4712 4712 4712 ...
## $ B16010_017M : num 287 287 287 287 287 287 287 287 287 287 ...
## $ B16010_018E : num 170 170 170 170 170 170 170 170 170 170 ...
## $ B16010_018M : num 70 70 70 70 70 70 70 70 70 70 ...
## $ B16010_019E : num 4 4 4 4 4 4 4 4 4 4 ...
## $ B16010_019M : num 6 6 6 6 6 6 6 6 6 6 ...
## $ B16010_020E : num 30 30 30 30 30 30 30 30 30 30 ...
## $ B16010_020M : num 28 28 28 28 28 28 28 28 28 28 ...
## $ B16010_021E : num 0 0 0 0 0 0 0 0 0 0 ...
## $ B16010_021M : num 24 24 24 24 24 24 24 24 24 24 ...
## $ B16010_022E : num 3872 3872 3872 3872 3872 ...
## $ B16010_022M : num 262 262 262 262 262 262 262 262 262 262 ...
## $ B16010_023E : num 3838 3838 3838 3838 3838 ...
## $ B16010_023M : num 261 261 261 261 261 261 261 261 261 261 ...
## $ B16010_024E : num 9 9 9 9 9 9 9 9 9 9 ...
## $ B16010_024M : num 8 8 8 8 8 8 8 8 8 8 ...
## $ B16010_025E : num 11 11 11 11 11 11 11 11 11 11 ...
## $ B16010_025M : num 12 12 12 12 12 12 12 12 12 12 ...
## $ B16010_026E : num 14 14 14 14 14 14 14 14 14 14 ...
## $ B16010_026M : num 14 14 14 14 14 14 14 14 14 14 ...
## $ B16010_027E : num 0 0 0 0 0 0 0 0 0 0 ...
## $ B16010_027M : num 24 24 24 24 24 24 24 24 24 24 ...
## $ B16010_028E : num 8106 8106 8106 8106 8106 ...
## $ B16010_028M : num 390 390 390 390 390 390 390 390 390 390 ...
## $ B16010_029E : num 5354 5354 5354 5354 5354 ...
## $ B16010_029M : num 314 314 314 314 314 314 314 314 314 314 ...
## $ B16010_030E : num 5185 5185 5185 5185 5185 ...
## $ B16010_030M : num 311 311 311 311 311 311 311 311 311 311 ...
## $ B16010_031E : num 137 137 137 137 137 137 137 137 137 137 ...
## $ B16010_031M : num 61 61 61 61 61 61 61 61 61 61 ...
## $ B16010_032E : num 14 14 14 14 14 14 14 14 14 14 ...
## $ B16010_032M : num 8 8 8 8 8 8 8 8 8 8 ...
## $ B16010_033E : num 15 15 15 15 15 15 15 15 15 15 ...
## $ B16010_033M : num 10 10 10 10 10 10 10 10 10 10 ...
## $ B16010_034E : num 3 3 3 3 3 3 3 3 3 3 ...
## $ B16010_034M : num 5 5 5 5 5 5 5 5 5 5 ...
## $ B16010_035E : num 2752 2752 2752 2752 2752 ...
## $ B16010_035M : num 190 190 190 190 190 190 190 190 190 190 ...
## $ B16010_036E : num 2687 2687 2687 2687 2687 ...
## $ B16010_036M : num 182 182 182 182 182 182 182 182 182 182 ...
## $ B16010_037E : num 21 21 21 21 21 21 21 21 21 21 ...
## $ B16010_037M : num 13 13 13 13 13 13 13 13 13 13 ...
## $ B16010_038E : num 35 35 35 35 35 35 35 35 35 35 ...
## $ B16010_038M : num 22 22 22 22 22 22 22 22 22 22 ...
## $ B16010_039E : num 9 9 9 9 9 9 9 9 9 9 ...
## $ B16010_039M : num 9 9 9 9 9 9 9 9 9 9 ...
## $ B16010_040E : num 0 0 0 0 0 0 0 0 0 0 ...
## $ B16010_040M : num 24 24 24 24 24 24 24 24 24 24 ...
## $ B16010_041E : num 4188 4188 4188 4188 4188 ...
## $ B16010_041M : num 247 247 247 247 247 247 247 247 247 247 ...
## $ B16010_042E : num 2933 2933 2933 2933 2933 ...
## $ B16010_042M : num 217 217 217 217 217 217 217 217 217 217 ...
## $ B16010_043E : num 2851 2851 2851 2851 2851 ...
## $ B16010_043M : num 213 213 213 213 213 213 213 213 213 213 ...
## $ B16010_044E : num 31 31 31 31 31 31 31 31 31 31 ...
## $ B16010_044M : num 12 12 12 12 12 12 12 12 12 12 ...
## [list output truncated]
# Restrict the analysis to a selected subset of DRG codes
med_hos_census <- med_hos_census[med_hos_census$DRG_Cd %in% c("280", "281", "282", "246", "247", "248", "249", "250", "251", "252"), ]
# install.packages("maps")
library(maps)
# Group data by state and find the average amount for each state
avg_amounts <- med_hos_census %>%
group_by(State) %>%
summarise(avg_amount = mean(Avg_Tot_Pymt_Amt))
# use state.abb and state.name to match abbreviations to full names, and replace missing or unrecognized abbreviations with "Washington, DC"
avg_amounts$State <- ifelse(avg_amounts$State %in% state.abb, state.name[match(avg_amounts$State, state.abb)], "Washington, DC")
# Print the result
print(avg_amounts$State)
## [1] "Alaska" "Alabama" "Arkansas" "Arizona"
## [5] "California" "Colorado" "Connecticut" "Washington, DC"
## [9] "Delaware" "Florida" "Georgia" "Hawaii"
## [13] "Iowa" "Idaho" "Illinois" "Indiana"
## [17] "Kansas" "Kentucky" "Louisiana" "Massachusetts"
## [21] "Maryland" "Maine" "Michigan" "Minnesota"
## [25] "Missouri" "Mississippi" "Montana" "North Carolina"
## [29] "North Dakota" "Nebraska" "New Hampshire" "New Jersey"
## [33] "New Mexico" "Nevada" "New York" "Ohio"
## [37] "Oklahoma" "Oregon" "Pennsylvania" "Rhode Island"
## [41] "South Carolina" "South Dakota" "Tennessee" "Texas"
## [45] "Utah" "Virginia" "Vermont" "Washington"
## [49] "Wisconsin" "West Virginia" "Wyoming"
# Convert the state names to lower
avg_amounts$State <- tolower(avg_amounts$State)
# join the map data and the avg_amounts data
map_df <- left_join(map_data("state"), avg_amounts, by = c("region" = "State"))
# Create the heatmap using ggplot2
ggplot(data = map_df, aes(x = long, y = lat, group = group, fill = avg_amount)) +
geom_polygon() +
scale_fill_gradient(low = "white", high = "red", name = "Average Amount") +
theme_void()
# Selecting the best subset of variables used for the modelling process
med_hos_census <- med_hos_census %>%
select(-State)
# Compute the correlation of every column with the response, Avg_Tot_Pymt_Amt
# (the response's correlation with itself is trivially 1 and is ignored below)
correlations <- cor(med_hos_census, med_hos_census$Avg_Tot_Pymt_Amt)
# Select the N variables with the highest absolute correlation with the response
N <- 10
top_vars <- rownames(correlations)[order(abs(correlations), decreasing = TRUE)[1:N]]
abs(correlations) > 0.11
## [,1]
## Rndrng_Prvdr_State_FIPS FALSE
## Rndrng_Prvdr_Zip5 FALSE
## Rndrng_Prvdr_RUCA FALSE
## DRG_Cd TRUE
## Tot_Dschrgs FALSE
## Avg_Submtd_Cvrd_Chrg TRUE
## Avg_Tot_Pymt_Amt TRUE
## Avg_Mdcr_Pymt_Amt TRUE
## Hospital.Ownership FALSE
## Hospital.overall.rating FALSE
## B16010_001E FALSE
## B16010_001M FALSE
## B16010_002E FALSE
## B16010_002M FALSE
## B16010_003E FALSE
## B16010_003M FALSE
## B16010_004E FALSE
## B16010_004M FALSE
## B16010_005E FALSE
## B16010_005M FALSE
## B16010_006E FALSE
## B16010_006M FALSE
## B16010_007E FALSE
## B16010_007M FALSE
## B16010_008E FALSE
## B16010_008M FALSE
## B16010_009E FALSE
## B16010_009M FALSE
## B16010_010E FALSE
## B16010_010M FALSE
## B16010_011E FALSE
## B16010_011M TRUE
## B16010_012E FALSE
## B16010_012M FALSE
## B16010_013E TRUE
## B16010_013M TRUE
## B16010_014E FALSE
## B16010_014M FALSE
## B16010_015E FALSE
## B16010_015M FALSE
## B16010_016E FALSE
## B16010_016M FALSE
## B16010_017E TRUE
## B16010_017M FALSE
## B16010_018E FALSE
## B16010_018M FALSE
## B16010_019E FALSE
## B16010_019M FALSE
## B16010_020E FALSE
## B16010_020M TRUE
## B16010_021E FALSE
## B16010_021M FALSE
## B16010_022E FALSE
## B16010_022M FALSE
## B16010_023E TRUE
## B16010_023M TRUE
## B16010_024E FALSE
## B16010_024M FALSE
## B16010_025E FALSE
## B16010_025M FALSE
## B16010_026E TRUE
## B16010_026M TRUE
## B16010_027E FALSE
## B16010_027M FALSE
## B16010_028E FALSE
## B16010_028M FALSE
## B16010_029E FALSE
## B16010_029M FALSE
## B16010_030E FALSE
## B16010_030M FALSE
## B16010_031E FALSE
## B16010_031M FALSE
## B16010_032E FALSE
## B16010_032M FALSE
## B16010_033E TRUE
## B16010_033M TRUE
## B16010_034E FALSE
## B16010_034M FALSE
## B16010_035E FALSE
## B16010_035M FALSE
## B16010_036E FALSE
## B16010_036M FALSE
## B16010_037E FALSE
## B16010_037M FALSE
## B16010_038E FALSE
## B16010_038M FALSE
## B16010_039E TRUE
## B16010_039M TRUE
## B16010_040E FALSE
## B16010_040M FALSE
## B16010_041E FALSE
## B16010_041M FALSE
## B16010_042E FALSE
## B16010_042M FALSE
## B16010_043E FALSE
## B16010_043M FALSE
## B16010_044E FALSE
## B16010_044M TRUE
## B16010_045E TRUE
## B16010_045M TRUE
## B16010_046E TRUE
## B16010_046M TRUE
## B16010_047E FALSE
## B16010_047M TRUE
## B16010_048E FALSE
## B16010_048M FALSE
## B16010_049E FALSE
## B16010_049M FALSE
## B16010_050E FALSE
## B16010_050M FALSE
## B16010_051E FALSE
## B16010_051M TRUE
## B16010_052E TRUE
## B16010_052M TRUE
## B16010_053E FALSE
## B16010_053M FALSE
# Filtering out correlations that are considered weak or not significant, based on a predetermined threshold value (in our case, 0.11).
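The hard-coded list of variable names used below could also be derived programmatically from this threshold; a small sketch, assuming the correlations object computed above (selected_vars is a hypothetical name used only for illustration):
# Variables whose absolute correlation with the response exceeds 0.11, excluding the response itself
selected_vars <- setdiff(rownames(correlations)[abs(correlations) > 0.11], "Avg_Tot_Pymt_Amt")
selected_vars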
library(ggplot2)
# Select top correlated variables
top_vars <- c("DRG_Cd", "Avg_Submtd_Cvrd_Chrg", "Avg_Mdcr_Pymt_Amt",
"B16010_011M", "B16010_013E", "B16010_013M", "B16010_017E", "B16010_020M",
"B16010_023E", "B16010_023M", "B16010_026E", "B16010_026M", "B16010_033E",
"B16010_033M", "B16010_039E", "B16010_039M", "B16010_044M", "B16010_045E",
"B16010_045M", "B16010_046E", "B16010_046M", "B16010_047M", "B16010_051M",
"B16010_052E", "B16010_052M")
# Create scatter plots for top correlated variables
for (var in top_vars) {
print(ggplot(med_hos_census, aes(x = Avg_Tot_Pymt_Amt, y = .data[[var]])) +
geom_point() +
labs(title = paste0("Scatter plot of ", var, " vs. Avg_Tot_Pymt_Amt"),
x = "Avg_Tot_Pymt_Amt",
y = var))
}
# Calculate correlation matrix for top variables
corr_matrix <- cor(med_hos_census[, top_vars])
# Reshape correlation matrix
corr_df <- reshape2::melt(corr_matrix)
# Create heatmap of correlation matrix
ggplot(corr_df, aes(Var2, Var1, fill = value)) +
geom_tile() +
scale_fill_gradient(low = "blue", high = "red") +
labs(title = "Correlation heatmap of top variables",
x = "Variable",
y = "Variable",
fill = "Correlation")
corr_matrix
## DRG_Cd Avg_Submtd_Cvrd_Chrg Avg_Mdcr_Pymt_Amt
## DRG_Cd 1.0000000000 -0.59504752 -0.6843227
## Avg_Submtd_Cvrd_Chrg -0.5950475186 1.00000000 0.6303394
## Avg_Mdcr_Pymt_Amt -0.6843226998 0.63033937 1.0000000
## B16010_011M -0.0093420444 0.19526403 0.1239541
## B16010_013E -0.0058823638 0.05009825 0.1204442
## B16010_013M -0.0184633360 0.11127208 0.1670829
## B16010_017E 0.0175233248 -0.10244745 -0.1012966
## B16010_020M -0.0154880304 0.12944555 0.1228202
## B16010_023E 0.0271551045 -0.10141983 -0.1160702
## B16010_023M 0.0156219500 -0.04153485 -0.1024816
## B16010_026E -0.0105518003 0.08227392 0.1255242
## B16010_026M -0.0213639012 0.13328684 0.1528921
## B16010_033E -0.0148287598 0.13377558 0.1357005
## B16010_033M -0.0233867817 0.16023463 0.1363107
## B16010_039E -0.0122097734 0.12600861 0.1354829
## B16010_039M -0.0129539062 0.15360731 0.1494660
## B16010_044M -0.0209400804 0.19836964 0.1209962
## B16010_045E 0.0002387518 0.06715876 0.1227760
## B16010_045M -0.0179498055 0.10300747 0.1419226
## B16010_046E -0.0206744271 0.13451410 0.1575087
## B16010_046M -0.0311512626 0.16356878 0.1805934
## B16010_047M -0.0193404900 0.05494069 0.1128502
## B16010_051M -0.0130266533 0.10003926 0.1139435
## B16010_052E -0.0223397521 0.15270477 0.1627974
## B16010_052M -0.0281922848 0.17220592 0.1838104
## B16010_011M B16010_013E B16010_013M B16010_017E
## DRG_Cd -0.009342044 -0.005882364 -0.018463336 0.017523325
## Avg_Submtd_Cvrd_Chrg 0.195264027 0.050098252 0.111272078 -0.102447446
## Avg_Mdcr_Pymt_Amt 0.123954069 0.120444173 0.167082901 -0.101296638
## B16010_011M 1.000000000 0.220832062 0.348507648 0.079082277
## B16010_013E 0.220832062 1.000000000 0.802401796 -0.032519065
## B16010_013M 0.348507648 0.802401796 1.000000000 0.005674263
## B16010_017E 0.079082277 -0.032519065 0.005674263 1.000000000
## B16010_020M 0.347296803 0.680972086 0.720300816 0.095929149
## B16010_023E 0.025020734 -0.025247640 -0.021101275 0.851338726
## B16010_023M 0.152588385 0.013953141 0.069242694 0.743115601
## B16010_026E 0.228112158 0.911413965 0.747122867 -0.033544538
## B16010_026M 0.316182070 0.687411724 0.755955695 0.011342284
## B16010_033E 0.266717371 0.812131544 0.748912319 -0.012622458
## B16010_033M 0.322815409 0.619535923 0.713053042 0.040769347
## B16010_039E 0.245048815 0.825327604 0.725892046 -0.038938814
## B16010_039M 0.321924554 0.610399829 0.719529460 -0.000814911
## B16010_044M 0.654290276 0.177643171 0.326609617 -0.045663101
## B16010_045E 0.139610954 0.285631386 0.420131032 -0.098509503
## B16010_045M 0.195747560 0.278171478 0.440203378 -0.061217713
## B16010_046E 0.191190234 0.545553663 0.636534478 -0.119871077
## B16010_046M 0.256476155 0.439678753 0.630355905 -0.109811963
## B16010_047M 0.204974283 0.161054192 0.305060970 0.004850307
## B16010_051M 0.160318997 0.278894193 0.399099371 -0.067289427
## B16010_052E 0.177080659 0.558934703 0.627133041 -0.127124843
## B16010_052M 0.226799121 0.411043015 0.590723886 -0.109527128
## B16010_020M B16010_023E B16010_023M B16010_026E
## DRG_Cd -0.01548803 0.02715510 0.015621950 -0.010551800
## Avg_Submtd_Cvrd_Chrg 0.12944555 -0.10141983 -0.041534849 0.082273923
## Avg_Mdcr_Pymt_Amt 0.12282018 -0.11607018 -0.102481580 0.125524178
## B16010_011M 0.34729680 0.02502073 0.152588385 0.228112158
## B16010_013E 0.68097209 -0.02524764 0.013953141 0.911413965
## B16010_013M 0.72030082 -0.02110128 0.069242694 0.747122867
## B16010_017E 0.09592915 0.85133873 0.743115601 -0.033544538
## B16010_020M 1.00000000 0.02944421 0.118007342 0.709129670
## B16010_023E 0.02944421 1.00000000 0.840107594 -0.036207237
## B16010_023M 0.11800734 0.84010759 1.000000000 0.008247335
## B16010_026E 0.70912967 -0.03620724 0.008247335 1.000000000
## B16010_026M 0.74401109 -0.02077294 0.069987037 0.833830765
## B16010_033E 0.73515912 -0.04465949 0.011980065 0.880398985
## B16010_033M 0.74618785 -0.02837401 0.079092166 0.693686921
## B16010_039E 0.68710095 -0.05505210 0.002266774 0.922787851
## B16010_039M 0.67235566 -0.03717064 0.079207771 0.714824263
## B16010_044M 0.35239384 -0.11557204 0.065402059 0.207343564
## B16010_045E 0.34513295 -0.10734224 -0.045640518 0.320592778
## B16010_045M 0.40274012 -0.10670798 0.012812507 0.320799883
## B16010_046E 0.57712643 -0.13993746 -0.070764215 0.660391596
## B16010_046M 0.58894715 -0.15666498 -0.030636220 0.537631680
## B16010_047M 0.31302259 -0.06441251 0.056306141 0.192017714
## B16010_051M 0.35628314 -0.05471781 0.051655225 0.316007002
## B16010_052E 0.56550055 -0.14186545 -0.081527991 0.684808396
## B16010_052M 0.51316140 -0.13556540 -0.031345269 0.502989932
## B16010_026M B16010_033E B16010_033M B16010_039E
## DRG_Cd -0.02136390 -0.01482876 -0.02338678 -0.012209773
## Avg_Submtd_Cvrd_Chrg 0.13328684 0.13377558 0.16023463 0.126008608
## Avg_Mdcr_Pymt_Amt 0.15289210 0.13570048 0.13631073 0.135482911
## B16010_011M 0.31618207 0.26671737 0.32281541 0.245048815
## B16010_013E 0.68741172 0.81213154 0.61953592 0.825327604
## B16010_013M 0.75595569 0.74891232 0.71305304 0.725892046
## B16010_017E 0.01134228 -0.01262246 0.04076935 -0.038938814
## B16010_020M 0.74401109 0.73515912 0.74618785 0.687100954
## B16010_023E -0.02077294 -0.04465949 -0.02837401 -0.055052096
## B16010_023M 0.06998704 0.01198006 0.07909217 0.002266774
## B16010_026E 0.83383076 0.88039898 0.69368692 0.922787851
## B16010_026M 1.00000000 0.77201952 0.74645575 0.776211458
## B16010_033E 0.77201952 1.00000000 0.84481632 0.922057177
## B16010_033M 0.74645575 0.84481632 1.00000000 0.733041903
## B16010_039E 0.77621146 0.92205718 0.73304190 1.000000000
## B16010_039M 0.76152568 0.75193466 0.74518274 0.841020654
## B16010_044M 0.33491891 0.26314752 0.39693371 0.239489522
## B16010_045E 0.44087304 0.33261400 0.38720305 0.337120048
## B16010_045M 0.46710813 0.35657069 0.46850172 0.360386702
## B16010_046E 0.71162877 0.73980354 0.67257932 0.733807208
## B16010_046M 0.68653649 0.60600486 0.68393399 0.592229111
## B16010_047M 0.34353988 0.23793058 0.34845687 0.228496959
## B16010_051M 0.42197305 0.31872011 0.41375714 0.335149231
## B16010_052E 0.71279312 0.74482026 0.66630476 0.759122695
## B16010_052M 0.62267860 0.55979085 0.61624671 0.562627257
## B16010_039M B16010_044M B16010_045E B16010_045M
## DRG_Cd -0.012953906 -0.02094008 0.0002387518 -0.01794981
## Avg_Submtd_Cvrd_Chrg 0.153607309 0.19836964 0.0671587616 0.10300747
## Avg_Mdcr_Pymt_Amt 0.149465992 0.12099621 0.1227760445 0.14192264
## B16010_011M 0.321924554 0.65429028 0.1396109538 0.19574756
## B16010_013E 0.610399829 0.17764317 0.2856313857 0.27817148
## B16010_013M 0.719529460 0.32660962 0.4201310319 0.44020338
## B16010_017E -0.000814911 -0.04566310 -0.0985095033 -0.06121771
## B16010_020M 0.672355665 0.35239384 0.3451329475 0.40274012
## B16010_023E -0.037170642 -0.11557204 -0.1073422391 -0.10670798
## B16010_023M 0.079207771 0.06540206 -0.0456405176 0.01281251
## B16010_026E 0.714824263 0.20734356 0.3205927783 0.32079988
## B16010_026M 0.761525679 0.33491891 0.4408730367 0.46710813
## B16010_033E 0.751934665 0.26314752 0.3326140047 0.35657069
## B16010_033M 0.745182744 0.39693371 0.3872030528 0.46850172
## B16010_039E 0.841020654 0.23948952 0.3371200483 0.36038670
## B16010_039M 1.000000000 0.36011671 0.4169364683 0.47586834
## B16010_044M 0.360116711 1.00000000 0.3805215555 0.51544047
## B16010_045E 0.416936468 0.38052156 1.0000000000 0.85268067
## B16010_045M 0.475868335 0.51544047 0.8526806721 1.00000000
## B16010_046E 0.709656108 0.35517709 0.6665811800 0.64560456
## B16010_046M 0.689663940 0.50490238 0.6591301113 0.74710639
## B16010_047M 0.327867940 0.44426616 0.5053855782 0.60700751
## B16010_051M 0.429513581 0.43070554 0.7599227336 0.79956659
## B16010_052E 0.706475124 0.32638128 0.6003144474 0.59921628
## B16010_052M 0.643588208 0.41548978 0.5562221069 0.64019103
## B16010_046E B16010_046M B16010_047M B16010_051M
## DRG_Cd -0.02067443 -0.03115126 -0.019340490 -0.01302665
## Avg_Submtd_Cvrd_Chrg 0.13451410 0.16356878 0.054940690 0.10003926
## Avg_Mdcr_Pymt_Amt 0.15750867 0.18059345 0.112850240 0.11394349
## B16010_011M 0.19119023 0.25647615 0.204974283 0.16031900
## B16010_013E 0.54555366 0.43967875 0.161054192 0.27889419
## B16010_013M 0.63653448 0.63035591 0.305060970 0.39909937
## B16010_017E -0.11987108 -0.10981196 0.004850307 -0.06728943
## B16010_020M 0.57712643 0.58894715 0.313022592 0.35628314
## B16010_023E -0.13993746 -0.15666498 -0.064412509 -0.05471781
## B16010_023M -0.07076421 -0.03063622 0.056306141 0.05165522
## B16010_026E 0.66039160 0.53763168 0.192017714 0.31600700
## B16010_026M 0.71162877 0.68653649 0.343539883 0.42197305
## B16010_033E 0.73980354 0.60600486 0.237930576 0.31872011
## B16010_033M 0.67257932 0.68393399 0.348456869 0.41375714
## B16010_039E 0.73380721 0.59222911 0.228496959 0.33514923
## B16010_039M 0.70965611 0.68966394 0.327867940 0.42951358
## B16010_044M 0.35517709 0.50490238 0.444266156 0.43070554
## B16010_045E 0.66658118 0.65913011 0.505385578 0.75992273
## B16010_045M 0.64560456 0.74710639 0.607007510 0.79956659
## B16010_046E 1.00000000 0.85681425 0.403762289 0.56737006
## B16010_046M 0.85681425 1.00000000 0.533593477 0.66090785
## B16010_047M 0.40376229 0.53359348 1.000000000 0.53930347
## B16010_051M 0.56737006 0.66090785 0.539303472 1.00000000
## B16010_052E 0.94221661 0.81246151 0.385883151 0.55111592
## B16010_052M 0.75313393 0.82313911 0.473675368 0.59301946
## B16010_052E B16010_052M
## DRG_Cd -0.02233975 -0.02819228
## Avg_Submtd_Cvrd_Chrg 0.15270477 0.17220592
## Avg_Mdcr_Pymt_Amt 0.16279736 0.18381041
## B16010_011M 0.17708066 0.22679912
## B16010_013E 0.55893470 0.41104302
## B16010_013M 0.62713304 0.59072389
## B16010_017E -0.12712484 -0.10952713
## B16010_020M 0.56550055 0.51316140
## B16010_023E -0.14186545 -0.13556540
## B16010_023M -0.08152799 -0.03134527
## B16010_026E 0.68480840 0.50298993
## B16010_026M 0.71279312 0.62267860
## B16010_033E 0.74482026 0.55979085
## B16010_033M 0.66630476 0.61624671
## B16010_039E 0.75912269 0.56262726
## B16010_039M 0.70647512 0.64358821
## B16010_044M 0.32638128 0.41548978
## B16010_045E 0.60031445 0.55622211
## B16010_045M 0.59921628 0.64019103
## B16010_046E 0.94221661 0.75313393
## B16010_046M 0.81246151 0.82313911
## B16010_047M 0.38588315 0.47367537
## B16010_051M 0.55111592 0.59301946
## B16010_052E 1.00000000 0.84806140
## B16010_052M 0.84806140 1.00000000
# The correlation matrix
Columns that are highly correlated with one another are dropped to avoid multicollinearity. After this step, the three most useful predictors of Avg_Tot_Pymt_Amt are Avg_Mdcr_Pymt_Amt, B16010_017E, and B16010_052M.
3. Modeling Approach:
Linear regression was chosen as the modeling technique due to its interpretability and ease of implementation. The dataset was split into training (70%) and testing (30%) sets to train and evaluate the model, respectively. The linear regression model was built using the training set with the following predictors: Average Medicare Payment Amount, B16010_017E, and B16010_052M.
# Load the required package
library(caret)
# Set the seed for reproducibility
set.seed(123)
# Split the dataset into training and testing sets
trainIndex <- createDataPartition(med_hos_census$Avg_Tot_Pymt_Amt, p = 0.7, list = FALSE)
train <- med_hos_census[trainIndex, ]
test <- med_hos_census[-trainIndex, ]
# Build the linear regression model using the training set
lm_model <- lm(Avg_Tot_Pymt_Amt ~ Avg_Mdcr_Pymt_Amt + B16010_017E + B16010_052M, data = train)
# Make predictions on the testing set
predictions <- predict(lm_model, test)
# Calculate the R squared
r_squared <- cor(predictions, test$Avg_Tot_Pymt_Amt)^2
# Calculate the RMSE
rmse <- sqrt(mean((predictions - test$Avg_Tot_Pymt_Amt)^2))
# Print the results
cat("R squared:", r_squared, "\n")
## R squared: 0.9475828
cat("RMSE:", rmse, "\n")
## RMSE: 1800.396
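As a complement to the single train/test split, the same model could be assessed with k-fold cross-validation; the snippet below is a hedged sketch using caret's train() with 10-fold cross-validation on the full med_hos_census data.
set.seed(123)
cv_ctrl <- trainControl(method = "cv", number = 10)
cv_model <- train(Avg_Tot_Pymt_Amt ~ Avg_Mdcr_Pymt_Amt + B16010_017E + B16010_052M,
                  data = med_hos_census, method = "lm", trControl = cv_ctrl)
cv_model$results[, c("RMSE", "Rsquared")]   # cross-validated error and R-squared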
4. Model Evaluation:
The performance of the model was evaluated using two metrics: R-squared and root mean squared error (RMSE). The R-squared value of 0.9476 indicates that the model explains approximately 94.76% of the variance in the total payment amount. The RMSE value of $1,800.40 means that, on average, the model’s predictions deviate from the actual payment amounts by about $1,800.40.
5. Conclusion:
The linear regression model developed in this project demonstrates strong predictive performance in estimating Medicare hospital payments based on the selected predictors. The high R-squared value indicates that the model captures a significant portion of the variability in the total payment amount. The relatively low RMSE value suggests that the model’s predictions are close to the actual payment amounts on average. Further refinement of the model and validation using additional datasets could provide insights into improving prediction accuracy and generalizability.
6. Recommendations:
The findings of this analysis can be valuable for healthcare providers, policymakers, and researchers in understanding the factors influencing Medicare hospital payments and optimizing payment processes. Continuous monitoring and evaluation of the model’s performance are recommended to ensure its reliability and relevance in real-world healthcare settings.
7. Future Work:
Future work could involve exploring alternative modeling techniques, such as machine learning algorithms, to potentially improve prediction accuracy and robustness. Inclusion of additional features or variables, such as hospital characteristics and regional demographics, could enhance the model’s explanatory power and predictive performance. Overall, this project provides valuable insights into Medicare hospital payments and lays the foundation for further research and analysis in this domain.
8. Reflection:
Since I embarked on the journey into the field of data science, I have constantly sought opportunities to challenge and apply the skills I have learned. However, it wasn’t until this class that I truly had the chance to put my abilities to the test on my own. In previous experiences, I often worked with datasets that were already cleaned or in good shape. This time, I embraced the challenge of sourcing raw data from websites, formatting it, cleaning it, and merging multiple datasets to derive meaningful insights.
Throughout the course, I learned valuable lessons in time management and self-directed learning. With the freedom to pursue my own objectives, I was able to effectively manage my time and focus on tasks that aligned with the course objectives. Despite the freedom, I encountered the challenge of having numerous ideas and struggling to narrow them down to a single focus. However, this process allowed me to refine my decision-making skills and prioritize tasks effectively.
I actively engaged in the class community by participating in various activities, attending lectures, completing assignments, and actively contributing to discussions on platforms like Teams. One of the highlights of my participation was collaborating with two classmates on the Mini-competition, where we achieved the honor of being the first runners-up.
Participating in the Mini-competition was a rewarding experience that not only allowed me to apply my knowledge and skills but also provided an opportunity to collaborate with peers and showcase our collective abilities. The recognition we received in the form of a branded cup serves as a cherished memento of our success in this Statistics class and further motivates me to continue actively engaging in collaborative endeavors.
Here is the Mini-competition project:
Introduction
This report presents the analysis and findings of Group 1’s participation in the Linear Regression Mini-competition. The competition involved building a linear regression model to predict sentiment scores for news headlines based on various features.
knitr::opts_chunk$set(echo = TRUE)
Libraries
The analysis was conducted using R programming language. Several libraries were utilized for data manipulation, visualization, and modeling.
library(tidymodels)
library(tidyverse)
library(dplyr)
library(lmtest)
library(car)
News <- read.csv("D:/GVSU Winter 2024/STA 631/news.csv")
summary(News)
## IDLink Title Headline Source
## Min. : 1 Length:92431 Length:92431 Length:92431
## 1st Qu.: 24551 Class :character Class :character Class :character
## Median : 52449 Mode :character Mode :character Mode :character
## Mean : 51807
## 3rd Qu.: 76784
## Max. :104802
## Topic PublishDate SentimentTitle SentimentHeadline
## Length:92431 Length:92431 Min. :-0.950694 Min. :-0.75543
## Class :character Class :character 1st Qu.:-0.079025 1st Qu.:-0.11457
## Mode :character Mode :character Median : 0.000000 Median :-0.02606
## Mean :-0.005415 Mean :-0.02750
## 3rd Qu.: 0.064385 3rd Qu.: 0.05965
## Max. : 0.962354 Max. : 0.96465
## Facebook GooglePlus LinkedIn
## Min. : -1.0 Min. : -1.000 Min. : -1.00
## 1st Qu.: 0.0 1st Qu.: 0.000 1st Qu.: 0.00
## Median : 5.0 Median : 0.000 Median : 0.00
## Mean : 114.1 Mean : 3.928 Mean : 16.69
## 3rd Qu.: 34.0 3rd Qu.: 2.000 3rd Qu.: 4.00
## Max. :49211.0 Max. :1267.000 Max. :20341.00
Test and Train Data after Splitting
The dataset was split into training and testing sets, namely “news_train.csv” and “news_test.csv,” respectively. This splitting allows for the development and evaluation of the regression model.
news_train <- read.csv("D:/GVSU Winter 2024/STA 631/news_train.csv")
news_test <- read.csv("D:/GVSU Winter 2024/STA 631/news_test.csv")
Build a Linear Model
A linear regression model was constructed using the training data. The model aimed to predict the sentiment score of news headlines based on features such as the sentiment score of the title, social media metrics (Facebook, GooglePlus, LinkedIn), and the topic of the news.
The summary of the linear model provided insights into the coefficients of the predictors, their significance levels, and the overall model fit.
model <- lm(SentimentHeadline ~ SentimentTitle + Facebook + GooglePlus + LinkedIn + Topic, data = news_train)
summary(model)
##
## Call:
## lm(formula = SentimentHeadline ~ SentimentTitle + Facebook +
## GooglePlus + LinkedIn + Topic, data = news_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.73864 -0.08560 0.00257 0.08598 0.94173
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.879e-02 8.493e-04 -45.675 <2e-16 ***
## SentimentTitle 1.863e-01 3.745e-03 49.734 <2e-16 ***
## Facebook -1.349e-06 9.660e-07 -1.397 0.163
## GooglePlus -1.491e-05 3.306e-05 -0.451 0.652
## LinkedIn 3.933e-06 3.181e-06 1.236 0.216
## Topicmicrosoft 2.377e-02 1.360e-03 17.476 <2e-16 ***
## Topicobama 2.144e-02 1.267e-03 16.929 <2e-16 ***
## Topicpalestine -2.885e-03 1.871e-03 -1.542 0.123
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1389 on 73936 degrees of freedom
## Multiple R-squared: 0.04027, Adjusted R-squared: 0.04018
## F-statistic: 443.2 on 7 and 73936 DF, p-value: < 2.2e-16
Non-linearity of the Data
To assess the assumption of linearity, residuals were plotted against fitted values; systematic curvature in this plot would suggest a nonlinear relationship between the predictors and the response.
plot(residuals(model) ~ fitted(model), main="Residuals vs Fitted", xlab="Fitted values", ylab="Residuals")
abline(h=0, col="red")
Correlation of Error Terms
The autocorrelation function (ACF) of the residuals and the Durbin-Watson test were used to check for correlation among the error terms; independent errors should show little autocorrelation and a Durbin-Watson statistic close to 2.
acf(residuals(model))
durbinWatsonTest(model)
## lag Autocorrelation D-W Statistic p-value
## 1 0.002245619 1.995487 0.526
## Alternative hypothesis: rho != 0
Non-constant Variance of Error Terms (Heteroscedasticity)
A plot of residuals versus fitted values was examined for systematic patterns, such as a funnel shape, that would indicate non-constant variance of the error terms (heteroscedasticity).
plot(residuals(model) ~ fitted(model), main="Residuals vs Fitted for Heteroscedasticity", xlab="Fitted values", ylab="Residuals")
Outliers and High Leverage Points
Cook’s distance and residuals-versus-leverage plots were used to identify potential outliers and high leverage points. Such observations can exert a considerable influence on the regression coefficients and should be examined carefully.
plot(model, which = 4)
plot(model, which = 5)
Collinearity
The variance inflation factor (VIF) was computed to assess multicollinearity among the predictor variables. VIF values below 5 indicate no significant multicollinearity concerns.
vif(model)
## GVIF Df GVIF^(1/(2*Df))
## SentimentTitle 1.003079 1 1.001538
## Facebook 1.422718 1 1.192777
## GooglePlus 1.507001 1 1.227600
## LinkedIn 1.099393 1 1.048520
## Topic 1.046166 3 1.007550
Conclusion
In conclusion, Group 1 conducted a comprehensive analysis of the dataset and developed a linear regression model for predicting sentiment scores of news headlines. Further model diagnostics were performed to ensure the model’s validity and identify potential areas for improvement. The insights gained from this analysis can guide future enhancements to the model and provide valuable insights for sentiment analysis in the context of news headlines.