STA 631 PORTFOLIO

This portfolio demonstrates how I have met the STA 631 course objectives.

Executive Summary

This report presents a statistical modeling project that demonstrates the STA 631 course objectives: the application of probability theory, generalized linear models (GLMs), model selection, programming software (R), and effective communication of results to a general audience. The project aims to develop a predictive model for a specific data context, using linear regression as the primary statistical method. Key findings and insights from the analysis are presented, along with recommendations for further exploration and refinement.

Introduction

Statistical modeling is a powerful tool for analyzing and interpreting complex data sets, providing insights into relationships between variables and making predictions based on observed patterns. Probability theory forms the foundation of statistical modeling, allowing researchers to quantify uncertainty and make probabilistic inferences about parameters of interest. In this project, I explored the application of statistical modeling techniques to a real-world data set, with a focus on linear regression as a means of predicting a continuous outcome variable.

OBJECTIVE 1: Probability as a Foundation of Statistical Modeling

Probability theory plays a fundamental role in statistical modeling, underpinning the estimation of model parameters and the assessment of uncertainty. In our analysis, we leverage probability distributions to model the variability in the data and make probabilistic statements about the parameters of our linear regression model. Specifically, we assume that the errors follow a normal distribution with a mean of zero and constant variance, which is a common assumption in linear regression modeling.

Maximum likelihood estimation (MLE) is a key concept in statistical modeling: it estimates the parameters of a model by maximizing the likelihood function. In our project, we use the lm() function in R to fit the linear regression model to the data. lm() computes ordinary least squares estimates, which coincide with the maximum likelihood estimates of the regression coefficients when the errors are independent and normally distributed with constant variance. The resulting estimates are those under which the observed data are most probable, and they allow us to make predictions and draw inferences about the relationship between the predictor variables and the outcome variable.
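
The link between lm() and maximum likelihood can be checked numerically. The sketch below is illustrative only: it uses simulated data rather than the Medicare data, and the names x, y, neg_loglik, and mle_fit are hypothetical. It fits the same straight-line model once with lm() and once by minimizing the negative Gaussian log-likelihood with optim(), assuming independent normal errors with constant variance.

```r
# Simulated data (hypothetical; not the Medicare data)
set.seed(631)
n <- 200
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n, sd = 1.5)

# Least-squares fit, as returned by lm()
ols_fit <- lm(y ~ x)

# Negative Gaussian log-likelihood; par = (intercept, slope, log(sigma))
neg_loglik <- function(par) {
  mu <- par[1] + par[2] * x
  -sum(dnorm(y, mean = mu, sd = exp(par[3]), log = TRUE))
}

# Numerical maximum likelihood, started from a rough guess
mle_fit <- optim(c(mean(y), 0, log(sd(y))), neg_loglik, method = "BFGS")

# The two sets of coefficients should agree up to optimizer tolerance
rbind(ols = coef(ols_fit), mle = mle_fit$par[1:2])
```

The coefficient estimates should agree up to the optimizer's tolerance. The estimates of the error standard deviation differ slightly, because the maximum likelihood estimate divides the residual sum of squares by n, whereas summary() of an lm fit divides by the residual degrees of freedom.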

OBJECTIVE 2: Application of Generalized Linear Models (GLMs)

Generalized linear models (GLMs) provide a flexible framework for modeling a wide range of data types and response distributions. In our project, we focus on fitting a linear regression model, which is the special case of a GLM with a normally distributed response and an identity link function, and which is suitable for continuous outcome variables. GLMs also cover other modeling techniques, including logistic regression for binary outcomes and Poisson regression for count data.

The choice of the appropriate GLM depends on the nature of the data and the research question of interest. In our analysis, we carefully consider the characteristics of the outcome variable and select the linear regression model as the most appropriate approach for our data context. By understanding the principles of GLMs, we are able to effectively model the relationship between the predictor variables and the continuous outcome variable, providing valuable insights into the underlying processes.
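
To make the connection between linear regression and the GLM framework concrete, the following minimal sketch fits the same model with lm() and with glm() using a Gaussian family and identity link. The data are simulated and the variable names (x1, x2, y) are hypothetical; they are not the variables used in this project.

```r
# Simulated data (hypothetical variable names)
set.seed(631)
d <- data.frame(x1 = rnorm(100), x2 = runif(100))
d$y <- 1 + 2 * d$x1 - 0.5 * d$x2 + rnorm(100)

# Ordinary linear regression
lm_fit <- lm(y ~ x1 + x2, data = d)

# The same model expressed as a GLM: Gaussian response, identity link
glm_fit <- glm(y ~ x1 + x2, data = d, family = gaussian(link = "identity"))

# The coefficient estimates should be numerically equal
all.equal(coef(lm_fit), coef(glm_fit))
```

Changing the family argument (for example to binomial() or poisson()) is what moves the same glm() call to logistic or Poisson regression.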

OBJECTIVE 3: Model Selection

Model selection is a critical step in the statistical modeling process, involving the comparison of different candidate models to identify the most appropriate one based on criteria such as goodness of fit, complexity, and predictive performance. While our analysis focuses on fitting a single linear regression model, model selection remains an important consideration for future exploration.

In practice, model selection may involve fitting alternative models with different sets of predictor variables, exploring nonlinear relationships, or considering alternative distributional assumptions. By systematically comparing the performance of different models, researchers can identify the best-fitting model for their data context and draw more robust conclusions from the analysis.
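
A comparison of candidate models could look like the sketch below. It is illustrative only: it uses the built-in mtcars data rather than the Medicare data, and the three candidate formulas are arbitrary choices made for the example.

```r
# Candidate models on the built-in mtcars data (illustrative only)
m1 <- lm(mpg ~ wt, data = mtcars)
m2 <- lm(mpg ~ wt + hp, data = mtcars)
m3 <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)

# Information criterion: lower values indicate a better trade-off
# between goodness of fit and complexity
aic_table <- AIC(m1, m2, m3)
aic_table

# Adjusted R-squared for each candidate
adj_r2 <- sapply(list(m1 = m1, m2 = m2, m3 = m3),
                 function(m) summary(m)$adj.r.squared)
adj_r2

# F-test for the nested pair m1 vs. m2
anova(m1, m2)
```

AIC and adjusted R-squared both penalize extra predictors, while the F-test from anova() applies only to nested models.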

OBJECTIVE 4: Use of Programming Software (R)

Programming software such as R provides a powerful platform for conducting statistical analysis and building predictive models. In this project, I use R to fit and assess the linear regression model, relying on functions and packages designed for statistical modeling.

The lm() function in R is used to fit the linear regression model to the data, providing estimates of the regression coefficients and other relevant statistics. Additionally, we utilize various R packages for data manipulation, visualization, and model diagnostics, enhancing the efficiency and reproducibility of our analysis.

OBJECTIVE 5: Communication of Results to a General Audience

Effective communication of statistical results is essential for ensuring that the findings are accessible and actionable to a general audience. In this project, I employ clear and concise language to describe the key findings and insights from the analysis, avoiding technical jargon and providing intuitive explanations of complex concepts.

Visualizations such as scatter plots, regression diagnostics, and summary tables are used to illustrate the relationships between variables and summarize the main results of the analysis. Additionally, I provide interpretations of the regression coefficients and discuss the practical implications of the findings in the context of the research question.

Analysis of Medicare Hospital Payments Project

1. Introduction: The aim of this project was to analyze Medicare hospital payment data and develop a predictive model to estimate the total payment amount based on various factors. The dataset used for analysis contains information on hospital payments, Medicare payments, and demographic variables.

Libraries

library(tidymodels)
library(tidyverse)
library(dplyr)
library(lmtest)
library(car)

Load Data

medicare_provider <- read.csv("Medicare_Inpatient_Hospital_by_Provider_and_Service_2018_data.csv")
hosp_gen_info <- read.csv("Hospital General Information.csv")
census <- read.csv("ACSDT5Y2017.B16010-Data.csv")

2. Data Preprocessing:

The dataset was first cleaned to handle missing values and outliers. Relevant variables were selected for analysis, including average Medicare payment amount, demographic variables, and hospital payment information. Correlation analysis was performed to identify highly correlated variables, and redundant variables were dropped to avoid multicollinearity.

Data Modification

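# Modifying the census data set
# Drop the first row, which in this download holds the column labels rather than data
census <- census[-1,]
# Keep only the characters of GEO_ID from position 10 onward so that it can be matched against ZIP.Code
census$GEO_ID <- substr(census$GEO_ID, 10, nchar(census$GEO_ID))
# str(census) can then be used to print a concise summary of the census dataset, showing the variable names, data types, and the first few observations of each variable.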

# Modifying Hospital General Information
hosp_gen_info$ZIP.Code <- as.character(hosp_gen_info$ZIP.Code)
hosp_gen_info$ZIP.Code <- str_pad(hosp_gen_info$ZIP.Code, width = 5, side = "left", pad = "0")

#Here we modify the ZIP.Code column in the hosp_gen_info data frame by ensuring that all values are five characters long with leading zeroes if necessary.
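
The padding step above can be checked on a few values. This is a minimal sketch using made-up ZIP codes and base R's sprintf(); it assumes the codes were imported as integers, which is how leading zeros are typically lost.

```r
# Hypothetical ZIP codes as they look after being read as integers
zip_int <- c(36301L, 2115L, 501L)

# Left-pad to five characters with zeros
zip_chr <- sprintf("%05d", zip_int)
zip_chr

# Every padded code should now be exactly five characters long
all(nchar(zip_chr) == 5)
```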

# Modifying medicare_provider Data
medicare_provider <- medicare_provider %>% mutate(Rndrng_Prvdr_CCN = as.character(Rndrng_Prvdr_CCN))
hosp_gen_info <- hosp_gen_info %>% mutate(Provider.ID = as.character(Provider.ID))
# Here we convert the Rndrng_Prvdr_CCN and Provider.ID columns in the medicare_provider and hosp_gen_info data frames, respectively, from numeric to character. Both columns are join keys, and converting them ensures they have the same type when the data frames are merged.

Data Merging

med_hos_census <- medicare_provider %>%
  select(Rndrng_Prvdr_CCN, Rndrng_Prvdr_State_Abrvtn, Rndrng_Prvdr_State_FIPS, Rndrng_Prvdr_Zip5, Rndrng_Prvdr_RUCA, DRG_Cd, Tot_Dschrgs,
        Avg_Submtd_Cvrd_Chrg, Avg_Tot_Pymt_Amt, Avg_Mdcr_Pymt_Amt) %>%
  inner_join(hosp_gen_info, by=c("Rndrng_Prvdr_CCN" = "Provider.ID")) %>%
  select(Rndrng_Prvdr_State_Abrvtn, Rndrng_Prvdr_State_FIPS, Rndrng_Prvdr_Zip5, Rndrng_Prvdr_RUCA, DRG_Cd, Tot_Dschrgs,
        Avg_Submtd_Cvrd_Chrg, Avg_Tot_Pymt_Amt, Avg_Mdcr_Pymt_Amt, ZIP.Code, State, Hospital.Ownership, Hospital.overall.rating) %>%
  inner_join(census, by=c("ZIP.Code" = "GEO_ID")) %>% 
  select(-NAME)
#Here we merge data from multiple data frames to create a new data frame called med_hos_census that contains information about Medicare providers, hospitals, and census data based on their zip codes.
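
Because an inner join silently drops rows whose key has no match, it can be useful to check the keys before merging. The sketch below uses two tiny made-up tables (the values are hypothetical) and base R's setdiff() and merge() to show the idea.

```r
# Hypothetical miniature versions of the two tables
providers <- data.frame(Rndrng_Prvdr_CCN = c("10001", "10005", "99999"),
                        Avg_Tot_Pymt_Amt = c(81541, 29062, 22442))
hospitals <- data.frame(Provider.ID = c("10001", "10005", "10006"),
                        ZIP.Code = c("36301", "35957", "35631"))

# Keys present in providers but absent from hospitals are dropped by an inner join
unmatched <- setdiff(providers$Rndrng_Prvdr_CCN, hospitals$Provider.ID)
unmatched

# Base R equivalent of the inner join, to compare row counts before and after
joined <- merge(providers, hospitals,
                by.x = "Rndrng_Prvdr_CCN", by.y = "Provider.ID")
c(before = nrow(providers), after = nrow(joined))
```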

Determining the number of ownership types of hospitals

sort(table(med_hos_census$Hospital.Ownership), decreasing = TRUE)
## 
##              Voluntary non-profit - Private 
##                                       93321 
##                                 Proprietary 
##                                       27809 
##               Voluntary non-profit - Church 
##                                       17382 
##                Voluntary non-profit - Other 
##                                       17026 
## Government - Hospital District or Authority 
##                                       11765 
##                          Government - Local 
##                                        5397 
##                          Government - State 
##                                        2954 
##                                   Physician 
##                                         878 
##                        Government - Federal 
##                                         521 
##                                      Tribal 
##                                          19
# Here we try to determine information about the ownership types of hospitals in the med_hos_census data frame, by counting the number of occurrences of each unique value and sorting them in descending order.


#Converting the Hospital.Ownership column in the med_hos_census data frame to a factor and then to a numeric variable
med_hos_census <- med_hos_census %>% mutate(Hospital.Ownership = factor(Hospital.Ownership)) %>%
  mutate(Hospital.Ownership = as.numeric(Hospital.Ownership))
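
The effect of this conversion can be seen on a small example. The sketch below uses a hypothetical subset of ownership labels: the numeric codes are simply the positions of the alphabetically sorted factor levels, so they act as arbitrary labels rather than a meaningful scale, and a lookup table is needed to translate them back.

```r
# Hypothetical subset of ownership labels
ownership <- c("Proprietary", "Tribal", "Government - Federal", "Proprietary")

f <- factor(ownership)
levels(f)                 # levels are sorted alphabetically
codes <- as.numeric(f)    # position of each value's level
codes

# Keep a lookup table so the codes can be translated back to labels
lookup <- data.frame(code = seq_along(levels(f)), label = levels(f))
lookup
```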

Removing unnecessary variables

med_hos_census <- med_hos_census %>%
  select(-ZIP.Code, -Rndrng_Prvdr_State_Abrvtn)
# Removing the ZIP.Code and Rndrng_Prvdr_State_Abrvtn columns from the med_hos_census data frame

Removing the missing/duplicate observations

# Keep only the rows with an available overall rating
med_hos_census <- med_hos_census %>%
  filter(!str_detect(Hospital.overall.rating, "Not Available"))

med_hos_census <- med_hos_census %>%
  distinct()

Changing the datatype of columns 10 to the last from character to numeric

i <- 10:ncol(med_hos_census)
med_hos_census[, i] <- apply(med_hos_census[, i], 2, as.numeric)
str(med_hos_census)
## 'data.frame':    176095 obs. of  117 variables:
##  $ Rndrng_Prvdr_State_FIPS: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Rndrng_Prvdr_Zip5      : int  36301 36301 36301 36301 36301 36301 36301 36301 36301 36301 ...
##  $ Rndrng_Prvdr_RUCA      : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ DRG_Cd                 : int  3 23 25 38 39 57 64 65 66 69 ...
##  $ Tot_Dschrgs            : int  13 33 26 11 64 30 115 107 17 53 ...
##  $ Avg_Submtd_Cvrd_Chrg   : num  368434 148677 118718 74449 46628 ...
##  $ Avg_Tot_Pymt_Amt       : num  81541 29062 22442 9546 6468 ...
##  $ Avg_Mdcr_Pymt_Amt      : num  80435 27997 19592 7562 5073 ...
##  $ State                  : chr  "AL" "AL" "AL" "AL" ...
##  $ Hospital.Ownership     : num  2 2 2 2 2 2 2 2 2 2 ...
##  $ Hospital.overall.rating: num  3 3 3 3 3 3 3 3 3 3 ...
##  $ B16010_001E            : num  25008 25008 25008 25008 25008 ...
##  $ B16010_001M            : num  426 426 426 426 426 426 426 426 426 426 ...
##  $ B16010_002E            : num  3926 3926 3926 3926 3926 ...
##  $ B16010_002M            : num  237 237 237 237 237 237 237 237 237 237 ...
##  $ B16010_003E            : num  1587 1587 1587 1587 1587 ...
##  $ B16010_003M            : num  186 186 186 186 186 186 186 186 186 186 ...
##  $ B16010_004E            : num  1447 1447 1447 1447 1447 ...
##  $ B16010_004M            : num  171 171 171 171 171 171 171 171 171 171 ...
##  $ B16010_005E            : num  123 123 123 123 123 123 123 123 123 123 ...
##  $ B16010_005M            : num  70 70 70 70 70 70 70 70 70 70 ...
##  $ B16010_006E            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ B16010_006M            : num  24 24 24 24 24 24 24 24 24 24 ...
##  $ B16010_007E            : num  15 15 15 15 15 15 15 15 15 15 ...
##  $ B16010_007M            : num  13 13 13 13 13 13 13 13 13 13 ...
##  $ B16010_008E            : num  2 2 2 2 2 2 2 2 2 2 ...
##  $ B16010_008M            : num  3 3 3 3 3 3 3 3 3 3 ...
##  $ B16010_009E            : num  2339 2339 2339 2339 2339 ...
##  $ B16010_009M            : num  173 173 173 173 173 173 173 173 173 173 ...
##  $ B16010_010E            : num  2218 2218 2218 2218 2218 ...
##  $ B16010_010M            : num  173 173 173 173 173 173 173 173 173 173 ...
##  $ B16010_011E            : num  92 92 92 92 92 92 92 92 92 92 ...
##  $ B16010_011M            : num  39 39 39 39 39 39 39 39 39 39 ...
##  $ B16010_012E            : num  5 5 5 5 5 5 5 5 5 5 ...
##  $ B16010_012M            : num  6 6 6 6 6 6 6 6 6 6 ...
##  $ B16010_013E            : num  22 22 22 22 22 22 22 22 22 22 ...
##  $ B16010_013M            : num  21 21 21 21 21 21 21 21 21 21 ...
##  $ B16010_014E            : num  2 2 2 2 2 2 2 2 2 2 ...
##  $ B16010_014M            : num  3 3 3 3 3 3 3 3 3 3 ...
##  $ B16010_015E            : num  8788 8788 8788 8788 8788 ...
##  $ B16010_015M            : num  391 391 391 391 391 391 391 391 391 391 ...
##  $ B16010_016E            : num  4916 4916 4916 4916 4916 ...
##  $ B16010_016M            : num  298 298 298 298 298 298 298 298 298 298 ...
##  $ B16010_017E            : num  4712 4712 4712 4712 4712 ...
##  $ B16010_017M            : num  287 287 287 287 287 287 287 287 287 287 ...
##  $ B16010_018E            : num  170 170 170 170 170 170 170 170 170 170 ...
##  $ B16010_018M            : num  70 70 70 70 70 70 70 70 70 70 ...
##  $ B16010_019E            : num  4 4 4 4 4 4 4 4 4 4 ...
##  $ B16010_019M            : num  6 6 6 6 6 6 6 6 6 6 ...
##  $ B16010_020E            : num  30 30 30 30 30 30 30 30 30 30 ...
##  $ B16010_020M            : num  28 28 28 28 28 28 28 28 28 28 ...
##  $ B16010_021E            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ B16010_021M            : num  24 24 24 24 24 24 24 24 24 24 ...
##  $ B16010_022E            : num  3872 3872 3872 3872 3872 ...
##  $ B16010_022M            : num  262 262 262 262 262 262 262 262 262 262 ...
##  $ B16010_023E            : num  3838 3838 3838 3838 3838 ...
##  $ B16010_023M            : num  261 261 261 261 261 261 261 261 261 261 ...
##  $ B16010_024E            : num  9 9 9 9 9 9 9 9 9 9 ...
##  $ B16010_024M            : num  8 8 8 8 8 8 8 8 8 8 ...
##  $ B16010_025E            : num  11 11 11 11 11 11 11 11 11 11 ...
##  $ B16010_025M            : num  12 12 12 12 12 12 12 12 12 12 ...
##  $ B16010_026E            : num  14 14 14 14 14 14 14 14 14 14 ...
##  $ B16010_026M            : num  14 14 14 14 14 14 14 14 14 14 ...
##  $ B16010_027E            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ B16010_027M            : num  24 24 24 24 24 24 24 24 24 24 ...
##  $ B16010_028E            : num  8106 8106 8106 8106 8106 ...
##  $ B16010_028M            : num  390 390 390 390 390 390 390 390 390 390 ...
##  $ B16010_029E            : num  5354 5354 5354 5354 5354 ...
##  $ B16010_029M            : num  314 314 314 314 314 314 314 314 314 314 ...
##  $ B16010_030E            : num  5185 5185 5185 5185 5185 ...
##  $ B16010_030M            : num  311 311 311 311 311 311 311 311 311 311 ...
##  $ B16010_031E            : num  137 137 137 137 137 137 137 137 137 137 ...
##  $ B16010_031M            : num  61 61 61 61 61 61 61 61 61 61 ...
##  $ B16010_032E            : num  14 14 14 14 14 14 14 14 14 14 ...
##  $ B16010_032M            : num  8 8 8 8 8 8 8 8 8 8 ...
##  $ B16010_033E            : num  15 15 15 15 15 15 15 15 15 15 ...
##  $ B16010_033M            : num  10 10 10 10 10 10 10 10 10 10 ...
##  $ B16010_034E            : num  3 3 3 3 3 3 3 3 3 3 ...
##  $ B16010_034M            : num  5 5 5 5 5 5 5 5 5 5 ...
##  $ B16010_035E            : num  2752 2752 2752 2752 2752 ...
##  $ B16010_035M            : num  190 190 190 190 190 190 190 190 190 190 ...
##  $ B16010_036E            : num  2687 2687 2687 2687 2687 ...
##  $ B16010_036M            : num  182 182 182 182 182 182 182 182 182 182 ...
##  $ B16010_037E            : num  21 21 21 21 21 21 21 21 21 21 ...
##  $ B16010_037M            : num  13 13 13 13 13 13 13 13 13 13 ...
##  $ B16010_038E            : num  35 35 35 35 35 35 35 35 35 35 ...
##  $ B16010_038M            : num  22 22 22 22 22 22 22 22 22 22 ...
##  $ B16010_039E            : num  9 9 9 9 9 9 9 9 9 9 ...
##  $ B16010_039M            : num  9 9 9 9 9 9 9 9 9 9 ...
##  $ B16010_040E            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ B16010_040M            : num  24 24 24 24 24 24 24 24 24 24 ...
##  $ B16010_041E            : num  4188 4188 4188 4188 4188 ...
##  $ B16010_041M            : num  247 247 247 247 247 247 247 247 247 247 ...
##  $ B16010_042E            : num  2933 2933 2933 2933 2933 ...
##  $ B16010_042M            : num  217 217 217 217 217 217 217 217 217 217 ...
##  $ B16010_043E            : num  2851 2851 2851 2851 2851 ...
##  $ B16010_043M            : num  213 213 213 213 213 213 213 213 213 213 ...
##  $ B16010_044E            : num  31 31 31 31 31 31 31 31 31 31 ...
##  $ B16010_044M            : num  12 12 12 12 12 12 12 12 12 12 ...
##   [list output truncated]
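
When character columns are converted with as.numeric(), any entry that is not a valid number becomes NA. The sketch below shows one way to count such values after the conversion; the toy data frame and its non-numeric placeholders are hypothetical.

```r
# Hypothetical character columns with a few non-numeric entries
toy <- data.frame(B16010_001E = c("25008", "1310", "-"),
                  B16010_001M = c("426", "**", "98"),
                  stringsAsFactors = FALSE)

# Convert every column; entries that are not numbers become NA (with a warning)
toy_num <- suppressWarnings(as.data.frame(lapply(toy, as.numeric)))

# Count the NAs introduced per column
na_counts <- colSums(is.na(toy_num))
na_counts
```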

Filtering the values for Acute Myocardial Infarction (AMI)

med_hos_census <- med_hos_census[med_hos_census$DRG_Cd %in% c("280", "281", "282", "246", "247", "248", "249", "250", "251", "252"), ]

Exploratory Data Analysis

# install.packages("maps")
library(maps)

# Group data by state and find the average amount for each state
avg_amounts <- med_hos_census %>%
  group_by(State) %>%
  summarise(avg_amount = mean(Avg_Tot_Pymt_Amt))


# use state.abb and state.name to match abbreviations to full names, and replace missing or unrecognized abbreviations with "Washington, DC"
avg_amounts$State <- ifelse(avg_amounts$State %in% state.abb, state.name[match(avg_amounts$State, state.abb)], "Washington, DC")

# Print the result
print(avg_amounts$State)
##  [1] "Alaska"         "Alabama"        "Arkansas"       "Arizona"       
##  [5] "California"     "Colorado"       "Connecticut"    "Washington, DC"
##  [9] "Delaware"       "Florida"        "Georgia"        "Hawaii"        
## [13] "Iowa"           "Idaho"          "Illinois"       "Indiana"       
## [17] "Kansas"         "Kentucky"       "Louisiana"      "Massachusetts" 
## [21] "Maryland"       "Maine"          "Michigan"       "Minnesota"     
## [25] "Missouri"       "Mississippi"    "Montana"        "North Carolina"
## [29] "North Dakota"   "Nebraska"       "New Hampshire"  "New Jersey"    
## [33] "New Mexico"     "Nevada"         "New York"       "Ohio"          
## [37] "Oklahoma"       "Oregon"         "Pennsylvania"   "Rhode Island"  
## [41] "South Carolina" "South Dakota"   "Tennessee"      "Texas"         
## [45] "Utah"           "Virginia"       "Vermont"        "Washington"    
## [49] "Wisconsin"      "West Virginia"  "Wyoming"

Data Visualization

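# Convert the state names to lower case to match the region names in map_data("state")
avg_amounts$State <- tolower(avg_amounts$State)
# map_data("state") labels DC as "district of columbia", so recode it before joining
avg_amounts$State[avg_amounts$State == "washington, dc"] <- "district of columbia"
# Join the map data and the avg_amounts data
map_df <- left_join(map_data("state"), avg_amounts, by = c("region" = "State"))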

# Create the heatmap using ggplot2
ggplot(data = map_df, aes(x = long, y = lat, group = group, fill = avg_amount)) +
  geom_polygon() +
  scale_fill_gradient(low = "white", high = "red", name = "Average Amount") +
  theme_void()

Variable Selection

# Dropping the State column, which is a character variable, so that only numeric variables remain for the correlation analysis
med_hos_census <- med_hos_census %>%
  select(-State)

Correlation

# Computing the correlation of every other variable with Avg_Tot_Pymt_Amt
predictors <- setdiff(names(med_hos_census), "Avg_Tot_Pymt_Amt")
correlations <- cor(med_hos_census[, predictors], med_hos_census$Avg_Tot_Pymt_Amt)

# Selecting the N variables with the highest absolute correlations, sorted in descending order
N <- 10
top_vars <- rownames(correlations)[order(abs(correlations), decreasing = TRUE)[1:N]]

abs(correlations) > 0.11
##                          [,1]
## Rndrng_Prvdr_State_FIPS FALSE
## Rndrng_Prvdr_Zip5       FALSE
## Rndrng_Prvdr_RUCA       FALSE
## DRG_Cd                   TRUE
## Tot_Dschrgs             FALSE
## Avg_Submtd_Cvrd_Chrg     TRUE
## Avg_Mdcr_Pymt_Amt        TRUE
## Hospital.Ownership      FALSE
## Hospital.overall.rating FALSE
## B16010_001E             FALSE
## B16010_001M             FALSE
## B16010_002E             FALSE
## B16010_002M             FALSE
## B16010_003E             FALSE
## B16010_003M             FALSE
## B16010_004E             FALSE
## B16010_004M             FALSE
## B16010_005E             FALSE
## B16010_005M             FALSE
## B16010_006E             FALSE
## B16010_006M             FALSE
## B16010_007E             FALSE
## B16010_007M             FALSE
## B16010_008E             FALSE
## B16010_008M             FALSE
## B16010_009E             FALSE
## B16010_009M             FALSE
## B16010_010E             FALSE
## B16010_010M             FALSE
## B16010_011E             FALSE
## B16010_011M              TRUE
## B16010_012E             FALSE
## B16010_012M             FALSE
## B16010_013E              TRUE
## B16010_013M              TRUE
## B16010_014E             FALSE
## B16010_014M             FALSE
## B16010_015E             FALSE
## B16010_015M             FALSE
## B16010_016E             FALSE
## B16010_016M             FALSE
## B16010_017E              TRUE
## B16010_017M             FALSE
## B16010_018E             FALSE
## B16010_018M             FALSE
## B16010_019E             FALSE
## B16010_019M             FALSE
## B16010_020E             FALSE
## B16010_020M              TRUE
## B16010_021E             FALSE
## B16010_021M             FALSE
## B16010_022E             FALSE
## B16010_022M             FALSE
## B16010_023E              TRUE
## B16010_023M              TRUE
## B16010_024E             FALSE
## B16010_024M             FALSE
## B16010_025E             FALSE
## B16010_025M             FALSE
## B16010_026E              TRUE
## B16010_026M              TRUE
## B16010_027E             FALSE
## B16010_027M             FALSE
## B16010_028E             FALSE
## B16010_028M             FALSE
## B16010_029E             FALSE
## B16010_029M             FALSE
## B16010_030E             FALSE
## B16010_030M             FALSE
## B16010_031E             FALSE
## B16010_031M             FALSE
## B16010_032E             FALSE
## B16010_032M             FALSE
## B16010_033E              TRUE
## B16010_033M              TRUE
## B16010_034E             FALSE
## B16010_034M             FALSE
## B16010_035E             FALSE
## B16010_035M             FALSE
## B16010_036E             FALSE
## B16010_036M             FALSE
## B16010_037E             FALSE
## B16010_037M             FALSE
## B16010_038E             FALSE
## B16010_038M             FALSE
## B16010_039E              TRUE
## B16010_039M              TRUE
## B16010_040E             FALSE
## B16010_040M             FALSE
## B16010_041E             FALSE
## B16010_041M             FALSE
## B16010_042E             FALSE
## B16010_042M             FALSE
## B16010_043E             FALSE
## B16010_043M             FALSE
## B16010_044E             FALSE
## B16010_044M              TRUE
## B16010_045E              TRUE
## B16010_045M              TRUE
## B16010_046E              TRUE
## B16010_046M              TRUE
## B16010_047E             FALSE
## B16010_047M              TRUE
## B16010_048E             FALSE
## B16010_048M             FALSE
## B16010_049E             FALSE
## B16010_049M             FALSE
## B16010_050E             FALSE
## B16010_050M             FALSE
## B16010_051E             FALSE
## B16010_051M              TRUE
## B16010_052E              TRUE
## B16010_052M              TRUE
## B16010_053E             FALSE
## B16010_053M             FALSE
# Flagging the variables whose absolute correlation with Avg_Tot_Pymt_Amt exceeds the chosen threshold (0.11 in our case); the remaining correlations are treated as weak.
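
The names of the variables that pass the threshold can also be extracted programmatically instead of being read off the printed table. The following is a minimal sketch on simulated data; the data frame sim and its columns are hypothetical stand-ins for med_hos_census.

```r
set.seed(631)
# Simulated stand-in for the modeling data (hypothetical variable names)
sim <- data.frame(x1 = rnorm(500), x2 = rnorm(500), x3 = rnorm(500))
sim$y <- 2 * sim$x1 + 0.5 * sim$x2 + rnorm(500)

# Correlate every predictor with the response; cor() returns a one-column matrix
predictors <- setdiff(names(sim), "y")
cors <- cor(sim[, predictors], sim$y)

# Names of the predictors whose absolute correlation exceeds the threshold
threshold <- 0.11
selected <- rownames(cors)[abs(cors[, 1]) > threshold]
selected
```

Because cor() returns a matrix here, the variable names are stored as row names, which is why rownames() is used rather than names().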

Visualizing the correlations

library(ggplot2)

# Select top correlated variables
top_vars <- c("DRG_Cd", "Avg_Submtd_Cvrd_Chrg", "Avg_Mdcr_Pymt_Amt", 
"B16010_011M", "B16010_013E", "B16010_013M", "B16010_017E", "B16010_020M", 
"B16010_023E", "B16010_023M", "B16010_026E", "B16010_026M", "B16010_033E", 
"B16010_033M", "B16010_039E", "B16010_039M", "B16010_044M", "B16010_045E", 
"B16010_045M", "B16010_046E", "B16010_046M", "B16010_047M", "B16010_051M", 
"B16010_052E", "B16010_052M")

# Create scatter plots for top correlated variables
for (var in top_vars) {
  print(ggplot(med_hos_census, aes(x = Avg_Tot_Pymt_Amt, y = .data[[var]])) +
    geom_point() +
    labs(title = paste0("Scatter plot of ", var, " vs. Avg_Tot_Pymt_Amt"),
        x = "Avg_Tot_Pymt_Amt",
        y = var))
}

Checking the correlation between the top variables.

# Calculate correlation matrix for top variables
corr_matrix <- cor(med_hos_census[, top_vars])

# Reshape correlation matrix
corr_df <- reshape2::melt(corr_matrix)

# Create heatmap of correlation matrix
ggplot(corr_df, aes(Var2, Var1, fill = value)) +
  geom_tile() +
  scale_fill_gradient(low = "blue", high = "red") +
  labs(title = "Correlation heatmap of top variables",
    x = "Variable",
    y = "Variable",
    fill = "Correlation")
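
To identify redundant predictors from a correlation matrix, the pairs with a high absolute correlation can be listed explicitly. The sketch below does this on simulated data; the variable names and the cutoff of 0.8 are assumptions made for the example, not values taken from this analysis.

```r
set.seed(631)
# Simulated predictors (hypothetical names); b is nearly a copy of a
a <- rnorm(300)
sim <- data.frame(a = a, b = a + rnorm(300, sd = 0.1), c = rnorm(300))

cm <- cor(sim)

# Look only at the upper triangle so that each pair is reported once
idx <- which(abs(cm) > 0.8 & upper.tri(cm), arr.ind = TRUE)
high_pairs <- data.frame(var1 = rownames(cm)[idx[, "row"]],
                         var2 = colnames(cm)[idx[, "col"]],
                         r    = cm[idx])
high_pairs
```

One variable from each listed pair is then a candidate for removal before fitting the model.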

corr_matrix
##                             DRG_Cd Avg_Submtd_Cvrd_Chrg Avg_Mdcr_Pymt_Amt
## DRG_Cd                1.0000000000          -0.59504752        -0.6843227
## Avg_Submtd_Cvrd_Chrg -0.5950475186           1.00000000         0.6303394
## Avg_Mdcr_Pymt_Amt    -0.6843226998           0.63033937         1.0000000
## B16010_011M          -0.0093420444           0.19526403         0.1239541
## B16010_013E          -0.0058823638           0.05009825         0.1204442
## B16010_013M          -0.0184633360           0.11127208         0.1670829
## B16010_017E           0.0175233248          -0.10244745        -0.1012966
## B16010_020M          -0.0154880304           0.12944555         0.1228202
## B16010_023E           0.0271551045          -0.10141983        -0.1160702
## B16010_023M           0.0156219500          -0.04153485        -0.1024816
## B16010_026E          -0.0105518003           0.08227392         0.1255242
## B16010_026M          -0.0213639012           0.13328684         0.1528921
## B16010_033E          -0.0148287598           0.13377558         0.1357005
## B16010_033M          -0.0233867817           0.16023463         0.1363107
## B16010_039E          -0.0122097734           0.12600861         0.1354829
## B16010_039M          -0.0129539062           0.15360731         0.1494660
## B16010_044M          -0.0209400804           0.19836964         0.1209962
## B16010_045E           0.0002387518           0.06715876         0.1227760
## B16010_045M          -0.0179498055           0.10300747         0.1419226
## B16010_046E          -0.0206744271           0.13451410         0.1575087
## B16010_046M          -0.0311512626           0.16356878         0.1805934
## B16010_047M          -0.0193404900           0.05494069         0.1128502
## B16010_051M          -0.0130266533           0.10003926         0.1139435
## B16010_052E          -0.0223397521           0.15270477         0.1627974
## B16010_052M          -0.0281922848           0.17220592         0.1838104
##                       B16010_011M  B16010_013E  B16010_013M  B16010_017E
## DRG_Cd               -0.009342044 -0.005882364 -0.018463336  0.017523325
## Avg_Submtd_Cvrd_Chrg  0.195264027  0.050098252  0.111272078 -0.102447446
## Avg_Mdcr_Pymt_Amt     0.123954069  0.120444173  0.167082901 -0.101296638
## B16010_011M           1.000000000  0.220832062  0.348507648  0.079082277
## B16010_013E           0.220832062  1.000000000  0.802401796 -0.032519065
## B16010_013M           0.348507648  0.802401796  1.000000000  0.005674263
## B16010_017E           0.079082277 -0.032519065  0.005674263  1.000000000
## B16010_020M           0.347296803  0.680972086  0.720300816  0.095929149
## B16010_023E           0.025020734 -0.025247640 -0.021101275  0.851338726
## B16010_023M           0.152588385  0.013953141  0.069242694  0.743115601
## B16010_026E           0.228112158  0.911413965  0.747122867 -0.033544538
## B16010_026M           0.316182070  0.687411724  0.755955695  0.011342284
## B16010_033E           0.266717371  0.812131544  0.748912319 -0.012622458
## B16010_033M           0.322815409  0.619535923  0.713053042  0.040769347
## B16010_039E           0.245048815  0.825327604  0.725892046 -0.038938814
## B16010_039M           0.321924554  0.610399829  0.719529460 -0.000814911
## B16010_044M           0.654290276  0.177643171  0.326609617 -0.045663101
## B16010_045E           0.139610954  0.285631386  0.420131032 -0.098509503
## B16010_045M           0.195747560  0.278171478  0.440203378 -0.061217713
## B16010_046E           0.191190234  0.545553663  0.636534478 -0.119871077
## B16010_046M           0.256476155  0.439678753  0.630355905 -0.109811963
## B16010_047M           0.204974283  0.161054192  0.305060970  0.004850307
## B16010_051M           0.160318997  0.278894193  0.399099371 -0.067289427
## B16010_052E           0.177080659  0.558934703  0.627133041 -0.127124843
## B16010_052M           0.226799121  0.411043015  0.590723886 -0.109527128
##                      B16010_020M B16010_023E  B16010_023M  B16010_026E
## DRG_Cd               -0.01548803  0.02715510  0.015621950 -0.010551800
## Avg_Submtd_Cvrd_Chrg  0.12944555 -0.10141983 -0.041534849  0.082273923
## Avg_Mdcr_Pymt_Amt     0.12282018 -0.11607018 -0.102481580  0.125524178
## B16010_011M           0.34729680  0.02502073  0.152588385  0.228112158
## B16010_013E           0.68097209 -0.02524764  0.013953141  0.911413965
## B16010_013M           0.72030082 -0.02110128  0.069242694  0.747122867
## B16010_017E           0.09592915  0.85133873  0.743115601 -0.033544538
## B16010_020M           1.00000000  0.02944421  0.118007342  0.709129670
## B16010_023E           0.02944421  1.00000000  0.840107594 -0.036207237
## B16010_023M           0.11800734  0.84010759  1.000000000  0.008247335
## B16010_026E           0.70912967 -0.03620724  0.008247335  1.000000000
## B16010_026M           0.74401109 -0.02077294  0.069987037  0.833830765
## B16010_033E           0.73515912 -0.04465949  0.011980065  0.880398985
## B16010_033M           0.74618785 -0.02837401  0.079092166  0.693686921
## B16010_039E           0.68710095 -0.05505210  0.002266774  0.922787851
## B16010_039M           0.67235566 -0.03717064  0.079207771  0.714824263
## B16010_044M           0.35239384 -0.11557204  0.065402059  0.207343564
## B16010_045E           0.34513295 -0.10734224 -0.045640518  0.320592778
## B16010_045M           0.40274012 -0.10670798  0.012812507  0.320799883
## B16010_046E           0.57712643 -0.13993746 -0.070764215  0.660391596
## B16010_046M           0.58894715 -0.15666498 -0.030636220  0.537631680
## B16010_047M           0.31302259 -0.06441251  0.056306141  0.192017714
## B16010_051M           0.35628314 -0.05471781  0.051655225  0.316007002
## B16010_052E           0.56550055 -0.14186545 -0.081527991  0.684808396
## B16010_052M           0.51316140 -0.13556540 -0.031345269  0.502989932
##                      B16010_026M B16010_033E B16010_033M  B16010_039E
## DRG_Cd               -0.02136390 -0.01482876 -0.02338678 -0.012209773
## Avg_Submtd_Cvrd_Chrg  0.13328684  0.13377558  0.16023463  0.126008608
## Avg_Mdcr_Pymt_Amt     0.15289210  0.13570048  0.13631073  0.135482911
## B16010_011M           0.31618207  0.26671737  0.32281541  0.245048815
## B16010_013E           0.68741172  0.81213154  0.61953592  0.825327604
## B16010_013M           0.75595569  0.74891232  0.71305304  0.725892046
## B16010_017E           0.01134228 -0.01262246  0.04076935 -0.038938814
## B16010_020M           0.74401109  0.73515912  0.74618785  0.687100954
## B16010_023E          -0.02077294 -0.04465949 -0.02837401 -0.055052096
## B16010_023M           0.06998704  0.01198006  0.07909217  0.002266774
## B16010_026E           0.83383076  0.88039898  0.69368692  0.922787851
## B16010_026M           1.00000000  0.77201952  0.74645575  0.776211458
## B16010_033E           0.77201952  1.00000000  0.84481632  0.922057177
## B16010_033M           0.74645575  0.84481632  1.00000000  0.733041903
## B16010_039E           0.77621146  0.92205718  0.73304190  1.000000000
## B16010_039M           0.76152568  0.75193466  0.74518274  0.841020654
## B16010_044M           0.33491891  0.26314752  0.39693371  0.239489522
## B16010_045E           0.44087304  0.33261400  0.38720305  0.337120048
## B16010_045M           0.46710813  0.35657069  0.46850172  0.360386702
## B16010_046E           0.71162877  0.73980354  0.67257932  0.733807208
## B16010_046M           0.68653649  0.60600486  0.68393399  0.592229111
## B16010_047M           0.34353988  0.23793058  0.34845687  0.228496959
## B16010_051M           0.42197305  0.31872011  0.41375714  0.335149231
## B16010_052E           0.71279312  0.74482026  0.66630476  0.759122695
## B16010_052M           0.62267860  0.55979085  0.61624671  0.562627257
##                       B16010_039M B16010_044M   B16010_045E B16010_045M
## DRG_Cd               -0.012953906 -0.02094008  0.0002387518 -0.01794981
## Avg_Submtd_Cvrd_Chrg  0.153607309  0.19836964  0.0671587616  0.10300747
## Avg_Mdcr_Pymt_Amt     0.149465992  0.12099621  0.1227760445  0.14192264
## B16010_011M           0.321924554  0.65429028  0.1396109538  0.19574756
## B16010_013E           0.610399829  0.17764317  0.2856313857  0.27817148
## B16010_013M           0.719529460  0.32660962  0.4201310319  0.44020338
## B16010_017E          -0.000814911 -0.04566310 -0.0985095033 -0.06121771
## B16010_020M           0.672355665  0.35239384  0.3451329475  0.40274012
## B16010_023E          -0.037170642 -0.11557204 -0.1073422391 -0.10670798
## B16010_023M           0.079207771  0.06540206 -0.0456405176  0.01281251
## B16010_026E           0.714824263  0.20734356  0.3205927783  0.32079988
## B16010_026M           0.761525679  0.33491891  0.4408730367  0.46710813
## B16010_033E           0.751934665  0.26314752  0.3326140047  0.35657069
## B16010_033M           0.745182744  0.39693371  0.3872030528  0.46850172
## B16010_039E           0.841020654  0.23948952  0.3371200483  0.36038670
## B16010_039M           1.000000000  0.36011671  0.4169364683  0.47586834
## B16010_044M           0.360116711  1.00000000  0.3805215555  0.51544047
## B16010_045E           0.416936468  0.38052156  1.0000000000  0.85268067
## B16010_045M           0.475868335  0.51544047  0.8526806721  1.00000000
## B16010_046E           0.709656108  0.35517709  0.6665811800  0.64560456
## B16010_046M           0.689663940  0.50490238  0.6591301113  0.74710639
## B16010_047M           0.327867940  0.44426616  0.5053855782  0.60700751
## B16010_051M           0.429513581  0.43070554  0.7599227336  0.79956659
## B16010_052E           0.706475124  0.32638128  0.6003144474  0.59921628
## B16010_052M           0.643588208  0.41548978  0.5562221069  0.64019103
##                      B16010_046E B16010_046M  B16010_047M B16010_051M
## DRG_Cd               -0.02067443 -0.03115126 -0.019340490 -0.01302665
## Avg_Submtd_Cvrd_Chrg  0.13451410  0.16356878  0.054940690  0.10003926
## Avg_Mdcr_Pymt_Amt     0.15750867  0.18059345  0.112850240  0.11394349
## B16010_011M           0.19119023  0.25647615  0.204974283  0.16031900
## B16010_013E           0.54555366  0.43967875  0.161054192  0.27889419
## B16010_013M           0.63653448  0.63035591  0.305060970  0.39909937
## B16010_017E          -0.11987108 -0.10981196  0.004850307 -0.06728943
## B16010_020M           0.57712643  0.58894715  0.313022592  0.35628314
## B16010_023E          -0.13993746 -0.15666498 -0.064412509 -0.05471781
## B16010_023M          -0.07076421 -0.03063622  0.056306141  0.05165522
## B16010_026E           0.66039160  0.53763168  0.192017714  0.31600700
## B16010_026M           0.71162877  0.68653649  0.343539883  0.42197305
## B16010_033E           0.73980354  0.60600486  0.237930576  0.31872011
## B16010_033M           0.67257932  0.68393399  0.348456869  0.41375714
## B16010_039E           0.73380721  0.59222911  0.228496959  0.33514923
## B16010_039M           0.70965611  0.68966394  0.327867940  0.42951358
## B16010_044M           0.35517709  0.50490238  0.444266156  0.43070554
## B16010_045E           0.66658118  0.65913011  0.505385578  0.75992273
## B16010_045M           0.64560456  0.74710639  0.607007510  0.79956659
## B16010_046E           1.00000000  0.85681425  0.403762289  0.56737006
## B16010_046M           0.85681425  1.00000000  0.533593477  0.66090785
## B16010_047M           0.40376229  0.53359348  1.000000000  0.53930347
## B16010_051M           0.56737006  0.66090785  0.539303472  1.00000000
## B16010_052E           0.94221661  0.81246151  0.385883151  0.55111592
## B16010_052M           0.75313393  0.82313911  0.473675368  0.59301946
##                      B16010_052E B16010_052M
## DRG_Cd               -0.02233975 -0.02819228
## Avg_Submtd_Cvrd_Chrg  0.15270477  0.17220592
## Avg_Mdcr_Pymt_Amt     0.16279736  0.18381041
## B16010_011M           0.17708066  0.22679912
## B16010_013E           0.55893470  0.41104302
## B16010_013M           0.62713304  0.59072389
## B16010_017E          -0.12712484 -0.10952713
## B16010_020M           0.56550055  0.51316140
## B16010_023E          -0.14186545 -0.13556540
## B16010_023M          -0.08152799 -0.03134527
## B16010_026E           0.68480840  0.50298993
## B16010_026M           0.71279312  0.62267860
## B16010_033E           0.74482026  0.55979085
## B16010_033M           0.66630476  0.61624671
## B16010_039E           0.75912269  0.56262726
## B16010_039M           0.70647512  0.64358821
## B16010_044M           0.32638128  0.41548978
## B16010_045E           0.60031445  0.55622211
## B16010_045M           0.59921628  0.64019103
## B16010_046E           0.94221661  0.75313393
## B16010_046M           0.81246151  0.82313911
## B16010_047M           0.38588315  0.47367537
## B16010_051M           0.55111592  0.59301946
## B16010_052E           1.00000000  0.84806140
## B16010_052M           0.84806140  1.00000000
# The correlation matrix

Drop the columns that are highly correlated with one another. After this step, the three most significant predictors of Avg_Tot_Pymt_Amt are Avg_Mdcr_Pymt_Amt, B16010_017E, and B16010_052M.
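The drop step can be sketched in base R as follows. This is a minimal illustration, not the exact procedure used in the analysis: the helper name drop_correlated, the 0.8 cutoff, and the toy data frame are all assumptions made for the example.

```r
# Hypothetical helper: for every pair of columns whose absolute correlation
# exceeds the cutoff, keep the earlier column and drop the later one.
drop_correlated <- function(df, cutoff = 0.8) {
  cm <- abs(cor(df, use = "pairwise.complete.obs"))
  cm[upper.tri(cm, diag = TRUE)] <- 0  # look at each pair once (lower triangle)
  # Row i now holds correlations with columns 1..(i - 1) only
  to_drop <- apply(cm, 1, function(r) any(r > cutoff))
  df[, !to_drop, drop = FALSE]
}

# Toy data (made up): b is strongly correlated with a, c is not
toy <- data.frame(a = 1:10, b = (1:10)^2, c = rep(c(1, -1), 5))
kept <- names(drop_correlated(toy, cutoff = 0.8))
print(kept)
```

The caret package, which is loaded in the modeling step, offers findCorrelation() for the same purpose; the cutoff to use remains a judgment call.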

3. Modeling Approach:

Linear regression was chosen as the modeling technique due to its interpretability and ease of implementation. The dataset was split into training (70%) and testing (30%) sets to train and evaluate the model, respectively. The linear regression model was built using the training set with the following predictors: Average Medicare Payment Amount, B16010_017E, and B16010_052M.

MODELLING

Linear Regression model

# Load the required package
library(caret)


# Set the seed for reproducibility
set.seed(123)

# Split the dataset into training and testing sets
trainIndex <- createDataPartition(med_hos_census$Avg_Tot_Pymt_Amt, p = 0.7, list = FALSE)
train <- med_hos_census[trainIndex, ]
test <- med_hos_census[-trainIndex, ]

# Build the linear regression model using the training set
lm_model <- lm(Avg_Tot_Pymt_Amt ~ Avg_Mdcr_Pymt_Amt + B16010_017E + B16010_052M, data = train)

# Make predictions on the testing set
predictions <- predict(lm_model, test)

# Calculate the test-set R squared (squared correlation between predicted and actual values)
r_squared <- cor(predictions, test$Avg_Tot_Pymt_Amt)^2

# Calculate the RMSE
rmse <- sqrt(mean((predictions - test$Avg_Tot_Pymt_Amt)^2))

# Print the results
cat("R squared:", r_squared, "\n")
## R squared: 0.9475828
cat("RMSE:", rmse, "\n")
## RMSE: 1800.396

4. Model Evaluation:

The performance of the model was evaluated on the test set using two metrics: R-squared and Root Mean Squared Error (RMSE). The R-squared value of 0.9476, computed as the squared correlation between predicted and actual values, indicates that the model accounts for approximately 94.76% of the variance in the total payment amount. The RMSE of $1,800.40 is the square root of the mean squared prediction error, so a typical prediction differs from the actual value by roughly $1,800.
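The two metrics can be checked by hand on a small example. The sketch below uses made-up vectors (actual and predicted are assumptions, not project data) and also computes the alternative R-squared definition, 1 - SSE/SST, which can be lower than the squared correlation when predictions are biased.

```r
# Toy observed and predicted values (made up for illustration only)
actual    <- c(10, 12, 15, 18, 20)
predicted <- c(11, 12, 14, 19, 21)

# RMSE: square root of the mean squared prediction error
rmse <- sqrt(mean((predicted - actual)^2))

# R squared as the squared correlation (the definition used in the code above)
r2_cor <- cor(predicted, actual)^2

# R squared as 1 - SSE/SST (penalizes systematic over- or under-prediction)
sse <- sum((actual - predicted)^2)
sst <- sum((actual - mean(actual))^2)
r2_sse <- 1 - sse / sst

cat("RMSE:", rmse, "\n")
cat("R squared (correlation):", r2_cor, "\n")
cat("R squared (1 - SSE/SST):", r2_sse, "\n")
```

Reporting both versions on a held-out set is a cheap way to see whether the predictions are merely well correlated with the truth or also well calibrated.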

5. Conclusion:

The linear regression model developed in this project demonstrates strong predictive performance in estimating Medicare hospital payments based on the selected predictors. The high test-set R-squared value indicates that the model captures most of the variability in the total payment amount, and the RMSE of about $1,800 gives the typical size of a prediction error. Further refinement of the model and validation using additional datasets could provide insights into improving prediction accuracy and generalizability.

6. Recommendations:

The findings of this analysis can be valuable for healthcare providers, policymakers, and researchers in understanding the factors influencing Medicare hospital payments and optimizing payment processes. Continuous monitoring and evaluation of the model’s performance are recommended to ensure its reliability and relevance in real-world healthcare settings.

7. Future Work:

Future work could involve exploring alternative modeling techniques, such as machine learning algorithms, to potentially improve prediction accuracy and robustness. Inclusion of additional features or variables, such as hospital characteristics and regional demographics, could enhance the model’s explanatory power and predictive performance. Overall, this project provides valuable insights into Medicare hospital payments and lays the foundation for further research and analysis in this domain.

8. Reflection:

Reflection on My Growth

Since I embarked on the journey into the field of data science, I have constantly sought opportunities to challenge and apply the skills I have learned. However, it wasn’t until this class that I truly had the chance to put my abilities to the test on my own. In previous experiences, I often worked with datasets that were already cleaned or in good shape. This time, I embraced the challenge of sourcing raw data from websites, formatting it, cleaning it, and merging multiple datasets to derive meaningful insights.

Throughout the course, I learned valuable lessons in time management and self-directed learning. With the freedom to pursue my own objectives, I was able to effectively manage my time and focus on tasks that aligned with the course objectives. Despite the freedom, I encountered the challenge of having numerous ideas and struggling to narrow them down to a single focus. However, this process allowed me to refine my decision-making skills and prioritize tasks effectively.

Reflection on Active Participation in the Course Community

I engaged in the class community by participating in various activities, attending lectures, completing assignments, and contributing to discussions on platforms like Teams. One of the highlights of my participation was collaborating with two classmates on the Mini-competition, where we achieved the honor of being the first runners-up.

Participating in the Mini-competition was a rewarding experience that not only allowed me to apply my knowledge and skills but also provided an opportunity to collaborate with peers and showcase our collective abilities. The recognition we received in the form of a branded cup serves as a cherished memento of our success in this Statistics class and further motivates me to continue actively engaging in collaborative endeavors.

Here is the Mini-competition project:

Linear Regression Mini-competition Group 1 Report

Introduction

This report presents the analysis and findings of Group 1’s participation in the Linear Regression Mini-competition. The competition involved building a linear regression model to predict sentiment scores for news headlines based on various features.

knitr::opts_chunk$set(echo = TRUE) 

Libraries

The analysis was conducted using the R programming language. Several libraries were used for data manipulation, visualization, and modeling.

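library(tidymodels)  # modeling framework
library(tidyverse)   # data manipulation and visualization (attaches dplyr)
library(dplyr)
library(lmtest)      # regression diagnostic tests
library(car)         # durbinWatsonTest() and vif()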

Load Data

News <- read.csv("D:/GVSU Winter 2024/STA 631/news.csv")


summary(News)
##      IDLink          Title             Headline            Source         
##  Min.   :     1   Length:92431       Length:92431       Length:92431      
##  1st Qu.: 24551   Class :character   Class :character   Class :character  
##  Median : 52449   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 51807                                                           
##  3rd Qu.: 76784                                                           
##  Max.   :104802                                                           
##     Topic           PublishDate        SentimentTitle      SentimentHeadline 
##  Length:92431       Length:92431       Min.   :-0.950694   Min.   :-0.75543  
##  Class :character   Class :character   1st Qu.:-0.079025   1st Qu.:-0.11457  
##  Mode  :character   Mode  :character   Median : 0.000000   Median :-0.02606  
##                                        Mean   :-0.005415   Mean   :-0.02750  
##                                        3rd Qu.: 0.064385   3rd Qu.: 0.05965  
##                                        Max.   : 0.962354   Max.   : 0.96465  
##     Facebook         GooglePlus          LinkedIn       
##  Min.   :   -1.0   Min.   :  -1.000   Min.   :   -1.00  
##  1st Qu.:    0.0   1st Qu.:   0.000   1st Qu.:    0.00  
##  Median :    5.0   Median :   0.000   Median :    0.00  
##  Mean   :  114.1   Mean   :   3.928   Mean   :   16.69  
##  3rd Qu.:   34.0   3rd Qu.:   2.000   3rd Qu.:    4.00  
##  Max.   :49211.0   Max.   :1267.000   Max.   :20341.00

Test and Train Data after Splitting

The dataset was split into training and testing sets, namely “news_train.csv” and “news_test.csv,” respectively. This splitting allows for the development and evaluation of the regression model.

news_train <- read.csv("D:/GVSU Winter 2024/STA 631/news_train.csv")
news_test <- read.csv("D:/GVSU Winter 2024/STA 631/news_test.csv")

Build a Linear Model

A linear regression model was constructed using the training data. The model aimed to predict the sentiment score of news headlines based on features such as the sentiment score of the title, social media metrics (Facebook, GooglePlus, LinkedIn), and the topic of the news.

The summary of the linear model provided insights into the coefficients of the predictors, their significance levels, and the overall model fit.

model <- lm(SentimentHeadline ~ SentimentTitle + Facebook + GooglePlus + LinkedIn + Topic, data = news_train)
summary(model)
## 
## Call:
## lm(formula = SentimentHeadline ~ SentimentTitle + Facebook + 
##     GooglePlus + LinkedIn + Topic, data = news_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.73864 -0.08560  0.00257  0.08598  0.94173 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -3.879e-02  8.493e-04 -45.675   <2e-16 ***
## SentimentTitle  1.863e-01  3.745e-03  49.734   <2e-16 ***
## Facebook       -1.349e-06  9.660e-07  -1.397    0.163    
## GooglePlus     -1.491e-05  3.306e-05  -0.451    0.652    
## LinkedIn        3.933e-06  3.181e-06   1.236    0.216    
## Topicmicrosoft  2.377e-02  1.360e-03  17.476   <2e-16 ***
## Topicobama      2.144e-02  1.267e-03  16.929   <2e-16 ***
## Topicpalestine -2.885e-03  1.871e-03  -1.542    0.123    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1389 on 73936 degrees of freedom
## Multiple R-squared:  0.04027,    Adjusted R-squared:  0.04018 
## F-statistic: 443.2 on 7 and 73936 DF,  p-value: < 2.2e-16

Non-linearity of the Data

To assess the assumption of linearity, a plot of residuals versus fitted values was created. Additionally, the autocorrelation function (ACF) of the residuals and the Durbin-Watson test were used to check for autocorrelation.

plot(residuals(model) ~ fitted(model), main="Residuals vs Fitted", xlab="Fitted values", ylab="Residuals")
abline(h=0, col="red")
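The Durbin-Watson statistic mentioned above is simple enough to compute by hand, which makes its interpretation easier to see. The sketch below is illustrative only: it fits a model to simulated data with independent errors (x, y, and fit are assumptions, not the competition data), so the statistic should land near 2.

```r
# Simulated data with independent errors (illustration only)
set.seed(42)
n <- 200
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
fit <- lm(y ~ x)
e <- residuals(fit)

# Durbin-Watson statistic: sum of squared successive differences of the
# residuals divided by the residual sum of squares
dw <- sum(diff(e)^2) / sum(e^2)

# Lag-1 autocorrelation of the residuals; DW is approximately 2 * (1 - r1)
r1 <- sum(e[-1] * e[-n]) / sum(e^2)

cat("Durbin-Watson:", dw, "\n")
cat("2 * (1 - r1): ", 2 * (1 - r1), "\n")
```

Values near 2 correspond to little first-order autocorrelation, values toward 0 to positive autocorrelation, and values toward 4 to negative autocorrelation.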

Correlation of Error Terms

The ACF plot and the Durbin-Watson test check whether the residuals are correlated with their own lagged values. The Durbin-Watson statistic of 1.995 (p-value 0.526) is close to 2, giving no evidence of first-order autocorrelation.

acf(residuals(model))

durbinWatsonTest(model)
##  lag Autocorrelation D-W Statistic p-value
##    1     0.002245619      1.995487   0.526
##  Alternative hypothesis: rho != 0

Non-constant Variance of Error Terms (Heteroscedasticity)

A plot of residuals versus fitted values was examined for patterns, such as a funnel shape, that indicate non-constant variance of the error terms (heteroscedasticity).

plot(residuals(model) ~ fitted(model), main="Residuals vs Fitted for Heteroscedasticity", xlab="Fitted values", ylab="Residuals")

Outliers

Cook’s distance plots were used to identify potential outliers in the dataset. Outliers can significantly impact the model’s performance and should be carefully addressed.

plot(model, which = 4)

High Leverage Points

High leverage points, which can have a considerable influence on the regression coefficients, were identified using the leverage statistic.

plot(model, which = 5)
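The quantities behind these two plots can also be extracted as numbers. The sketch below is a minimal illustration on simulated data (x, y, and fit are assumptions, not the competition data), and the 2p/n and 4/n cutoffs are common rules of thumb rather than thresholds used in the original analysis.

```r
# Simulated data with one point placed far from the other x values
set.seed(7)
x <- c(rnorm(30), 8)
y <- 1 + 2 * x + rnorm(31)
fit <- lm(y ~ x)

h <- hatvalues(fit)        # leverage statistic for each observation
d <- cooks.distance(fit)   # Cook's distance for each observation

p <- length(coef(fit))     # number of estimated coefficients
n <- length(h)

# Rule-of-thumb flags (assumed cutoffs, adjust as needed)
high_leverage <- which(h > 2 * p / n)
influential   <- which(d > 4 / n)

print(high_leverage)
print(influential)
```

An observation can have high leverage without being influential: leverage depends only on the predictor values, while Cook's distance also accounts for how far the response falls from the fitted line.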

Collinearity

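The variance inflation factor (VIF) was computed to assess multicollinearity among the predictor variables. VIF values below 5 are commonly taken to indicate no serious multicollinearity; here every generalized VIF (GVIF) is below 1.6, so collinearity is not a concern for this model.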

vif(model)
##                    GVIF Df GVIF^(1/(2*Df))
## SentimentTitle 1.003079  1        1.001538
## Facebook       1.422718  1        1.192777
## GooglePlus     1.507001  1        1.227600
## LinkedIn       1.099393  1        1.048520
## Topic          1.046166  3        1.007550
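For numeric predictors, the VIF of a variable is 1 / (1 - R-squared), where R-squared comes from regressing that variable on the other predictors. The sketch below computes it by hand on simulated data; the helper vif_by_hand and the variables x1, x2, x3 are assumptions for illustration, and the calculation does not reproduce the generalized VIF that car reports for multi-level factors such as Topic.

```r
# Simulated predictors: x2 is deliberately correlated with x1
set.seed(11)
n <- 500
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.6)
x3 <- rnorm(n)
dat <- data.frame(x1, x2, x3)

# Hypothetical helper: regress each predictor on the others and
# convert the resulting R squared into a variance inflation factor
vif_by_hand <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    r2 <- summary(lm(reformulate(others, response = p), data = data))$r.squared
    1 / (1 - r2)
  })
}

v <- vif_by_hand(dat, c("x1", "x2", "x3"))
print(v)
```

In this toy setup x1 and x2 should show clearly inflated values while x3 stays near 1, which mirrors the pattern to look for in the vif() output above.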

Conclusion

In conclusion, Group 1 analyzed the dataset and developed a linear regression model for predicting sentiment scores of news headlines. The model explains only about 4% of the variance in headline sentiment (R-squared = 0.0403), with SentimentTitle and two of the Topic levels as the significant predictors. Model diagnostics were performed to check the model’s validity and identify potential areas for improvement. The insights gained from this analysis can guide future enhancements to the model for sentiment analysis in the context of news headlines.

References

[1] Huang, J. Z. (2014). An Introduction to Statistical Learning: With Applications in R By Gareth James, Trevor Hastie, Robert Tibshirani, Daniela Witten: Publisher: Springer, 2013. ISBN 978-1-4614-7137-0.

[2] Mackenzie, Andrew, et al. Predictive Healthcare Cost Modeling Using Regression. Issue 1. Society of Actuaries Conference, 2013.

[3] Penberthy, et al. “Predictors of Medicare Costs in Elderly Beneficiaries with Breast, Colorectal, Lung, or Prostate Cancer.” Health Care Management Science, 1999.

[4] Sushmita, Shanu, et al. “Population Cost Prediction on Public Healthcare Datasets.” Proceedings of the 5th International Conference on Digital Health 2015. ACM, 2015.