Preparing Data

#Preparing data 
heartatk4R <- read.delim("C:/Users/lokra_000/Desktop/week3/heartatk4R.txt")
x= heartatk4R
str(x)
## 'data.frame':    12844 obs. of  8 variables:
##  $ Patient  : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ DIAGNOSIS: int  41041 41041 41091 41081 41091 41091 41091 41091 41041 41041 ...
##  $ SEX      : Factor w/ 2 levels "F","M": 1 1 1 1 2 2 1 1 2 1 ...
##  $ DRG      : int  122 122 122 122 122 121 121 121 121 123 ...
##  $ DIED     : int  0 0 0 0 0 0 0 0 0 1 ...
##  $ CHARGES  : num  4752 3941 3657 1481 1681 ...
##  $ LOS      : int  10 6 5 2 1 9 15 15 2 1 ...
##  $ AGE      : int  79 34 76 80 55 84 84 70 76 65 ...
x$DIAGNOSIS = as.factor(x$DIAGNOSIS)
x$DRG= as.factor (x$DRG)
x$DIED= as.factor (x$DIED)
str (x)
## 'data.frame':    12844 obs. of  8 variables:
##  $ Patient  : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ DIAGNOSIS: Factor w/ 9 levels "41001","41011",..: 5 5 9 8 9 9 9 9 5 5 ...
##  $ SEX      : Factor w/ 2 levels "F","M": 1 1 1 1 2 2 1 1 2 1 ...
##  $ DRG      : Factor w/ 3 levels "121","122","123": 2 2 2 2 2 1 1 1 1 3 ...
##  $ DIED     : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 2 ...
##  $ CHARGES  : num  4752 3941 3657 1481 1681 ...
##  $ LOS      : int  10 6 5 2 1 9 15 15 2 1 ...
##  $ AGE      : int  79 34 76 80 55 84 84 70 76 65 ...
attach (x)
  1. CHALLENGE 1 Investigate weather men and women differ in their diagosis codes.
counts= table (DIAGNOSIS, SEX)
counts
##          SEX
## DIAGNOSIS    F    M
##     41001  175  292
##     41011  692 1132
##     41021  100  150
##     41031   96  185
##     41041  998 1667
##     41051   69   85
##     41071  727  976
##     41081  130  157
##     41091 2078 3135
counts/rowSums (counts)
##          SEX
## DIAGNOSIS         F         M
##     41001 0.3747323 0.6252677
##     41011 0.3793860 0.6206140
##     41021 0.4000000 0.6000000
##     41031 0.3416370 0.6583630
##     41041 0.3744841 0.6255159
##     41051 0.4480519 0.5519481
##     41071 0.4268937 0.5731063
##     41081 0.4529617 0.5470383
##     41091 0.3986188 0.6013812
barplot (counts,legend= rownames(counts),col=rainbow(10),xlab="SEX",beside=T)

mosaicplot (table (SEX,DIAGNOSIS),color=rainbow(9))

chisq.test (counts)
## 
##  Pearson's Chi-squared test
## 
## data:  counts
## X-squared = 24.106, df = 8, p-value = 0.0022

INTERPRETATION

Chi-square test null hypothesis - There is no difference in diagnosis code according to sex. Since, the P value of the Chi-square test is less than 0.01 (P value =0.0022), we reject this null hypothesis. So, there is significant difference in diagnosis code in between male and female patients. This can also be seen on the tables and bargraph, where for all the diagnosis codes there are more male patients in comparison to female patients.

CHALLENGE 2

library(lattice)
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.2.5
histogram(~AGE|factor(DIED),layout=c(1,2))

ggplot(x,aes(AGE,fill=DIED))+geom_density(alpha=.3)+theme_classic()

boxplot(AGE~DIED, col=rainbow(2),xlab="DIED",ylab="AGE")           

t.test(AGE~DIED)
## 
##  Welch Two Sample t-test
## 
## data:  AGE by DIED
## t = -31.797, df = 1927.5, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.106098  -9.815686
## sample estimates:
## mean in group 0 mean in group 1 
##        65.14195        75.60284

INTERPRETATION

From the t-test we can say that there is significant difference in mean age of people who die and who survive. Form the t-test we can say with 95% confidence that mean age of patients who survive is 9.8 to 11 years less than the mean age of patient who die. From the histogram too we can see that there are more older patients in died group. The right sided skewed density plot of died also indicates similar result. Again, we can see in the boxplot that median age of died patients is higher than that of survived one.

CHALLENGE 3

boxplot (CHARGES~DRG, col=rainbow(3),
         xlab="DRG",ylab="CHARGES",main="Charges by DRG Code")

library(ggplot2)
ggplot(x,aes(CHARGES,fill=DRG))+geom_density(alpha=.3)+theme_classic()
## Warning: Removed 699 rows containing non-finite values (stat_density).

histogram(~CHARGES|factor(DRG),layout=c(1,3))

#ANOVA
comp=aov(CHARGES~DRG,x)
summary(comp)
##                Df    Sum Sq   Mean Sq F value Pr(>F)    
## DRG             2 2.750e+10 1.375e+10   337.3 <2e-16 ***
## Residuals   12142 4.948e+11 4.076e+07                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 699 observations deleted due to missingness
#Pair-wise comparison
TukeyHSD(comp,"DRG")
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = CHARGES ~ DRG, data = x)
## 
## $DRG
##               diff        lwr         upr     p adj
## 122-121 -2937.8082 -3226.2396 -2649.37672 0.0000000
## 123-121 -3455.2386 -3915.4716 -2995.00562 0.0000000
## 123-122  -517.4304  -971.9504   -62.91046 0.0208405
#Since the data is not normal using Kruskal-Wallis Test
kruskal.test(CHARGES~DRG,x)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  CHARGES by DRG
## Kruskal-Wallis chi-squared = 865.93, df = 2, p-value < 2.2e-16

INTERPRETATION

From the boxplot it can bee seen that the median charges are different among three groups. Similar result can be seen from density plot and histogram. Charges are higher for people with complications (DRG 121). Whereas there are very few people who died (DRG 123) who are paying higher charges as seen in density plot and histogram.

From ANOVA we saw that there is significant difference in charges among the three groups. I did pairwise comparison using Tukey Test to know which groups are actually different. All the groups are significantly different from each other in terms of charges they pay.

Since the data is not normal. I tried using non parametic test. I got the similar result as ANOVA, that there is differnce in charges among three diagnosis groups.

CHALLENGE 4

bwplot(AGE~SEX|DIED)

INTERPRETATION *From the figure it can be said that females are older than males in both died and survived category. Also, it shows that both surving women and men are younger than the women and men who died respectively.

CHALLENGE 5

#Predicting length of stay (LOS)
fit=lm(LOS~SEX+AGE+DIAGNOSIS+DRG)
summary(fit)
## 
## Call:
## lm(formula = LOS ~ SEX + AGE + DIAGNOSIS + DRG)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.105  -3.071  -0.659   1.932  33.736 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     5.266789   0.334553  15.743  < 2e-16 ***
## SEXM           -0.630212   0.090281  -6.981 3.09e-12 ***
## AGE             0.070454   0.003404  20.700  < 2e-16 ***
## DIAGNOSIS41011 -0.220719   0.247110  -0.893  0.37177    
## DIAGNOSIS41021 -0.157027   0.373479  -0.420  0.67417    
## DIAGNOSIS41031 -1.039620   0.359964  -2.888  0.00388 ** 
## DIAGNOSIS41041 -0.635260   0.239501  -2.652  0.00800 ** 
## DIAGNOSIS41051 -0.601414   0.442687  -1.359  0.17431    
## DIAGNOSIS41071 -0.684724   0.250086  -2.738  0.00619 ** 
## DIAGNOSIS41081 -0.905362   0.357436  -2.533  0.01132 *  
## DIAGNOSIS41091 -0.499623   0.230380  -2.169  0.03012 *  
## DRG122         -1.916567   0.091761 -20.887  < 2e-16 ***
## DRG123         -5.234036   0.145148 -36.060  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.762 on 12831 degrees of freedom
## Multiple R-squared:  0.1339, Adjusted R-squared:  0.1331 
## F-statistic: 165.3 on 12 and 12831 DF,  p-value: < 2.2e-16

INTERPRETATION This model says that Sex, Age, DRG, and some of the diagnosis level are important in deciding length of stay. If you are a male the length of stay will decrease by 0.6 day if other factors are kept constant. Similarly, if your age incrase by 1, then the length of stay will also increase by 0.07 keeping all factors constant. Diagnosis 41011, 41021,41051 do not have significant effect on length of stay. But, if you are diagnosed with 41031, you will be staying 1.03 days shorter than the patients with 41001 keeping other factors constant. Similarly, if you do not have heart complications will have to stay 1.9 days less than the person having heart complications.


#Predicting heart complication
#Make subset of data that contains only DRG121 and DRG 122, omitting dead
challenge5=subset(x,DRG !=123, select=c(AGE,LOS,DIAGNOSIS,SEX,CHARGES,DIED,DRG))
count3=(table(challenge5$DRG,challenge5$DIED))
count3 #All the dead patients removed
##      
##          0    1
##   121 5387    0
##   122 6047    0
##   123    0    0
fit1=glm(challenge5$DRG~challenge5$SEX+
          challenge5$AGE+challenge5$LOS+
          challenge5$DIAGNOSIS+
          challenge5$CHARGES,data=challenge5, family=binomial())              
summary(fit1)
## 
## Call:
## glm(formula = challenge5$DRG ~ challenge5$SEX + challenge5$AGE + 
##     challenge5$LOS + challenge5$DIAGNOSIS + challenge5$CHARGES, 
##     family = binomial(), data = challenge5)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0787  -1.1254   0.6937   1.0402   2.4361  
## 
## Coefficients:
##                             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                2.787e+00  1.680e-01  16.583  < 2e-16 ***
## challenge5$SEXM            1.373e-01  4.382e-02   3.132  0.00174 ** 
## challenge5$AGE            -3.070e-02  1.705e-03 -18.001  < 2e-16 ***
## challenge5$LOS            -3.920e-02  6.793e-03  -5.770 7.93e-09 ***
## challenge5$DIAGNOSIS41011  5.648e-02  1.258e-01   0.449  0.65335    
## challenge5$DIAGNOSIS41021  4.437e-01  1.898e-01   2.337  0.01942 *  
## challenge5$DIAGNOSIS41031  3.441e-01  1.774e-01   1.940  0.05241 .  
## challenge5$DIAGNOSIS41041  2.319e-01  1.214e-01   1.910  0.05616 .  
## challenge5$DIAGNOSIS41051  1.618e-01  2.216e-01   0.730  0.46534    
## challenge5$DIAGNOSIS41071  2.399e-01  1.260e-01   1.904  0.05687 .  
## challenge5$DIAGNOSIS41081  2.583e-01  1.823e-01   1.417  0.15662    
## challenge5$DIAGNOSIS41091  5.463e-02  1.178e-01   0.464  0.64280    
## challenge5$CHARGES        -5.703e-05  5.124e-06 -11.130  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 14942  on 10808  degrees of freedom
## Residual deviance: 13730  on 10796  degrees of freedom
##   (625 observations deleted due to missingness)
## AIC: 13756
## 
## Number of Fisher Scoring iterations: 4
#Remove charges, it is highly corelated with LOS
fit2=glm(challenge5$DRG~challenge5$SEX+
          challenge5$AGE+challenge5$LOS+
          challenge5$DIAGNOSIS,
          data=challenge5, family=binomial())              
summary(fit2)
## 
## Call:
## glm(formula = challenge5$DRG ~ challenge5$SEX + challenge5$AGE + 
##     challenge5$LOS + challenge5$DIAGNOSIS, family = binomial(), 
##     data = challenge5)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0018  -1.1333   0.7247   1.0487   2.6010  
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                2.334477   0.157472  14.825  < 2e-16 ***
## challenge5$SEXM            0.134810   0.042315   3.186  0.00144 ** 
## challenge5$AGE            -0.027622   0.001621 -17.035  < 2e-16 ***
## challenge5$LOS            -0.093080   0.004706 -19.779  < 2e-16 ***
## challenge5$DIAGNOSIS41011  0.132677   0.120756   1.099  0.27189    
## challenge5$DIAGNOSIS41021  0.493566   0.181563   2.718  0.00656 ** 
## challenge5$DIAGNOSIS41031  0.423061   0.171797   2.463  0.01379 *  
## challenge5$DIAGNOSIS41041  0.307855   0.116504   2.642  0.00823 ** 
## challenge5$DIAGNOSIS41051  0.281370   0.215406   1.306  0.19148    
## challenge5$DIAGNOSIS41071  0.359256   0.120829   2.973  0.00295 ** 
## challenge5$DIAGNOSIS41081  0.310927   0.174938   1.777  0.07551 .  
## challenge5$DIAGNOSIS41091  0.174704   0.112827   1.548  0.12152    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 15813  on 11433  degrees of freedom
## Residual deviance: 14672  on 11422  degrees of freedom
## AIC: 14696
## 
## Number of Fisher Scoring iterations: 4

Interpretation *When we removed charges, we saw that we were some other significant factors seen (Eg. 41031). High corelation between LOS and CHARGES was masking those effect.

This logistic model shows the probability of getting heart complications. Here, 0 is for having complications and 1 is for not having complications. A negative coefficint indicates likey to have complications.

Here we can see AGE and LOS have negative coefficients. SO, if you are older there is higher probability of having heart complications keeping other factors constant. If you are male you are less likely to have heart complications when compared compared to female of same age keeping other factors constant.IF you are diagnosed with 41021 you are less likey to have heart complications in comaparison to the people who are diagnosed with 41001.