Preparing Data
#Preparing data
heartatk4R <- read.delim("C:/Users/lokra_000/Desktop/week3/heartatk4R.txt")
x= heartatk4R
str(x)
## 'data.frame': 12844 obs. of 8 variables:
## $ Patient : int 1 2 3 4 5 6 7 8 9 10 ...
## $ DIAGNOSIS: int 41041 41041 41091 41081 41091 41091 41091 41091 41041 41041 ...
## $ SEX : Factor w/ 2 levels "F","M": 1 1 1 1 2 2 1 1 2 1 ...
## $ DRG : int 122 122 122 122 122 121 121 121 121 123 ...
## $ DIED : int 0 0 0 0 0 0 0 0 0 1 ...
## $ CHARGES : num 4752 3941 3657 1481 1681 ...
## $ LOS : int 10 6 5 2 1 9 15 15 2 1 ...
## $ AGE : int 79 34 76 80 55 84 84 70 76 65 ...
x$DIAGNOSIS = as.factor(x$DIAGNOSIS)
x$DRG= as.factor (x$DRG)
x$DIED= as.factor (x$DIED)
str (x)
## 'data.frame': 12844 obs. of 8 variables:
## $ Patient : int 1 2 3 4 5 6 7 8 9 10 ...
## $ DIAGNOSIS: Factor w/ 9 levels "41001","41011",..: 5 5 9 8 9 9 9 9 5 5 ...
## $ SEX : Factor w/ 2 levels "F","M": 1 1 1 1 2 2 1 1 2 1 ...
## $ DRG : Factor w/ 3 levels "121","122","123": 2 2 2 2 2 1 1 1 1 3 ...
## $ DIED : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 2 ...
## $ CHARGES : num 4752 3941 3657 1481 1681 ...
## $ LOS : int 10 6 5 2 1 9 15 15 2 1 ...
## $ AGE : int 79 34 76 80 55 84 84 70 76 65 ...
attach (x)
counts= table (DIAGNOSIS, SEX)
counts
## SEX
## DIAGNOSIS F M
## 41001 175 292
## 41011 692 1132
## 41021 100 150
## 41031 96 185
## 41041 998 1667
## 41051 69 85
## 41071 727 976
## 41081 130 157
## 41091 2078 3135
counts/rowSums (counts)
## SEX
## DIAGNOSIS F M
## 41001 0.3747323 0.6252677
## 41011 0.3793860 0.6206140
## 41021 0.4000000 0.6000000
## 41031 0.3416370 0.6583630
## 41041 0.3744841 0.6255159
## 41051 0.4480519 0.5519481
## 41071 0.4268937 0.5731063
## 41081 0.4529617 0.5470383
## 41091 0.3986188 0.6013812
barplot (counts,legend= rownames(counts),col=rainbow(10),xlab="SEX",beside=T)
mosaicplot (table (SEX,DIAGNOSIS),color=rainbow(9))
chisq.test (counts)
##
## Pearson's Chi-squared test
##
## data: counts
## X-squared = 24.106, df = 8, p-value = 0.0022
INTERPRETATION
Chi-square test null hypothesis - There is no difference in diagnosis code according to sex. Since, the P value of the Chi-square test is less than 0.01 (P value =0.0022), we reject this null hypothesis. So, there is significant difference in diagnosis code in between male and female patients. This can also be seen on the tables and bargraph, where for all the diagnosis codes there are more male patients in comparison to female patients.
CHALLENGE 2
library(lattice)
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.2.5
histogram(~AGE|factor(DIED),layout=c(1,2))
ggplot(x,aes(AGE,fill=DIED))+geom_density(alpha=.3)+theme_classic()
boxplot(AGE~DIED, col=rainbow(2),xlab="DIED",ylab="AGE")
t.test(AGE~DIED)
##
## Welch Two Sample t-test
##
## data: AGE by DIED
## t = -31.797, df = 1927.5, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -11.106098 -9.815686
## sample estimates:
## mean in group 0 mean in group 1
## 65.14195 75.60284
INTERPRETATION
From the t-test we can say that there is significant difference in mean age of people who die and who survive. Form the t-test we can say with 95% confidence that mean age of patients who survive is 9.8 to 11 years less than the mean age of patient who die. From the histogram too we can see that there are more older patients in died group. The right sided skewed density plot of died also indicates similar result. Again, we can see in the boxplot that median age of died patients is higher than that of survived one.
CHALLENGE 3
boxplot (CHARGES~DRG, col=rainbow(3),
xlab="DRG",ylab="CHARGES",main="Charges by DRG Code")
library(ggplot2)
ggplot(x,aes(CHARGES,fill=DRG))+geom_density(alpha=.3)+theme_classic()
## Warning: Removed 699 rows containing non-finite values (stat_density).
histogram(~CHARGES|factor(DRG),layout=c(1,3))
#ANOVA
comp=aov(CHARGES~DRG,x)
summary(comp)
## Df Sum Sq Mean Sq F value Pr(>F)
## DRG 2 2.750e+10 1.375e+10 337.3 <2e-16 ***
## Residuals 12142 4.948e+11 4.076e+07
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 699 observations deleted due to missingness
#Pair-wise comparison
TukeyHSD(comp,"DRG")
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = CHARGES ~ DRG, data = x)
##
## $DRG
## diff lwr upr p adj
## 122-121 -2937.8082 -3226.2396 -2649.37672 0.0000000
## 123-121 -3455.2386 -3915.4716 -2995.00562 0.0000000
## 123-122 -517.4304 -971.9504 -62.91046 0.0208405
#Since the data is not normal using Kruskal-Wallis Test
kruskal.test(CHARGES~DRG,x)
##
## Kruskal-Wallis rank sum test
##
## data: CHARGES by DRG
## Kruskal-Wallis chi-squared = 865.93, df = 2, p-value < 2.2e-16
INTERPRETATION
From the boxplot it can bee seen that the median charges are different among three groups. Similar result can be seen from density plot and histogram. Charges are higher for people with complications (DRG 121). Whereas there are very few people who died (DRG 123) who are paying higher charges as seen in density plot and histogram.
From ANOVA we saw that there is significant difference in charges among the three groups. I did pairwise comparison using Tukey Test to know which groups are actually different. All the groups are significantly different from each other in terms of charges they pay.
Since the data is not normal. I tried using non parametic test. I got the similar result as ANOVA, that there is differnce in charges among three diagnosis groups.
CHALLENGE 4
bwplot(AGE~SEX|DIED)
INTERPRETATION *From the figure it can be said that females are older than males in both died and survived category. Also, it shows that both surving women and men are younger than the women and men who died respectively.
CHALLENGE 5
#Predicting length of stay (LOS)
fit=lm(LOS~SEX+AGE+DIAGNOSIS+DRG)
summary(fit)
##
## Call:
## lm(formula = LOS ~ SEX + AGE + DIAGNOSIS + DRG)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.105 -3.071 -0.659 1.932 33.736
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.266789 0.334553 15.743 < 2e-16 ***
## SEXM -0.630212 0.090281 -6.981 3.09e-12 ***
## AGE 0.070454 0.003404 20.700 < 2e-16 ***
## DIAGNOSIS41011 -0.220719 0.247110 -0.893 0.37177
## DIAGNOSIS41021 -0.157027 0.373479 -0.420 0.67417
## DIAGNOSIS41031 -1.039620 0.359964 -2.888 0.00388 **
## DIAGNOSIS41041 -0.635260 0.239501 -2.652 0.00800 **
## DIAGNOSIS41051 -0.601414 0.442687 -1.359 0.17431
## DIAGNOSIS41071 -0.684724 0.250086 -2.738 0.00619 **
## DIAGNOSIS41081 -0.905362 0.357436 -2.533 0.01132 *
## DIAGNOSIS41091 -0.499623 0.230380 -2.169 0.03012 *
## DRG122 -1.916567 0.091761 -20.887 < 2e-16 ***
## DRG123 -5.234036 0.145148 -36.060 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.762 on 12831 degrees of freedom
## Multiple R-squared: 0.1339, Adjusted R-squared: 0.1331
## F-statistic: 165.3 on 12 and 12831 DF, p-value: < 2.2e-16
INTERPRETATION This model says that Sex, Age, DRG, and some of the diagnosis level are important in deciding length of stay. If you are a male the length of stay will decrease by 0.6 day if other factors are kept constant. Similarly, if your age incrase by 1, then the length of stay will also increase by 0.07 keeping all factors constant. Diagnosis 41011, 41021,41051 do not have significant effect on length of stay. But, if you are diagnosed with 41031, you will be staying 1.03 days shorter than the patients with 41001 keeping other factors constant. Similarly, if you do not have heart complications will have to stay 1.9 days less than the person having heart complications.
#Predicting heart complication
#Make subset of data that contains only DRG121 and DRG 122, omitting dead
challenge5=subset(x,DRG !=123, select=c(AGE,LOS,DIAGNOSIS,SEX,CHARGES,DIED,DRG))
count3=(table(challenge5$DRG,challenge5$DIED))
count3 #All the dead patients removed
##
## 0 1
## 121 5387 0
## 122 6047 0
## 123 0 0
fit1=glm(challenge5$DRG~challenge5$SEX+
challenge5$AGE+challenge5$LOS+
challenge5$DIAGNOSIS+
challenge5$CHARGES,data=challenge5, family=binomial())
summary(fit1)
##
## Call:
## glm(formula = challenge5$DRG ~ challenge5$SEX + challenge5$AGE +
## challenge5$LOS + challenge5$DIAGNOSIS + challenge5$CHARGES,
## family = binomial(), data = challenge5)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0787 -1.1254 0.6937 1.0402 2.4361
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.787e+00 1.680e-01 16.583 < 2e-16 ***
## challenge5$SEXM 1.373e-01 4.382e-02 3.132 0.00174 **
## challenge5$AGE -3.070e-02 1.705e-03 -18.001 < 2e-16 ***
## challenge5$LOS -3.920e-02 6.793e-03 -5.770 7.93e-09 ***
## challenge5$DIAGNOSIS41011 5.648e-02 1.258e-01 0.449 0.65335
## challenge5$DIAGNOSIS41021 4.437e-01 1.898e-01 2.337 0.01942 *
## challenge5$DIAGNOSIS41031 3.441e-01 1.774e-01 1.940 0.05241 .
## challenge5$DIAGNOSIS41041 2.319e-01 1.214e-01 1.910 0.05616 .
## challenge5$DIAGNOSIS41051 1.618e-01 2.216e-01 0.730 0.46534
## challenge5$DIAGNOSIS41071 2.399e-01 1.260e-01 1.904 0.05687 .
## challenge5$DIAGNOSIS41081 2.583e-01 1.823e-01 1.417 0.15662
## challenge5$DIAGNOSIS41091 5.463e-02 1.178e-01 0.464 0.64280
## challenge5$CHARGES -5.703e-05 5.124e-06 -11.130 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 14942 on 10808 degrees of freedom
## Residual deviance: 13730 on 10796 degrees of freedom
## (625 observations deleted due to missingness)
## AIC: 13756
##
## Number of Fisher Scoring iterations: 4
#Remove charges, it is highly corelated with LOS
fit2=glm(challenge5$DRG~challenge5$SEX+
challenge5$AGE+challenge5$LOS+
challenge5$DIAGNOSIS,
data=challenge5, family=binomial())
summary(fit2)
##
## Call:
## glm(formula = challenge5$DRG ~ challenge5$SEX + challenge5$AGE +
## challenge5$LOS + challenge5$DIAGNOSIS, family = binomial(),
## data = challenge5)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0018 -1.1333 0.7247 1.0487 2.6010
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.334477 0.157472 14.825 < 2e-16 ***
## challenge5$SEXM 0.134810 0.042315 3.186 0.00144 **
## challenge5$AGE -0.027622 0.001621 -17.035 < 2e-16 ***
## challenge5$LOS -0.093080 0.004706 -19.779 < 2e-16 ***
## challenge5$DIAGNOSIS41011 0.132677 0.120756 1.099 0.27189
## challenge5$DIAGNOSIS41021 0.493566 0.181563 2.718 0.00656 **
## challenge5$DIAGNOSIS41031 0.423061 0.171797 2.463 0.01379 *
## challenge5$DIAGNOSIS41041 0.307855 0.116504 2.642 0.00823 **
## challenge5$DIAGNOSIS41051 0.281370 0.215406 1.306 0.19148
## challenge5$DIAGNOSIS41071 0.359256 0.120829 2.973 0.00295 **
## challenge5$DIAGNOSIS41081 0.310927 0.174938 1.777 0.07551 .
## challenge5$DIAGNOSIS41091 0.174704 0.112827 1.548 0.12152
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 15813 on 11433 degrees of freedom
## Residual deviance: 14672 on 11422 degrees of freedom
## AIC: 14696
##
## Number of Fisher Scoring iterations: 4
Interpretation *When we removed charges, we saw that we were some other significant factors seen (Eg. 41031). High corelation between LOS and CHARGES was masking those effect.
This logistic model shows the probability of getting heart complications. Here, 0 is for having complications and 1 is for not having complications. A negative coefficint indicates likey to have complications.
Here we can see AGE and LOS have negative coefficients. SO, if you are older there is higher probability of having heart complications keeping other factors constant. If you are male you are less likely to have heart complications when compared compared to female of same age keeping other factors constant.IF you are diagnosed with 41021 you are less likey to have heart complications in comaparison to the people who are diagnosed with 41001.