1 R语言介绍

1.1R语言历史

R是S语言的一种实现。S语言是由 AT&T贝尔实验室开发的一种用来进行数据探索、统计分析、作图的解释型语言。

最初S语言的实现版本主要是S-PLUS。S-PLUS是一个商业软件,它基于S语言,并由MathSoft公司的统计科学部进一步完善

后来Auckland大学的Robert Gentleman 和 Ross Ihaka 及其他志愿人员开发了一个R系统。

R的使用与S-PLUS有很多类似之处,两个软件有一定的兼容性。

1.2R的特点

1.有效的数据处理和保存机制。

2.拥有一整套数组和矩阵的操作运算符。

3.一系列连贯而又完整的数据分析中间工具。

4.图形统计可以对数据直接进行分析和显示,可用于多种图形设备。

5.一种相当完善、简洁和高效的程序设计语言。它包括条件语句、循环语句、用户自定义的递归函数以及输入输出接口。

6.R语言是彻底面向对象的统计编程语言。

7.R语言和其它编程语言、数据库之间有很好的接口。

8.R语言是自由软件,可以放心大胆地使用,但其功能却不比任何其它同类软件差。

9.R语言具有丰富的网上资源

2 Rstudio和R的基本操作

2.1工作目录和工作空间

显示当前的工作目录

getwd()

修改当前的工作目录

setwd()

列出当前工作空间中的对象

ls()

移除(删除)一个或多个对象

rm(objectlist)

保存工作空间到文件myfile中(默认后缀 .RData)

save.image(“myfile”)

保存指定对象到一个文件中

save(object, file=“myfile”)

读取一个工作空间到当前会话中

load(“myfile”)

2.2创建向量和矩阵,以及对它们的基本操作

数据结构:标量、向量、矩阵、数组、数据框、列表

标量与向量:

标量只含一个元素,向量为多个相同模式的标量组成的一维数组,使用c()进行组合

x1<-c(1,2,3,4,5,6)
x2<-x1*2
length(x1)
## [1] 6
class(x1)
## [1] "numeric"
矩阵:

相同模式元素的二维数组,通过matrix()创建

rbind(x1,x2)
##    [,1] [,2] [,3] [,4] [,5] [,6]
## x1    1    2    3    4    5    6
## x2    2    4    6    8   10   12
cbind(x1,x2)
##      x1 x2
## [1,]  1  2
## [2,]  2  4
## [3,]  3  6
## [4,]  4  8
## [5,]  5 10
## [6,]  6 12
Dmat<-cbind(x1,x2);Dmat
##      x1 x2
## [1,]  1  2
## [2,]  2  4
## [3,]  3  6
## [4,]  4  8
## [5,]  5 10
## [6,]  6 12
class(Dmat)                 
## [1] "matrix"
Dmat2<-rbind(x1,x2)
class(Dmat2)
## [1] "matrix"

使用c函数结合的变量组成的是矩阵

数组:

维度可以大于2,元素模式相同,通过array()创建

dim1 <- c("A1","A2")
dim2 <- c("B1","B2","B3","B4")
dim3 <- c("C1","C2","C3")
myarray <- array(1:24,c(2,4,3),dimnames = list(dim1,dim2,dim3));myarray
## , , C1
## 
##    B1 B2 B3 B4
## A1  1  3  5  7
## A2  2  4  6  8
## 
## , , C2
## 
##    B1 B2 B3 B4
## A1  9 11 13 15
## A2 10 12 14 16
## 
## , , C3
## 
##    B1 B2 B3 B4
## A1 17 19 21 23
## A2 18 20 22 24
mylist <- list(Dmat,Dmat2,myarray)
mylist[3]
## [[1]]
## , , C1
## 
##    B1 B2 B3 B4
## A1  1  3  5  7
## A2  2  4  6  8
## 
## , , C2
## 
##    B1 B2 B3 B4
## A1  9 11 13 15
## A2 10 12 14 16
## 
## , , C3
## 
##    B1 B2 B3 B4
## A1 17 19 21 23
## A2 18 20 22 24

数据框:

不同的列可以包含不同模式的数据

Dframe<-as.data.frame(cbind(x1,x2));Dframe
dframe <- data.frame("n"=x1,"m"=x2)
class(Dframe)
## [1] "data.frame"
Dmat;Dframe  
##      x1 x2
## [1,]  1  2
## [2,]  2  4
## [3,]  3  6
## [4,]  4  8
## [5,]  5 10
## [6,]  6 12
df <- iris

矩阵和数据框格式是不一样的,矩阵是由行列组成的,数据框是由记录和变量组成


3 建表

清除环境

rm(list=ls())      
setwd("J:/port")

3.1读取外部数据

.csv

Stata,SAS,SPSS(foreign)

.txt{read.table/read.csv,sep=“./”}

library(rJava)
library(xlsxjars)
library(xlsx)
port <- read.xlsx("port.xls",sheetIndex=2,header=T,encoding='UTF-8')

port2 <- read.csv(“port.csv”,header=T,encoding=‘UTF-8’,row.names = 1)

3.2看数据结构

names(port)
##  [1] "no"              "hosp"            "patient"        
##  [4] "birth_y"         "birth_m"         "birth_d"        
##  [7] "gender"          "height"          "weight"         
## [10] "location"        "diagnosis"       "complication"   
## [13] "device"          "Picc_n"          "Picc_m"         
## [16] "Port_m"          "drug"            "irriate"        
## [19] "impact"          "mobility"        "care"           
## [22] "care_h"          "wtp"             "outpatient"     
## [25] "hospitalization" "maintain"        "eq1"            
## [28] "eq2"             "eq3"             "eq4"            
## [31] "eq5"             "v51"             "v52"            
## [34] "v53"             "v54"             "v55"            
## [37] "v56"             "v57"             "v58"            
## [40] "utility"
str(port)
## 'data.frame':    433 obs. of  40 variables:
##  $ no             : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ hosp           : Factor w/ 2 levels "G","S": 1 1 1 1 1 1 1 1 1 1 ...
##  $ patient        : Factor w/ 396 levels "AJL","AL","BCG",..: 147 60 38 372 375 373 326 156 42 225 ...
##  $ birth_y        : num  1978 1978 1971 1978 1988 ...
##  $ birth_m        : num  4 5 3 2 8 5 3 8 12 5 ...
##  $ birth_d        : num  11 6 8 3 12 1 2 22 5 2 ...
##  $ gender         : num  2 1 1 1 2 1 1 2 1 2 ...
##  $ height         : num  156 165 173 160 172 169 166 157 170 155 ...
##  $ weight         : num  46 40 85 50 63 58 62 71 60 55 ...
##  $ location       : Factor w/ 114 levels "安徽-安庆市-太湖县",..: 19 25 25 31 67 80 31 32 25 11 ...
##  $ diagnosis      : num  4 4 4 4 1 4 4 4 4 4 ...
##  $ complication   : num  2 2 2 2 2 2 2 2 2 2 ...
##  $ device         : num  2 1 1 1 2 1 1 1 1 1 ...
##  $ Picc_n         : num  -3 1 1 1 -3 1 1 1 1 1 ...
##  $ Picc_m         : num  -3 5 5 2 -3 3 5 3 2 2 ...
##  $ Port_m         : Factor w/ 30 levels "-3","1","10",..: 30 1 1 1 28 1 1 1 1 1 ...
##  $ drug           : Factor w/ 131 levels "5-FU","avbd",..: 69 106 97 106 69 58 127 97 106 78 ...
##  $ irriate        : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ impact         : num  2 3 4 2 2 2 2 2 2 2 ...
##  $ mobility       : num  2 2 4 2 2 2 1 1 2 2 ...
##  $ care           : num  2 2 2 2 2 2 2 2 2 2 ...
##  $ care_h         : num  -3 -3 -3 -3 -3 -3 -3 -3 -3 -3 ...
##  $ wtp            : num  5 30 15 8 20 8 9 8 8 8 ...
##  $ outpatient     : num  1 15 4 4 1 4 4 4 4 4 ...
##  $ hospitalization: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ maintain       : Factor w/ 10 levels "0","1","2","3",..: 2 10 8 5 2 5 5 5 5 5 ...
##  $ eq1            : num  1 1 1 1 2 1 1 1 1 1 ...
##  $ eq2            : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ eq3            : num  1 2 3 1 2 2 1 1 1 1 ...
##  $ eq4            : num  1 1 1 1 2 1 1 1 1 2 ...
##  $ eq5            : num  1 1 2 1 1 2 2 1 2 1 ...
##  $ v51            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ v52            : num  0 1 1 1 0 1 1 1 1 1 ...
##  $ v53            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ v54            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ v55            : num  1 1 0 0 1 0 1 1 1 1 ...
##  $ v56            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ v57            : num  1 0 0 0 0 0 1 1 1 1 ...
##  $ v58            : num  1 0 1 0 0 0 0 0 0 0 ...
##  $ utility        : num  1 0.955 0.844 1 0.831 0.906 0.951 1 0.951 0.942 ...
head(port)
tail(port)
3.3数据整理
port = transform(port,date=2018)
port$age <- port$date-port$birth_y
hist(port$age)

mean(port$age)
## [1] 51.99769
groupvec <- c(0,40,60,100)
labels <- c('young','midaged','senior')
port$agecat <- with(port,
                     cut(age,breaks = groupvec,labels = labels))
3.4编辑数据框
library(reshape)
names(port)[12] <- "comorbidity"

fix(port) df2 <- edit(port)
renamedf <- rename(df,c(oldname1=“newname”, oldname2=“newname2”)) #变量重命名

从字符串“abcdef”中提取出第2到4个位置上的字符

substr(“abcdef”, 2, 4)

substring(“abcdef”, 1:6, 1:6)

从字符串“abcdef”中提取出第1到1、2到2—6到6位置上的字符,即把字符串单个化

port$location <- as.character(port$location)
port$loc <- substr(port$location,1,2)
port$loc <- factor(port$loc)
levels(port$loc)
##  [1] "安徽" "福建" "广东" "广西" "贵州" "海南" "河南" "湖北" "湖南" "江苏"
## [11] "江西" "辽宁" "山东"

截取字符

3.5频数分布

freq <- lapply(port[,c(7,11:13)],table) ;freq 
## $gender
## 
##   1   2 
## 179 254 
## 
## $diagnosis
## 
##   1   2   3   4 
## 113  24  23 273 
## 
## $comorbidity
## 
##   1   2 
##  53 380 
## 
## $device
## 
##   1   2 
## 225 208
options(digits = 4)
prop <- lapply(freq[1:4],prop.table);prop
## $gender
## 
##      1      2 
## 0.4134 0.5866 
## 
## $diagnosis
## 
##       1       2       3       4 
## 0.26097 0.05543 0.05312 0.63048 
## 
## $comorbidity
## 
##      1      2 
## 0.1224 0.8776 
## 
## $device
## 
##      1      2 
## 0.5196 0.4804

3.6建表

character <- c(names(freq[1]),names(freq[[1]]))
NOC <- c(NA,paste0(freq[[1]],"(",round(prop[[1]]*100,2),")"))
characteristic <- data.frame("characteritic"=character,
                             "number of case"=NOC);characteristic
char <- NULL
for(i in 1:4){
  character <- c(names(freq[i]),names(freq[[i]]))
  NOC <- c(NA,paste0(freq[[i]],"(",round(prop[[i]]*100,2),")"))
  characteristics <- data.frame("characteritics"=character,
                                "number of case(%)"=NOC)
  char <- rbind(char,characteristics)};char

3.7输出

write.csv(char,"characteristics.csv",row.names = F)

3.8正态性检验

library(ggpubr)
## Loading required package: ggplot2
## Loading required package: magrittr

载入包

ggdensity(port$utility)

port$device <- as.factor(port$device)
ggdensity(port,x="utility",color="device",fill="device",palette = c("blue","pink"),
          rug = T,add = "mean")

密度图

gghistogram(port,x="utility",color="device",fill="device",palette = c("blue","pink"),
          rug = T,add = "mean")
## Warning: Using `bins = 30` by default. Pick better value with the argument
## `bins`.

直方图

ggqqplot(port,x="utility",color="device")

qq图

shapiro.test(port$utility)
## 
##  Shapiro-Wilk normality test
## 
## data:  port$utility
## W = 0.69, p-value <2e-16
shapiro.test(port$utility[port$device==1])
## 
##  Shapiro-Wilk normality test
## 
## data:  port$utility[port$device == 1]
## W = 0.81, p-value = 1e-15
shapiro.test(port$utility[port$device==2])
## 
##  Shapiro-Wilk normality test
## 
## data:  port$utility[port$device == 2]
## W = 0.53, p-value <2e-16
tapply(port$utility, port$device, shapiro.test)
## $`1`
## 
##  Shapiro-Wilk normality test
## 
## data:  X[[i]]
## W = 0.81, p-value = 1e-15
## 
## 
## $`2`
## 
##  Shapiro-Wilk normality test
## 
## data:  X[[i]]
## W = 0.53, p-value <2e-16

正态性检验

p<0.05 不符合正态性检验,不能用t.test,用wilcox.test

wilcox.test(utility~device,data = port,excat="F",alternative="l")
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  utility by device
## W = 17000, p-value = 4e-07
## alternative hypothesis: true location shift is less than 0
wilcox.test(utility~device,data = port,excat="F",alternative="t")
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  utility by device
## W = 17000, p-value = 8e-07
## alternative hypothesis: true location shift is not equal to 0
wilcox.test(port$utility,mu=0.95,excat="F",alternative="t")
## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  port$utility
## V = 45000, p-value = 0.5
## alternative hypothesis: true location is not equal to 0.95

wilcox.test

wilcox.test(utility~device,data = port,excat=“F”,alternative=“t”,paired=T)

t.test

若p>0.05,数据符合正态分布,则用t.test

方差齐性检验

var.test(utility~device,data = port)
## 
##  F test to compare two variances
## 
## data:  utility by device
## F = 0.96, num df = 220, denom df = 210, p-value = 0.8
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.7339 1.2542
## sample estimates:
## ratio of variances 
##             0.9601

p>0.05 方差齐,经典t检验

t.test

非配对

t.test(utility~device,data = port,var.equal=T)
## 
##  Two Sample t-test
## 
## data:  utility by device
## t = -3.7, df = 430, p-value = 2e-04
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.05344 -0.01662
## sample estimates:
## mean in group 1 mean in group 2 
##          0.9144          0.9494
diff <- with(ToothGrowth,len[supp=="OJ"]-len[supp=="VC"])
shapiro.test(diff)
## 
##  Shapiro-Wilk normality test
## 
## data:  diff
## W = 0.95, p-value = 0.1
t.test(len~supp,data = ToothGrowth,paired=T)
## 
##  Paired t-test
## 
## data:  len by supp
## t = 3.3, df = 29, p-value = 0.003
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  1.409 5.991
## sample estimates:
## mean of the differences 
##                     3.7

配对

diff <- with(port,utility[device==1]-utility[device==2]) shapiro.test(diff) t.test(utility~device,data = port,excat=“F”,alternative=“t”,paired=T)

t.test(port$utility,mu=0.95,excat="F",alternative="t")
## 
##  One Sample t-test
## 
## data:  port$utility
## t = -4, df = 430, p-value = 9e-05
## alternative hypothesis: true mean is not equal to 0.95
## 95 percent confidence interval:
##  0.9219 0.9405
## sample estimates:
## mean of x 
##    0.9312

单样本

作图

ggboxplot(port,x="device",y="utility",color = "device",palette = "lancet",
          xlab = "Device",ylab = "Utility value",title = "UTILITY")+
  stat_compare_means(method = "wilcox.test",paired = F,method.args = list(alternative="l"),
                     label.x = 2,label.y = 1.2)

方差分析

方差齐性检验 转为因子型变量

port$diagnosisf <- factor(port$diagnosis,ordered = F)

正态性检验

with(port,tapply(utility, diagnosisf, shapiro.test))
## $`1`
## 
##  Shapiro-Wilk normality test
## 
## data:  X[[i]]
## W = 0.7, p-value = 7e-14
## 
## 
## $`2`
## 
##  Shapiro-Wilk normality test
## 
## data:  X[[i]]
## W = 0.53, p-value = 1e-07
## 
## 
## $`3`
## 
##  Shapiro-Wilk normality test
## 
## data:  X[[i]]
## W = 0.67, p-value = 7e-06
## 
## 
## $`4`
## 
##  Shapiro-Wilk normality test
## 
## data:  X[[i]]
## W = 0.7, p-value <2e-16

方差齐性检验 wilcox.test适用于两样本比较

library(car)
## Loading required package: carData
leveneTest(utility~diagnosisf,port)

p>0.05

单因素ANOVA

aov1 <- aov(utility~diagnosisf,port)
summary(aov1)
##              Df Sum Sq Mean Sq F value Pr(>F)  
## diagnosisf    3   0.08 0.02658    2.75  0.042 *
## Residuals   429   4.14 0.00965                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
res1 <- aov1$residuals
shapiro.test(res1)
## 
##  Shapiro-Wilk normality test
## 
## data:  res1
## W = 0.73, p-value <2e-16
ggqqplot(res1)

qqplot

诊断模型

TukeyHSD(aov1)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = utility ~ diagnosisf, data = port)
## 
## $diagnosisf
##          diff      lwr      upr  p adj
## 2-1  0.022047 -0.03489 0.078983 0.7502
## 3-1 -0.003607 -0.06156 0.054341 0.9985
## 4-1 -0.023366 -0.05170 0.004971 0.1463
## 3-2 -0.025654 -0.09957 0.048264 0.8074
## 4-2 -0.045413 -0.09935 0.008522 0.1328
## 4-3 -0.019759 -0.07476 0.035243 0.7907

两两比较

ggboxplot(port,x="diagnosisf",y="utility",fill = "diagnosisf",palette = "lancet")+
          stat_compare_means(method = "anova",label.x =2,label.y = 0.25)

4 regression

4.1 选取数据集

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
## 
##     recode
## The following object is masked from 'package:reshape':
## 
##     rename
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
port_lm <- select(port,hosp,agecat,height,weight,gender,diagnosis,comorbidity,device,irriate,utility)%>%
           mutate(bmi=weight/(height/100)^2)%>%
  within({bmicat <- NA 
  bmicat[bmi >= 24] <- "overweight"
  bmicat[bmi <= 18.5] <- "underweight"
  bmicat[bmi < 24 & bmi> 18.5] <- "ideal weight"})%>%
  select(-height,-weight,-bmi)
library(dplyr)
port_lm <- port[,c(2,43,7:9,11:13,18,40)]%>%
           mutate(bmi=weight/(height/100)^2)%>%
  within({bmicat <- NA 
  bmicat[bmi >= 24] <- "overweight"
  bmicat[bmi <= 18.5] <- "underweight"
  bmicat[bmi < 24 & bmi> 18.5] <- "ideal weight"})

4.2 转为因子型变量

port_lm$hosp <- as.factor(port_lm$hosp)
port_lm$gender <- as.factor(port_lm$gender)
port_lm$diagnosis <- as.factor(port_lm$diagnosis)
port_lm$device <- as.factor(port_lm$device)
port_lm$comorbidity <- as.factor(port_lm$comorbidity)
port_lm$irriate <- as.factor(port_lm$irriate)
port_lm$bmicat <- as.factor(port_lm$bmicat)
port_lm$agecat <- as.factor(port_lm$agecat)

4.3 fit.lm

fit.lm <- lm(utility~.,data = port_lm)
summary(fit.lm)
## 
## Call:
## lm(formula = utility ~ ., data = port_lm)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.8904 -0.0265  0.0219  0.0509  0.1338 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1.102707   0.474180    2.33  0.02052 *  
## hospS              0.040699   0.011184    3.64  0.00031 ***
## agecatmidaged      0.014396   0.012454    1.16  0.24836    
## agecatsenior       0.005101   0.014872    0.34  0.73180    
## gender2           -0.031687   0.014492   -2.19  0.02934 *  
## height            -0.001309   0.002917   -0.45  0.65383    
## weight             0.000563   0.003659    0.15  0.87776    
## diagnosis2         0.005875   0.023220    0.25  0.80038    
## diagnosis3        -0.013680   0.024214   -0.56  0.57240    
## diagnosis4        -0.013383   0.012764   -1.05  0.29503    
## comorbidity2      -0.005367   0.015668   -0.34  0.73213    
## device2            0.032933   0.009812    3.36  0.00086 ***
## irriate2          -0.007481   0.011120   -0.67  0.50145    
## bmi                0.000169   0.009552    0.02  0.98593    
## bmicatoverweight  -0.016878   0.014595   -1.16  0.24818    
## bmicatunderweight -0.009996   0.018160   -0.55  0.58232    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0951 on 417 degrees of freedom
## Multiple R-squared:  0.106,  Adjusted R-squared:  0.0736 
## F-statistic: 3.29 on 15 and 417 DF,  p-value: 3.24e-05

filt.lm 置信区间

confint(fit.lm)
##                       2.5 %    97.5 %
## (Intercept)        0.170626  2.034788
## hospS              0.018714  0.062684
## agecatmidaged     -0.010084  0.038876
## agecatsenior      -0.024133  0.034334
## gender2           -0.060173 -0.003200
## height            -0.007044  0.004425
## weight            -0.006630  0.007756
## diagnosis2        -0.039768  0.051518
## diagnosis3        -0.061278  0.033917
## diagnosis4        -0.038473  0.011707
## comorbidity2      -0.036165  0.025431
## device2            0.013645  0.052220
## irriate2          -0.029339  0.014376
## bmi               -0.018607  0.018944
## bmicatoverweight  -0.045566  0.011811
## bmicatunderweight -0.045694  0.025702
filt.lm 模型诊断
par(mfrow=c(2,2))
plot(fit.lm)

4.4 剔除离群值
port[c(256,281,295),]

查看离群观测

fit.lm2 <- lm(utility~.,data = port_lm[-c(256),])  
summary(fit.lm2)
## 
## Call:
## lm(formula = utility ~ ., data = port_lm[-c(256), ])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.5491 -0.0269  0.0191  0.0485  0.1283 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        0.937432   0.420424    2.23    0.026 *  
## hospS              0.044799   0.009917    4.52  8.2e-06 ***
## agecatmidaged      0.012865   0.011035    1.17    0.244    
## agecatsenior       0.011103   0.013189    0.84    0.400    
## gender2           -0.027412   0.012847   -2.13    0.033 *  
## height            -0.000383   0.002586   -0.15    0.882    
## weight            -0.000924   0.003245   -0.28    0.776    
## diagnosis2         0.008287   0.020575    0.40    0.687    
## diagnosis3        -0.007231   0.021463   -0.34    0.736    
## diagnosis4        -0.003972   0.011344   -0.35    0.726    
## comorbidity2       0.003749   0.013908    0.27    0.788    
## device2            0.039665   0.008717    4.55  7.0e-06 ***
## irriate2          -0.009633   0.009854   -0.98    0.329    
## bmi                0.003732   0.008470    0.44    0.660    
## bmicatoverweight  -0.006351   0.012969   -0.49    0.625    
## bmicatunderweight -0.011952   0.016092   -0.74    0.458    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0843 on 416 degrees of freedom
## Multiple R-squared:  0.146,  Adjusted R-squared:  0.116 
## F-statistic: 4.76 on 15 and 416 DF,  p-value: 1.6e-08
par(mfrow=c(2,2))
plot(fit.lm2)

summary(lm(utility~diagnosis*gender,data=port_lm[-256,]))

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
stepAIC(fit.lm2,direction = "backward")
## Start:  AIC=-2122
## utility ~ hosp + agecat + gender + height + weight + diagnosis + 
##     comorbidity + device + irriate + bmi + bmicat
## 
##               Df Sum of Sq  RSS   AIC
## - diagnosis    3    0.0041 2.96 -2127
## - bmicat       2    0.0068 2.96 -2125
## - agecat       2    0.0098 2.96 -2124
## - height       1    0.0002 2.95 -2124
## - comorbidity  1    0.0005 2.96 -2124
## - weight       1    0.0006 2.96 -2124
## - bmi          1    0.0014 2.96 -2123
## - irriate      1    0.0068 2.96 -2123
## <none>                     2.95 -2122
## - gender       1    0.0323 2.99 -2119
## - hosp         1    0.1449 3.10 -2103
## - device       1    0.1471 3.10 -2103
## 
## Step:  AIC=-2127
## utility ~ hosp + agecat + gender + height + weight + comorbidity + 
##     device + irriate + bmi + bmicat
## 
##               Df Sum of Sq  RSS   AIC
## - bmicat       2    0.0077 2.97 -2130
## - agecat       2    0.0097 2.97 -2130
## - height       1    0.0000 2.96 -2129
## - comorbidity  1    0.0005 2.96 -2129
## - weight       1    0.0011 2.96 -2129
## - bmi          1    0.0022 2.96 -2129
## - irriate      1    0.0072 2.96 -2128
## <none>                     2.96 -2127
## - gender       1    0.0328 2.99 -2124
## - hosp         1    0.1574 3.12 -2107
## - device       1    0.1684 3.13 -2105
## 
## Step:  AIC=-2130
## utility ~ hosp + agecat + gender + height + weight + comorbidity + 
##     device + irriate + bmi
## 
##               Df Sum of Sq  RSS   AIC
## - agecat       2    0.0101 2.98 -2132
## - height       1    0.0000 2.97 -2132
## - comorbidity  1    0.0007 2.97 -2132
## - weight       1    0.0013 2.97 -2132
## - bmi          1    0.0023 2.97 -2132
## - irriate      1    0.0080 2.97 -2131
## <none>                     2.97 -2130
## - gender       1    0.0322 3.00 -2127
## - hosp         1    0.1574 3.12 -2110
## - device       1    0.1762 3.14 -2107
## 
## Step:  AIC=-2132
## utility ~ hosp + gender + height + weight + comorbidity + device + 
##     irriate + bmi
## 
##               Df Sum of Sq  RSS   AIC
## - comorbidity  1    0.0003 2.98 -2134
## - height       1    0.0003 2.98 -2134
## - weight       1    0.0006 2.98 -2134
## - bmi          1    0.0015 2.98 -2134
## - irriate      1    0.0105 2.99 -2133
## <none>                     2.98 -2132
## - gender       1    0.0412 3.02 -2128
## - hosp         1    0.1693 3.15 -2111
## - device       1    0.1870 3.16 -2108
## 
## Step:  AIC=-2134
## utility ~ hosp + gender + height + weight + device + irriate + 
##     bmi
## 
##           Df Sum of Sq  RSS   AIC
## - height   1    0.0003 2.98 -2136
## - weight   1    0.0006 2.98 -2136
## - bmi      1    0.0015 2.98 -2136
## - irriate  1    0.0103 2.99 -2135
## <none>                 2.98 -2134
## - gender   1    0.0410 3.02 -2130
## - hosp     1    0.1833 3.16 -2111
## - device   1    0.1886 3.17 -2110
## 
## Step:  AIC=-2136
## utility ~ hosp + gender + weight + device + irriate + bmi
## 
##           Df Sum of Sq  RSS   AIC
## - irriate  1    0.0105 2.99 -2137
## <none>                 2.98 -2136
## - weight   1    0.0192 3.00 -2136
## - bmi      1    0.0290 3.01 -2134
## - gender   1    0.0417 3.02 -2132
## - hosp     1    0.1830 3.16 -2113
## - device   1    0.1883 3.17 -2112
## 
## Step:  AIC=-2137
## utility ~ hosp + gender + weight + device + bmi
## 
##          Df Sum of Sq  RSS   AIC
## <none>                2.99 -2137
## - weight  1    0.0198 3.01 -2136
## - bmi     1    0.0309 3.02 -2134
## - gender  1    0.0422 3.03 -2133
## - device  1    0.1866 3.17 -2113
## - hosp    1    0.2356 3.22 -2106
## 
## Call:
## lm(formula = utility ~ hosp + gender + weight + device + bmi, 
##     data = port_lm[-c(256), ])
## 
## Coefficients:
## (Intercept)        hospS      gender2       weight      device2  
##     0.87426      0.04859     -0.02752     -0.00161      0.04232  
##         bmi  
##     0.00568
fit.lmstep <- lm(utility ~ hosp + gender + device, data = port_lm[-256, ])
summary(fit.lmstep)
## 
## Call:
## lm(formula = utility ~ hosp + gender + device, data = port_lm[-256, 
##     ])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.5621 -0.0315  0.0243  0.0528  0.1174 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.89821    0.00771  116.56  < 2e-16 ***
## hospS        0.05049    0.00820    6.16  1.7e-09 ***
## gender2     -0.01561    0.00844   -1.85    0.065 .  
## device2      0.04263    0.00823    5.18  3.4e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.084 on 428 degrees of freedom
## Multiple R-squared:  0.127,  Adjusted R-squared:  0.121 
## F-statistic: 20.8 on 3 and 428 DF,  p-value: 1.38e-12
4.5 模型比较

library(leaps) leaps <- regsubsets(utility ~ hosp + gender + device,data = port_lm,nbest = 4) par(mfrow=c(1,1)) plot(leaps,scale = “adjr2”)

4.6 制表

install.packages(“stargazer”)

library(stargazer)
## 
## Please cite as:
##  Hlavac, Marek (2018). stargazer: Well-Formatted Regression and Summary Statistics Tables.
##  R package version 5.2.2. https://CRAN.R-project.org/package=stargazer
stargazer(port,header = F,type="text")
## 
## ====================================================================
## Statistic        N    Mean    St. Dev.  Min  Pctl(25) Pctl(75)  Max 
## --------------------------------------------------------------------
## no              433  217.000  125.100    1     109      325     433 
## birth_y         433 1,966.000  13.260  1,934  1,956    1,975   1,999
## birth_m         433   6.418    3.438     1      3        9      12  
## birth_d         433  12.260    8.858     1      5        20     31  
## gender          433   1.587    0.493     1      1        2       2  
## height          433  162.300   7.365    140    156      168     185 
## weight          433  59.450    11.760   38      52       65     156 
## diagnosis       433   3.053    1.316     1      1        4       4  
## comorbidity     433   1.878    0.328     1      2        2       2  
## Picc_n          433  -0.822    2.134    -3      -3       1       6  
## Picc_m          433   0.411    3.576    -3      -3       3      12  
## irriate         433   1.275    0.447     1      1        2       2  
## impact          433   1.471    0.581     1      1        2       4  
## mobility        433   1.480    0.545     1      1        2       4  
## care            433   1.409    0.492     1      1        2       2  
## care_h          433   3.952    7.835    -3      -3       7      24  
## wtp             433  12.660    23.810    0      2        10     200 
## outpatient      433   1.993    2.138     0      1        4      31  
## hospitalization 433   3.222    5.072     0      0        4      31  
## eq1             433   1.118    0.422     1      1        1       5  
## eq2             433   1.226    0.481     1      1        1       5  
## eq3             433   1.263    0.509     1      1        1       5  
## eq4             433   1.256    0.458     1      1        1       3  
## eq5             433   1.434    0.523     1      1        2       3  
## v51             433   0.014    0.117     0      0        0       1  
## v52             433   0.499    0.501     0      0        1       1  
## v53             433   0.023    0.150     0      0        0       1  
## v54             433   0.016    0.126     0      0        0       1  
## v55             433   0.711    0.454     0      0        1       1  
## v56             433   0.021    0.143     0      0        0       1  
## v57             433   0.617    0.487     0      0        1       1  
## v58             433   0.383    0.487     0      0        1       1  
## utility         433   0.931    0.099     0     0.9       1       1  
## date            433 2,018.000  0.000   2,018  2,018    2,018   2,018
## age             433  52.000    13.260   19      43       62     84  
## --------------------------------------------------------------------
stargazer(fit.lm,fit.lm2,title="result",type="text",align = T)
## 
## result
## ===================================================================
##                                   Dependent variable:              
##                     -----------------------------------------------
##                                         utility                    
##                               (1)                     (2)          
## -------------------------------------------------------------------
## hospS                      0.041***                0.045***        
##                             (0.011)                 (0.010)        
##                                                                    
## agecatmidaged                0.014                   0.013         
##                             (0.012)                 (0.011)        
##                                                                    
## agecatsenior                 0.005                   0.011         
##                             (0.015)                 (0.013)        
##                                                                    
## gender2                    -0.032**                -0.027**        
##                             (0.014)                 (0.013)        
##                                                                    
## height                      -0.001                  -0.0004        
##                             (0.003)                 (0.003)        
##                                                                    
## weight                       0.001                  -0.001         
##                             (0.004)                 (0.003)        
##                                                                    
## diagnosis2                   0.006                   0.008         
##                             (0.023)                 (0.021)        
##                                                                    
## diagnosis3                  -0.014                  -0.007         
##                             (0.024)                 (0.021)        
##                                                                    
## diagnosis4                  -0.013                  -0.004         
##                             (0.013)                 (0.011)        
##                                                                    
## comorbidity2                -0.005                   0.004         
##                             (0.016)                 (0.014)        
##                                                                    
## device2                    0.033***                0.040***        
##                             (0.010)                 (0.009)        
##                                                                    
## irriate2                    -0.007                  -0.010         
##                             (0.011)                 (0.010)        
##                                                                    
## bmi                         0.0002                   0.004         
##                             (0.010)                 (0.008)        
##                                                                    
## bmicatoverweight            -0.017                  -0.006         
##                             (0.015)                 (0.013)        
##                                                                    
## bmicatunderweight           -0.010                  -0.012         
##                             (0.018)                 (0.016)        
##                                                                    
## Constant                    1.103**                 0.937**        
##                             (0.474)                 (0.420)        
##                                                                    
## -------------------------------------------------------------------
## Observations                  433                     432          
## R2                           0.106                   0.146         
## Adjusted R2                  0.074                   0.116         
## Residual Std. Error    0.095 (df = 417)        0.084 (df = 416)    
## F Statistic         3.289*** (df = 15; 417) 4.758*** (df = 15; 416)
## ===================================================================
## Note:                                   *p<0.1; **p<0.05; ***p<0.01

5.example2

library(readxl)
CCAP <- read_excel("j:/port/CCAP.xlsx")
library(dplyr)
CCAP_qc <- CCAP%>%group_by(病案号,入院日期)%>%filter(row_number()==1)%>%ungroup()
CCAP1 <- CCAP_qc%>%filter(遴选意见=="入选"&主要诊断疾病名称=="支气管肺炎"|
                       遴选意见=="入选"&主要诊断疾病名称=="喘息型支气管肺炎")
names(CCAP1)
##  [1] "序号"                   "出院科别"              
##  [3] "病案号"                 "出院日期"              
##  [5] "单病种上报"             "主要诊断疾病名称"      
##  [7] "年份"                   "遴选意见"              
##  [9] "性别"                   "年龄"                  
## [11] "入院日期"               "入院科别"              
## [13] "天数"                   "ICD码"                 
## [15] "诊断类型"               "疾病名称"              
## [17] "主要诊断疾病名称__1"    "主要诊断ICD码"         
## [19] "其他诊断疾病名称"       "总费用"                
## [21] "床位费"                 "护理费"                
## [23] "西药费"                 "中药费"                
## [25] "化验费"                 "诊疗费"                
## [27] "手术费"                 "检查费"                
## [29] "麻醉费"                 "其他费用"              
## [31] "其他费"                 "治疗材料费"            
## [33] "西药费其中抗菌药物费用" "临床路径病例编号"      
## [35] "临床路径病例"           "新付款方式编号"        
## [37] "新付款方式"             "自费金额"
5.1缺失值处理

is.na(CCAP1)

sum(is.na(CCAP1))
## [1] 202
table(is.na(CCAP1)) 
## 
## FALSE  TRUE 
## 11692   202
CCAP1[complete.cases(CCAP1),]  #列出没有缺失值的行
nrow(CCAP1[complete.cases(CCAP1),])  #计算没有缺失值的样本量
## [1] 143
CCAP2 <- na.omit(CCAP1)        #删缺失所在行
table(is.na(CCAP2)) 
## 
## FALSE 
##  5434
5.2计算住院天数
CCAP2$出院日期 <- as.Date(CCAP2$出院日期)
CCAP2$入院日期 <- as.Date(CCAP2$入院日期)
CCAP2$住院天数=CCAP2$出院日期-CCAP2$入院日期
5.3划分费用构成
CCAP3 <- CCAP2%>%mutate(
  综合医疗服务费=床位费+护理费+诊疗费+其他费用,综合医疗服务费占比=round(综合医疗服务费/总费用*100,2),
  药费=西药费+中药费,药费占比=round(药费/总费用*100,2),
  手术治疗费=手术费+麻醉费+治疗材料费,手术治疗费占比=round(手术治疗费/总费用*100,2),
  诊断费=化验费+检查费,诊断费占比=round(诊断费/总费用*100,2),
  其他=其他费,其他占比=round(其他/总费用*100,2))%>%mutate(主要诊断疾病名称="儿童社区获得性肺炎")

6 大规模数据分析

6.1检测系统可用核数

library(parallel)
detectCores()                
## [1] 8

6.2 work place

ls() #work space

memory.size() #查看当前work place内存使用

gc()

memory.limit() #查看当前设置下最大内存(?“Memory-limits”)

memory.limit(newLimit)

memory.size(NA) #查看当前设置下最大内存(?“Memory-limits”)

7包的学习

install.packages(“ggplot2”)

library(ggplot2)

help(package=“ggplot2”)