This document explains how to build a linear regression model in R. The data used are clinical trials data obtained from the ACCT website, which provides them as well-structured tables in text format. The document walks through building a linear regression model on training data, testing the model on test data, and evaluating the model.
For more detailed explanations and interpretations of the results, and for other articles on clinical trials, visit my blog at https://businessintelligencedw.blogspot.com/.
The two variables of interest in the dataset are the number of studies completed by a study sponsor and the number of studies with results posted by that sponsor. We want to create a model that predicts the number of results a sponsor would post (y, the dependent variable) from the number of studies it has completed (x, the independent variable).
#keep sponsors with at least one completed study and only the columns needed
sponsor_f_lm<-subset.data.frame(sponsor_f, subset=cnt_completed_status>=1,
  select=c("name","agency_class","sponsor_type","flag_sponsor_industry",
           "flag_sponsor_type_academic","flag_sponsor_type_hospital",
           "cnt_completed_status","cnt_results_submitted"))
Let’s take a look at the row counts and summary statistics of the x and y variables.
nrow(sponsor_f)
## [1] 28069
nrow(sponsor_f_lm)
## [1] 16890
count(sponsor_f, "agency_class")
## # A tibble: 28,069 x 7
## # Groups: name, agency_class, sponsor_type, flag_sponsor_industry,
## # flag_sponsor_type_academic [28,069]
## name agency_class sponsor_type flag_sponsor_in… flag_sponsor_ty…
## <fct> <fct> <fct> <fct> <fct>
## 1 [Red… "" NA 0 0
## 2 105 … Other Hospital 0 0
## 3 11 H… Industry NA 1 0
## 4 113t… Other Hospital 0 0
## 5 153r… Other Hospital 0 0
## 6 1Glo… Industry Academic 1 1
## 7 1st … Other Hospital 0 0
## 8 1st … Other Hospital 0 0
## 9 20/1… Industry NA 1 0
## 10 21st… Other NA 0 0
## # … with 28,059 more rows, and 2 more variables: `"agency_class"` <chr>,
## # n <int>
count(sponsor_f_lm, "agency_class")
## # A tibble: 16,890 x 7
## # Groups: name, agency_class, sponsor_type, flag_sponsor_industry,
## # flag_sponsor_type_academic [16,890]
## name agency_class sponsor_type flag_sponsor_in… flag_sponsor_ty…
## <fct> <fct> <fct> <fct> <fct>
## 1 153r… Other Hospital 0 0
## 2 1st … Other Hospital 0 0
## 3 21st… Other NA 0 0
## 4 22EON Industry NA 1 0
## 5 23an… Industry NA 1 0
## 6 251 … Other Hospital 0 0
## 7 2C T… Industry NA 1 0
## 8 3-C … Industry Academic 1 1
## 9 307 … Other Hospital 0 0
## 10 3E T… Industry NA 1 0
## # … with 16,880 more rows, and 2 more variables: `"agency_class"` <chr>,
## # n <int>
count(sponsor_f_lm, "cnt_completed_status")
## # A tibble: 16,890 x 7
## # Groups: name, agency_class, sponsor_type, flag_sponsor_industry,
## # flag_sponsor_type_academic [16,890]
## name agency_class sponsor_type flag_sponsor_in… flag_sponsor_ty…
## <fct> <fct> <fct> <fct> <fct>
## 1 153r… Other Hospital 0 0
## 2 1st … Other Hospital 0 0
## 3 21st… Other NA 0 0
## 4 22EON Industry NA 1 0
## 5 23an… Industry NA 1 0
## 6 251 … Other Hospital 0 0
## 7 2C T… Industry NA 1 0
## 8 3-C … Industry Academic 1 1
## 9 307 … Other Hospital 0 0
## 10 3E T… Industry NA 1 0
## # … with 16,880 more rows, and 2 more variables:
## # `"cnt_completed_status"` <chr>, n <int>
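The three count() outputs above list one row per sponsor rather than one row per category. A likely reason, assuming count() here is dplyr's, is that the column name was passed as a quoted string, which dplyr treats as a constant value, and that the data frame is still grouped (note the "Groups:" line in the output). A minimal sketch of a per-category tally follows; toy_sponsors and its values are hypothetical and invented for illustration.

```r
library(dplyr)

# Hypothetical toy data standing in for sponsor_f_lm; values are invented.
toy_sponsors <- tibble(
  name = c("A", "B", "C", "D", "E"),
  agency_class = c("Other", "Industry", "Other", "Industry", "Other"),
  cnt_completed_status = c(1, 3, 1, 7, 2)
) %>%
  group_by(name, agency_class)  # mimic the grouped data frame in the document

# Drop the grouping first, then pass the column as a bare name (no quotes).
by_class <- toy_sponsors %>%
  ungroup() %>%
  count(agency_class)

print(by_class)
```

Applied to the document's data, the same pattern would read count(ungroup(sponsor_f_lm), agency_class).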
summary(sponsor_f_lm$cnt_completed_status)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 1.000 9.918 4.000 2830.000
summary(sponsor_f_lm$cnt_results_submitted)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.7848 0.0000 575.0000
Histogram: Let’s take a look at the distribution of completed studies.
#Create histogram (requires the ggplot2 and plotly packages)
library(ggplot2)
library(plotly)
plot_hist_1<-ggplot(data = sponsor_f_lm,aes(x=cnt_completed_status))+
geom_histogram(binwidth = 5)+
ggtitle("Histogram: Completed Studies")+
xlab("Completed")+
ylab("Sponsors Count")+
#xlim(0,200)+
#ylim(0,600)+
theme_bw()
ggplot_hist_1<-ggplotly(plot_hist_1)
ggplot_hist_1
Let’s check whether there is any relationship between our x and y variables.
Scatterplot 1:
#scatterplot
plot_scatter_1<-ggplot(data=sponsor_f_lm, aes(x=cnt_completed_status, y=cnt_results_submitted))+
geom_point()+
geom_smooth(method = "lm")+
ggtitle("Completed Vs Results Posted")+
xlab("Completed Studies")+
ylab("Results Posted")+
theme_bw()
ggplot_scatter_1<-ggplotly(plot_scatter_1)
ggplot_scatter_1
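A numeric companion to the scatterplot is the Pearson correlation coefficient from base R's cor(). The sketch below uses simulated data, since the sponsor data is not reproduced here; x_sim and y_sim are hypothetical stand-ins for cnt_completed_status and cnt_results_submitted.

```r
# Simulated stand-ins for the two variables; this is not the real sponsor data.
set.seed(42)
x_sim <- rpois(200, lambda = 5)                  # hypothetical completed-study counts
y_sim <- rbinom(200, size = x_sim, prob = 0.3)   # hypothetical results-posted counts

# Pearson correlation: values near 1 or -1 indicate a strong linear relationship,
# values near 0 indicate little or no linear relationship.
r_xy <- cor(x_sim, y_sim, method = "pearson")
print(r_xy)
```

With the document's data, the analogous call would be cor(sponsor_f_lm$cnt_completed_status, sponsor_f_lm$cnt_results_submitted).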
Let’s divide our data into two sets. We will randomly assign about 70% of the rows to the training set and the remaining 30% to the test set.
#create linear regression using training and test data
#generate training and test data
set.seed(123)
sampling_index<- sample(x=2, nrow(sponsor_f_lm), replace = TRUE, prob = c(0.7,0.3))
head(sampling_index)
## [1] 1 2 1 2 2 1
class(sponsor_f_lm)
## [1] "grouped_df" "tbl_df" "tbl" "data.frame"
train_set<-sponsor_f[sampling_index==1,]
## Warning: Length of logical index must be 1 or 28069, not 16890
test_set<-sponsor_f[sampling_index==2,]
## Warning: Length of logical index must be 1 or 28069, not 16890
nrow(sponsor_f_lm)
## [1] 16890
nrow(train_set)
## [1] 19739
nrow(test_set)
## [1] 8330
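The warnings above arise because sampling_index has one entry per row of sponsor_f_lm (16,890) but is used to index sponsor_f (28,069 rows). The two resulting sets add up to 28,069 rows, so they appear to contain every row of sponsor_f, including the sponsors that the earlier filter removed. A minimal sketch of a split that draws the labels from, and applies them to, the same data frame is shown below; toy_df is hypothetical and invented for illustration.

```r
# Hypothetical toy data; in the document this role is played by sponsor_f_lm.
toy_df <- data.frame(
  cnt_completed_status = 1:100,
  cnt_results_submitted = rep(0:4, times = 20)
)

set.seed(123)
# One label per row of toy_df: 1 = training (about 70%), 2 = test (about 30%).
split_index <- sample(x = 2, size = nrow(toy_df), replace = TRUE,
                      prob = c(0.7, 0.3))

# Index the same data frame the labels were generated for.
train_toy <- toy_df[split_index == 1, ]
test_toy  <- toy_df[split_index == 2, ]

# Every row lands in exactly one of the two sets.
stopifnot(nrow(train_toy) + nrow(test_toy) == nrow(toy_df))
```

Applied to the document's data, the same pattern would read sponsor_f_lm[sampling_index==1,] and sponsor_f_lm[sampling_index==2,]. The row counts and model output shown in this document were produced with the original split and would change.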
#create regression model
lin_model_1<-lm(formula=cnt_results_submitted ~ cnt_completed_status_lstyr,data=train_set)
summary(lin_model_1)
##
## Call:
## lm(formula = cnt_results_submitted ~ cnt_completed_status_lstyr,
## data = train_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -105.80 0.27 0.27 0.27 448.31
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.26914 0.03751 -7.175 7.49e-13 ***
## cnt_completed_status_lstyr 1.52960 0.01259 121.502 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.195 on 19737 degrees of freedom
## Multiple R-squared: 0.4279, Adjusted R-squared: 0.4279
## F-statistic: 1.476e+04 on 1 and 19737 DF, p-value: < 2.2e-16
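To test a fitted model on held-out data, one common approach is to predict on the test set and compute the root mean squared error (RMSE) and the squared correlation between predicted and observed values. The sketch below is self-contained and uses simulated data with hypothetical names (sim_df, x, y); it does not reproduce the document's model or results.

```r
# Simulated data with a known linear relationship plus noise (hypothetical).
set.seed(123)
sim_df <- data.frame(x = runif(300, min = 0, max = 50))
sim_df$y <- 1.5 * sim_df$x + rnorm(300, sd = 5)

# 70/30 split, indexing the same data frame the labels were drawn for.
split_index <- sample(x = 2, size = nrow(sim_df), replace = TRUE,
                      prob = c(0.7, 0.3))
train_sim <- sim_df[split_index == 1, ]
test_sim  <- sim_df[split_index == 2, ]

# Fit on the training rows only.
fit_sim <- lm(y ~ x, data = train_sim)

# Predict on the held-out rows and compare with the observed values.
pred_test <- predict(fit_sim, newdata = test_sim)
rmse_test <- sqrt(mean((test_sim$y - pred_test)^2))
r2_test   <- cor(test_sim$y, pred_test)^2

print(c(rmse = rmse_test, r_squared = r2_test))
```

With the document's objects, the analogous call would be predict(lin_model_1, newdata=test_set), followed by the same RMSE calculation against test_set$cnt_results_submitted. If the predictor or response contains missing values, mean(..., na.rm=TRUE) may be needed.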
Scatterplot 2: