Overview

This document explains how to build a linear regression model. The data is clinical trials data obtained from the ACCT website, which provides it as well-structured tables in text format. The document walks through building a linear regression model on training data, testing the model on test data, and evaluating it, all using the R programming language.
For a more detailed explanation and interpretation of the results, and for other articles on clinical trials, visit my blog at https://businessintelligencedw.blogspot.com/.

Linear regression model

The two variables of interest in the dataset are the number of studies completed by a sponsor and the number of studies for which that sponsor has posted results. We want a model that predicts the number of results a sponsor will post (y, the dependent variable) from its number of completed studies (x, the independent variable).
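
In symbols, the model described above is a simple linear regression. The notation below (the coefficients, the error term and the sponsor index i) is standard textbook notation and is an addition here, not something taken from the original analysis:

```latex
y_i = \beta_0 + \beta_1 x_i + \varepsilon_i
```

Here y_i is the number of results posted by sponsor i, x_i is that sponsor's number of completed studies, \beta_0 is the intercept, \beta_1 is the slope (the expected change in posted results per additional completed study), and \varepsilon_i is the error term.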

#keep sponsors with at least one completed study, and only the columns needed for the model
sponsor_f_lm<-subset.data.frame(sponsor_f, subset=cnt_completed_status>=1, select=c("name","agency_class","sponsor_type","flag_sponsor_industry","flag_sponsor_type_academic", "flag_sponsor_type_hospital","cnt_completed_status","cnt_results_submitted"))

Let’s take a look at the row counts and summary statistics of the x and y variables.

nrow(sponsor_f)
## [1] 28069
nrow(sponsor_f_lm)
## [1] 16890
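#NOTE: the data frame is grouped and the column name is quoted, so this and the following count() calls return one row per group plus a constant column, not a tally of the column's values
count(sponsor_f, "agency_class")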
## # A tibble: 28,069 x 7
## # Groups:   name, agency_class, sponsor_type, flag_sponsor_industry,
## #   flag_sponsor_type_academic [28,069]
##    name  agency_class sponsor_type flag_sponsor_in… flag_sponsor_ty…
##    <fct> <fct>        <fct>        <fct>            <fct>           
##  1 [Red… ""           NA           0                0               
##  2 105 … Other        Hospital     0                0               
##  3 11 H… Industry     NA           1                0               
##  4 113t… Other        Hospital     0                0               
##  5 153r… Other        Hospital     0                0               
##  6 1Glo… Industry     Academic     1                1               
##  7 1st … Other        Hospital     0                0               
##  8 1st … Other        Hospital     0                0               
##  9 20/1… Industry     NA           1                0               
## 10 21st… Other        NA           0                0               
## # … with 28,059 more rows, and 2 more variables: `"agency_class"` <chr>,
## #   n <int>
count(sponsor_f_lm, "agency_class")
## # A tibble: 16,890 x 7
## # Groups:   name, agency_class, sponsor_type, flag_sponsor_industry,
## #   flag_sponsor_type_academic [16,890]
##    name  agency_class sponsor_type flag_sponsor_in… flag_sponsor_ty…
##    <fct> <fct>        <fct>        <fct>            <fct>           
##  1 153r… Other        Hospital     0                0               
##  2 1st … Other        Hospital     0                0               
##  3 21st… Other        NA           0                0               
##  4 22EON Industry     NA           1                0               
##  5 23an… Industry     NA           1                0               
##  6 251 … Other        Hospital     0                0               
##  7 2C T… Industry     NA           1                0               
##  8 3-C … Industry     Academic     1                1               
##  9 307 … Other        Hospital     0                0               
## 10 3E T… Industry     NA           1                0               
## # … with 16,880 more rows, and 2 more variables: `"agency_class"` <chr>,
## #   n <int>
count(sponsor_f_lm, "cnt_completed_status")
## # A tibble: 16,890 x 7
## # Groups:   name, agency_class, sponsor_type, flag_sponsor_industry,
## #   flag_sponsor_type_academic [16,890]
##    name  agency_class sponsor_type flag_sponsor_in… flag_sponsor_ty…
##    <fct> <fct>        <fct>        <fct>            <fct>           
##  1 153r… Other        Hospital     0                0               
##  2 1st … Other        Hospital     0                0               
##  3 21st… Other        NA           0                0               
##  4 22EON Industry     NA           1                0               
##  5 23an… Industry     NA           1                0               
##  6 251 … Other        Hospital     0                0               
##  7 2C T… Industry     NA           1                0               
##  8 3-C … Industry     Academic     1                1               
##  9 307 … Other        Hospital     0                0               
## 10 3E T… Industry     NA           1                0               
## # … with 16,880 more rows, and 2 more variables:
## #   `"cnt_completed_status"` <chr>, n <int>
summary(sponsor_f_lm$cnt_completed_status)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    1.000    1.000    1.000    9.918    4.000 2830.000
summary(sponsor_f_lm$cnt_results_submitted)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##   0.0000   0.0000   0.0000   0.7848   0.0000 575.0000
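
The three count() outputs above list one row per sponsor instead of a tally, because the data frame is still grouped and the column name is passed as a quoted string. A minimal sketch of one way to get the tally, assuming dplyr is loaded; sponsor_toy and its values are made up for illustration and are not the real sponsor_f:

```r
library(dplyr)

# Hypothetical toy data standing in for sponsor_f (values are made up)
sponsor_toy <- tibble(
  name = c("A", "B", "C", "D", "E"),
  agency_class = c("Industry", "Other", "Other", "Industry", "Other"),
  cnt_completed_status = c(3, 1, 0, 12, 1)
) %>%
  group_by(name, agency_class)   # mimic the grouping visible in the output above

# Drop the grouping, then pass the column unquoted so count() tallies its values
class_counts <- sponsor_toy %>%
  ungroup() %>%
  count(agency_class)

print(class_counts)
```

The same pattern would apply to cnt_completed_status, giving the number of sponsors at each count of completed studies.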

Histogram: Let’s take a look at the distribution of completed studies.

#Create histogram
plot_hist_1<-ggplot(data = sponsor_f_lm,aes(x=cnt_completed_status))+
  geom_histogram(binwidth = 5)+
  ggtitle("Histogram: Completed Studies")+
  xlab("Completed")+
  ylab("Sponsors Count")+
  #xlim(0,200)+
  #ylim(0,600)+
  theme_bw()
ggplot_hist_1<-ggplotly(plot_hist_1)
ggplot_hist_1

Let’s check whether there is any relationship between our x and y variables.

Scatterplot 1:

#scatterplot
plot_scatter_1<-ggplot(data=sponsor_f_lm, aes(x=cnt_completed_status, y=cnt_results_submitted))+
  geom_point()+
  geom_smooth(method = "lm")+
  ggtitle("Completed Vs Results Posted")+
  xlab("Completed Studies")+
  ylab("Results Posted")+
  theme_bw()
ggplot_scatter_1<-ggplotly(plot_scatter_1)
ggplot_scatter_1

Let’s divide our data into two sets. We will randomly assign 70% of the rows to the training set and the remaining 30% to the test set.

#create linear regression using training and test data

#generate training and test data
set.seed(123)
sampling_index<- sample(x=2, nrow(sponsor_f_lm), replace = TRUE, prob = c(0.7,0.3))

head(sampling_index)
## [1] 1 2 1 2 2 1
class(sponsor_f_lm)
## [1] "grouped_df" "tbl_df"     "tbl"        "data.frame"
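#NOTE: the two subsetting calls below index sponsor_f (28069 rows), not sponsor_f_lm (16890 rows),
#which is why the warnings appear and why the train and test row counts below sum to 28069
train_set<-sponsor_f[sampling_index==1,]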
## Warning: Length of logical index must be 1 or 28069, not 16890
test_set<-sponsor_f[sampling_index==2,]
## Warning: Length of logical index must be 1 or 28069, not 16890
nrow(sponsor_f_lm)
## [1] 16890
nrow(train_set)
## [1] 19739
nrow(test_set)
## [1] 8330
#create regression model
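#NOTE: the predictor used here is cnt_completed_status_lstyr, not the cnt_completed_status column described earlier;
#it is not among the columns selected into sponsor_f_lm and is available only because train_set was taken from sponsor_f
lin_model_1<-lm(formula=cnt_results_submitted ~ cnt_completed_status_lstyr,data=train_set)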
summary(lin_model_1)
## 
## Call:
## lm(formula = cnt_results_submitted ~ cnt_completed_status_lstyr, 
##     data = train_set)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -105.80    0.27    0.27    0.27  448.31 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                -0.26914    0.03751  -7.175 7.49e-13 ***
## cnt_completed_status_lstyr  1.52960    0.01259 121.502  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.195 on 19737 degrees of freedom
## Multiple R-squared:  0.4279, Adjusted R-squared:  0.4279 
## F-statistic: 1.476e+04 on 1 and 19737 DF,  p-value: < 2.2e-16
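
The warnings in the output above arise because an index of length 16,890 is applied to sponsor_f, which has 28,069 rows. The sketch below shows one way the split, fit and test-set evaluation could look when the index and the subsetting use the same data frame. It runs on synthetic data (toy is a made-up stand-in, not sponsor_f_lm), uses base R only, and uses root mean squared error as an assumed evaluation metric, since the original does not name one:

```r
# Synthetic stand-in for sponsor_f_lm (made-up values, for illustration only)
set.seed(123)
n <- 500
toy <- data.frame(cnt_completed_status = rpois(n, lambda = 5) + 1)
toy$cnt_results_submitted <- rpois(n, lambda = 0.3 * toy$cnt_completed_status)

# Draw the index from the same data frame that is being split
sampling_index <- sample(x = 2, size = nrow(toy), replace = TRUE, prob = c(0.7, 0.3))
train_set <- toy[sampling_index == 1, ]
test_set  <- toy[sampling_index == 2, ]

# Fit on the training rows only
lin_model <- lm(cnt_results_submitted ~ cnt_completed_status, data = train_set)

# Predict on the held-out rows and compute the root mean squared error
pred <- predict(lin_model, newdata = test_set)
rmse <- sqrt(mean((test_set$cnt_results_submitted - pred)^2))

# Sanity check: every row lands in exactly one of the two sets
stopifnot(nrow(train_set) + nrow(test_set) == nrow(toy))
rmse
```

With the real data, the same steps would use sponsor_f_lm in place of toy, so that the row counts of the two sets add up to 16,890.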

Scatterplot 2:
