theURL <- "https://raw.github.com/vincentarelbundock/Rdatasets/master/csv/quantreg/uis.csv"
uis <- read.csv(theURL)
summary(uis)
## X ID AGE BECK
## Min. : 1.0 Min. : 1.0 Min. :20.00 Min. : 0.00
## 1st Qu.:144.5 1st Qu.:164.5 1st Qu.:27.00 1st Qu.:10.00
## Median :288.0 Median :323.0 Median :32.00 Median :17.00
## Mean :288.0 Mean :319.2 Mean :32.38 Mean :17.37
## 3rd Qu.:431.5 3rd Qu.:475.5 3rd Qu.:37.00 3rd Qu.:23.00
## Max. :575.0 Max. :628.0 Max. :56.00 Max. :54.00
## HC IV NDT RACE
## Min. :1.000 Min. :1.000 Min. : 0.000 Min. :0.0000
## 1st Qu.:2.000 1st Qu.:1.000 1st Qu.: 1.000 1st Qu.:0.0000
## Median :3.000 Median :2.000 Median : 3.000 Median :0.0000
## Mean :2.786 Mean :2.035 Mean : 4.543 Mean :0.2522
## 3rd Qu.:4.000 3rd Qu.:3.000 3rd Qu.: 6.000 3rd Qu.:1.0000
## Max. :4.000 Max. :3.000 Max. :40.000 Max. :1.0000
## TREAT SITE LEN.T TIME
## Min. :0.0000 Min. :0.0000 Min. : 3.0 Min. : 4.0
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.: 45.0 1st Qu.: 84.5
## Median :0.0000 Median :0.0000 Median : 87.0 Median : 170.0
## Mean :0.4974 Mean :0.3043 Mean :100.8 Mean : 241.6
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:150.5 3rd Qu.: 375.5
## Max. :1.0000 Max. :1.0000 Max. :400.0 Max. :1172.0
## CENSOR Y ND1 ND2
## Min. :0.000 Min. :1.386 Min. : 0.2439 Min. :-23.0259
## 1st Qu.:1.000 1st Qu.:4.437 1st Qu.: 1.4286 1st Qu.: -8.0472
## Median :1.000 Median :5.136 Median : 2.5000 Median : -2.2907
## Mean :0.807 Mean :5.043 Mean : 3.5507 Mean : -5.5486
## 3rd Qu.:1.000 3rd Qu.:5.928 3rd Qu.: 5.0000 3rd Qu.: -0.5095
## Max. :1.000 Max. :7.066 Max. :10.0000 Max. : 0.3679
## LNDT FRAC IV3
## Min. :0.0000 Min. :0.02222 Min. :0.0000
## 1st Qu.:0.6931 1st Qu.:0.33333 1st Qu.:0.0000
## Median :1.3863 Median :0.81111 Median :0.0000
## Mean :1.3597 Mean :0.78541 Mean :0.4226
## 3rd Qu.:1.9459 3rd Qu.:1.00000 3rd Qu.:1.0000
## Max. :3.7136 Max. :2.43333 Max. :1.0000
I am exploring the UIS Drug Treatment study dataset. Participant ages at enrollment ranged from 20 to 56 years, with a mean of 32.38. Most participants were white: the mean of the RACE indicator is 0.2522, so about 25% were non-white. As measured by the CENSOR variable (mean 0.807), about 81% of participants either returned to drugs or could not be reached for follow-up.
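The shares behind statements like these can be read directly off the 0/1 indicator columns. Below is a minimal sketch on a small made-up data frame (`toy` is hypothetical and its values are invented, it is not the study data). It assumes the coding used in the summary above: RACE 0 = white, and CENSOR 1 = returned to drugs or lost to follow-up.

```r
# Hypothetical stand-in for uis; the values are invented for illustration.
toy <- data.frame(
  RACE   = c(0, 0, 0, 1, 0, 1, 0, 0),
  CENSOR = c(1, 1, 0, 1, 1, 1, 0, 1)
)

# Share of each category in the two indicator columns.
race_share   <- prop.table(table(toy$RACE))
censor_share <- prop.table(table(toy$CENSOR))

race_share
censor_share

# For a 0/1 column, the share of 1s is the same number as the column mean.
mean(toy$CENSOR)
```

This is why the means reported by `summary(uis)` for RACE (0.2522) and CENSOR (0.807) can be read as proportions.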
newUIS <- subset(uis, select = -c(ND1,ND2,LNDT,FRAC,Y))
colnames(newUIS)[colnames(newUIS)=="CENSOR"] <- "Program Success"
colnames(newUIS)[colnames(newUIS)=="LEN.T"] <- "TreatmentDays"
head(newUIS)
## X ID AGE BECK HC IV NDT RACE TREAT SITE TreatmentDays TIME
## 1 1 1 39 9.00 4 3 1 0 1 0 123 188
## 2 2 2 33 34.00 4 2 8 0 1 0 25 26
## 3 3 3 33 10.00 2 3 3 0 1 0 7 207
## 4 4 4 32 20.00 4 3 1 0 0 0 66 144
## 5 5 5 24 5.00 2 1 5 1 1 0 173 551
## 6 6 6 30 32.55 3 3 1 0 1 0 16 32
## Program Success IV3
## 1 1 1
## 2 1 0
## 3 1 1
## 4 1 1
## 5 0 0
## 6 1 1
require(ggplot2)
## Loading required package: ggplot2
## Warning in library(package, lib.loc = lib.loc, character.only = TRUE,
## logical.return = TRUE, : there is no package called 'ggplot2'
hist(newUIS$AGE, main = "Age Histogram", xlab = "Age")
plot(TreatmentDays ~ AGE, data = newUIS, main = "Treatment Days Plotted Against Age")
abline(lm(TreatmentDays ~ AGE, data = newUIS))
boxplot(newUIS$TreatmentDays)
summary(lm(newUIS$TreatmentDays ~ newUIS$AGE))
##
## Call:
## lm(formula = newUIS$TreatmentDays ~ newUIS$AGE)
##
## Residuals:
## Min 1Q Median 3Q Max
## -106.20 -54.11 -15.71 51.63 295.41
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 66.4195 17.0421 3.897 0.000109 ***
## newUIS$AGE 1.0604 0.5169 2.051 0.040693 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 76.7 on 573 degrees of freedom
## Multiple R-squared: 0.00729, Adjusted R-squared: 0.005557
## F-statistic: 4.208 on 1 and 573 DF, p-value: 0.04069
Question: Is there a correlation between the number of treatment days a client undergoes and their age?
Plotting a regression line on the scatterplot of treatment days by age from step 3 suggests a positive relationship between age and the number of treatment days. The regression summary gives a coefficient for age of 1.06, i.e. about one additional treatment day per year of age, and the relationship is statistically significant at the 5% level (t-test p-value 0.041 < 0.05). The relationship is weak, however: the R-squared is 0.0073, so age explains less than 1% of the variation in treatment days.
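With a single predictor, the t-test on the regression slope and the Pearson correlation test are the same test, and R-squared is the squared correlation coefficient. Below is a minimal sketch of that equivalence on simulated data. The variables `age` and `days`, the seed, and the coefficients are all invented for illustration and are not taken from the study.

```r
set.seed(1)                                  # arbitrary seed, for reproducibility only
age  <- round(runif(200, min = 20, max = 56))
days <- 60 + 1 * age + rnorm(200, sd = 75)   # weak linear signal, large noise

fit <- summary(lm(days ~ age))
ct  <- cor.test(age, days)                   # Pearson correlation test

# The slope's p-value equals the correlation test's p-value ...
p_slope <- fit$coefficients["age", "Pr(>|t|)"]
all.equal(p_slope, ct$p.value)

# ... and R-squared equals the squared correlation coefficient.
all.equal(fit$r.squared, unname(ct$estimate)^2)
```

Applied to the output above, an R-squared of 0.00729 corresponds to a correlation of about 0.085 between age and treatment days: positive and statistically significant, but small.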
This exercise reads the CSV file from a link.