1. Data Exploration: This should include summary statistics, means, medians, quartiles, or any other relevant information about the data set. Please include some conclusions in the R Markdown text.
theURL <- "https://raw.github.com/vincentarelbundock/Rdatasets/master/csv/quantreg/uis.csv"
uis <- read.table(file = theURL, header = TRUE, sep = ",")
summary(uis)
##        X               ID             AGE             BECK      
##  Min.   :  1.0   Min.   :  1.0   Min.   :20.00   Min.   : 0.00  
##  1st Qu.:144.5   1st Qu.:164.5   1st Qu.:27.00   1st Qu.:10.00  
##  Median :288.0   Median :323.0   Median :32.00   Median :17.00  
##  Mean   :288.0   Mean   :319.2   Mean   :32.38   Mean   :17.37  
##  3rd Qu.:431.5   3rd Qu.:475.5   3rd Qu.:37.00   3rd Qu.:23.00  
##  Max.   :575.0   Max.   :628.0   Max.   :56.00   Max.   :54.00  
##        HC              IV             NDT              RACE       
##  Min.   :1.000   Min.   :1.000   Min.   : 0.000   Min.   :0.0000  
##  1st Qu.:2.000   1st Qu.:1.000   1st Qu.: 1.000   1st Qu.:0.0000  
##  Median :3.000   Median :2.000   Median : 3.000   Median :0.0000  
##  Mean   :2.786   Mean   :2.035   Mean   : 4.543   Mean   :0.2522  
##  3rd Qu.:4.000   3rd Qu.:3.000   3rd Qu.: 6.000   3rd Qu.:1.0000  
##  Max.   :4.000   Max.   :3.000   Max.   :40.000   Max.   :1.0000  
##      TREAT             SITE            LEN.T            TIME       
##  Min.   :0.0000   Min.   :0.0000   Min.   :  3.0   Min.   :   4.0  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.: 45.0   1st Qu.:  84.5  
##  Median :0.0000   Median :0.0000   Median : 87.0   Median : 170.0  
##  Mean   :0.4974   Mean   :0.3043   Mean   :100.8   Mean   : 241.6  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:150.5   3rd Qu.: 375.5  
##  Max.   :1.0000   Max.   :1.0000   Max.   :400.0   Max.   :1172.0  
##      CENSOR            Y              ND1               ND2          
##  Min.   :0.000   Min.   :1.386   Min.   : 0.2439   Min.   :-23.0259  
##  1st Qu.:1.000   1st Qu.:4.437   1st Qu.: 1.4286   1st Qu.: -8.0472  
##  Median :1.000   Median :5.136   Median : 2.5000   Median : -2.2907  
##  Mean   :0.807   Mean   :5.043   Mean   : 3.5507   Mean   : -5.5486  
##  3rd Qu.:1.000   3rd Qu.:5.928   3rd Qu.: 5.0000   3rd Qu.: -0.5095  
##  Max.   :1.000   Max.   :7.066   Max.   :10.0000   Max.   :  0.3679  
##       LNDT             FRAC              IV3        
##  Min.   :0.0000   Min.   :0.02222   Min.   :0.0000  
##  1st Qu.:0.6931   1st Qu.:0.33333   1st Qu.:0.0000  
##  Median :1.3863   Median :0.81111   Median :0.0000  
##  Mean   :1.3597   Mean   :0.78541   Mean   :0.4226  
##  3rd Qu.:1.9459   3rd Qu.:1.00000   3rd Qu.:1.0000  
##  Max.   :3.7136   Max.   :2.43333   Max.   :1.0000

I am exploring the UIS Drug Treatment study dataset, which contains 575 participants. Ages at enrollment ranged from 20 to 56 years, with a mean of 32.38 and a median of 32. Most participants were white: RACE is a 0/1 indicator with a mean of 0.25, so only about 25% are coded as non-white. The CENSOR indicator has a mean of 0.807, so about 81% of participants either returned to drug use or could not be reached for follow-up.
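
Because RACE and CENSOR are 0/1 indicators, their means in the summary can be read as proportions. The sketch below illustrates that reading on six rows of this dataset only (the first six records, which also appear in the `head()` output in the data wrangling step), so its shares will not match the full-data figures. The name `sample_rows` is an assumption made for illustration.

```r
# First six rows of the dataset, typed in by hand; illustrative only.
sample_rows <- data.frame(
  AGE    = c(39, 33, 33, 32, 24, 30),
  RACE   = c(0, 0, 0, 0, 1, 0),
  CENSOR = c(1, 1, 1, 1, 0, 1)
)

# The mean of a 0/1 indicator equals the share of rows coded 1.
race_share   <- mean(sample_rows$RACE)
censor_share <- mean(sample_rows$CENSOR)

# prop.table(table(...)) reports the share of every level at once.
race_table <- prop.table(table(sample_rows$RACE))

print(race_share)
print(censor_share)
print(race_table)
```

On the full data the same calls would be applied to `uis$RACE` and `uis$CENSOR`.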

2. Data wrangling: Please perform some basic transformations. They will need to make sense but could include column renaming, creating a subset of the data, replacing values, or creating new columns with derived data (for example – if it makes sense you could sum two columns together).
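# Drop columns that are not used in this analysis
newUIS <- subset(uis, select = -c(ND1,ND2,LNDT,FRAC,Y))
# Rename columns. Note: CENSOR = 1 marks a return to drugs or loss to follow-up,
# so a 1 in the renamed column does not indicate success. A name containing a
# space must also be wrapped in backticks when it is used later.
colnames(newUIS)[colnames(newUIS)=="CENSOR"] <- "Program Success"
colnames(newUIS)[colnames(newUIS)=="LEN.T"] <- "TreatmentDays"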
head(newUIS)
##   X ID AGE  BECK HC IV NDT RACE TREAT SITE TreatmentDays TIME
## 1 1  1  39  9.00  4  3   1    0     1    0           123  188
## 2 2  2  33 34.00  4  2   8    0     1    0            25   26
## 3 3  3  33 10.00  2  3   3    0     1    0             7  207
## 4 4  4  32 20.00  4  3   1    0     0    0            66  144
## 5 5  5  24  5.00  2  1   5    1     1    0           173  551
## 6 6  6  30 32.55  3  3   1    0     1    0            16   32
##   Program Success IV3
## 1               1   1
## 2               1   0
## 3               1   1
## 4               1   1
## 5               0   0
## 6               1   1
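
The assignment also mentions derived columns and subsets, which the code above does not show. Below is a minimal sketch of those two transformations on the six rows printed by `head()` above. The names `sample_rows`, `RelapsedOrLost`, `TreatmentWeeks`, and `long_stay`, and the 60-day cutoff, are assumptions chosen for illustration and are not part of the original analysis.

```r
# Six rows copied from the head() output above; illustrative only.
sample_rows <- data.frame(
  ID     = 1:6,
  LEN.T  = c(123, 25, 7, 66, 173, 16),
  TIME   = c(188, 26, 207, 144, 551, 32),
  CENSOR = c(1, 1, 1, 1, 0, 1)
)

# Rename with syntactic names (no spaces) so columns work with $ and formulas.
names(sample_rows)[names(sample_rows) == "LEN.T"]  <- "TreatmentDays"
names(sample_rows)[names(sample_rows) == "CENSOR"] <- "RelapsedOrLost"  # hypothetical name

# Derived column: treatment length expressed in weeks.
sample_rows$TreatmentWeeks <- sample_rows$TreatmentDays / 7

# Subset: keep only participants with more than 60 days of treatment
# (the 60-day cutoff is arbitrary and only for demonstration).
long_stay <- subset(sample_rows, TreatmentDays > 60)

print(sample_rows)
print(long_stay)
```

Syntactic names avoid the backtick quoting that a name such as "Program Success" requires.
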
3. Graphics: Please make sure to display at least one scatter plot, box plot and histogram. Don’t be limited to this. Please explore the many other options in R packages such as ggplot2.
require(ggplot2)
## Loading required package: ggplot2
## Warning in library(package, lib.loc = lib.loc, character.only = TRUE,
## logical.return = TRUE, : there is no package called 'ggplot2'
hist(newUIS$AGE, main = "Age Histogram", xlab = "Age")

plot(TreatmentDays ~ AGE, data = newUIS, main = "Treatment Days Plotted Against Age")
abline(lm(newUIS$TreatmentDays ~ newUIS$AGE))

boxplot(newUIS$TreatmentDays, main = "Treatment Days Box Plot", ylab = "Days in Treatment")
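
The `require(ggplot2)` call above reported that the package was not installed, so all three plots were drawn with base graphics. Below is a sketch of ggplot2 equivalents, assuming the package is available. It is guarded with `requireNamespace()` so it degrades to a message otherwise, and it uses six rows from the `head()` output in step 2 (named `sample_rows` here, an illustrative assumption) rather than the full data.

```r
# Six rows copied from the head() output in step 2; illustrative only.
sample_rows <- data.frame(
  AGE           = c(39, 33, 33, 32, 24, 30),
  TreatmentDays = c(123, 25, 7, 66, 173, 16)
)

plots <- list()
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)

  # Histogram of age.
  plots$histogram <- ggplot(sample_rows, aes(x = AGE)) +
    geom_histogram(binwidth = 5) +
    labs(title = "Age Histogram", x = "Age")

  # Scatter plot with a fitted least-squares line.
  plots$scatter <- ggplot(sample_rows, aes(x = AGE, y = TreatmentDays)) +
    geom_point() +
    geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
    labs(title = "Treatment Days Plotted Against Age")

  # Box plot of treatment days.
  plots$boxplot <- ggplot(sample_rows, aes(y = TreatmentDays)) +
    geom_boxplot() +
    labs(title = "Treatment Days")
} else {
  message("ggplot2 is not installed; run install.packages(\"ggplot2\") first.")
}
```

Each plot object is drawn by printing it, for example `print(plots$scatter)`. Passing `newUIS` in place of `sample_rows` would plot the full data.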

summary(lm(newUIS$TreatmentDays ~ newUIS$AGE))
## 
## Call:
## lm(formula = newUIS$TreatmentDays ~ newUIS$AGE)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -106.20  -54.11  -15.71   51.63  295.41 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  66.4195    17.0421   3.897 0.000109 ***
## newUIS$AGE    1.0604     0.5169   2.051 0.040693 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 76.7 on 573 degrees of freedom
## Multiple R-squared:  0.00729,    Adjusted R-squared:  0.005557 
## F-statistic: 4.208 on 1 and 573 DF,  p-value: 0.04069
4. Meaningful question for analysis: Please state at the beginning a meaningful question for analysis. Use the first three steps and anything else that would be helpful to answer the question you are posing from the data set you chose. Please write a brief conclusion paragraph in R markdown at the end.

Question: Is there a correlation between the number of treatment days a client will undergo and their age?

The regression line drawn on the scatter plot of treatment days against age in step 3 suggests a positive association between age and the number of treatment days. The regression summary gives a slope of 1.06 for age, meaning each additional year of age is associated with about one more day of treatment, and the t-test on this slope is significant at the 5% level (p = 0.041 < 0.05). The relationship is weak, however: R-squared is 0.007, so age explains less than 1% of the variation in treatment days.
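
Since the question is phrased in terms of correlation, it may help to see how the regression output relates to it. In a simple linear regression, the t-test on the slope is equivalent to the Pearson correlation test, and R-squared equals the squared correlation. For the full data, R-squared = 0.00729 therefore corresponds to r of roughly 0.085. The sketch below checks both identities on six rows from the `head()` output in step 2 (`sample_rows` is an illustrative name, and the six-row values differ from the full-data results).

```r
# Six rows copied from the head() output in step 2; illustrative only.
sample_rows <- data.frame(
  AGE           = c(39, 33, 33, 32, 24, 30),
  TreatmentDays = c(123, 25, 7, 66, 173, 16)
)

# Fit with the data argument so the coefficient is named "AGE".
fit         <- lm(TreatmentDays ~ AGE, data = sample_rows)
fit_summary <- summary(fit)

# Pearson correlation test on the same two variables.
ct <- cor.test(sample_rows$AGE, sample_rows$TreatmentDays)

slope_p   <- coef(fit_summary)["AGE", "Pr(>|t|)"]  # p-value of the slope t-test
r_squared <- fit_summary$r.squared
r         <- unname(ct$estimate)                   # Pearson correlation

# Both comparisons should report TRUE.
print(all.equal(slope_p, ct$p.value))
print(all.equal(r^2, r_squared))
```

On the full data, `cor.test(newUIS$AGE, newUIS$TreatmentDays)` would report the correlation directly.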

5. BONUS – Place the original .csv in a GitHub file and have R read from the link. This will be a very useful skill as you progress in your data science education and career.

This exercise reads the CSV file directly from a GitHub link (`theURL` in step 1) rather than from a local copy.
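
Reading from a link can fail when the network or the host is unavailable. Below is a hedged sketch of a wrapper that returns `NULL` instead of stopping the script. The helper name `read_csv_source` is hypothetical, and the demonstration uses a temporary local file as a stand-in so that it runs offline.

```r
# Hypothetical helper: read a CSV from a URL or local path, returning NULL on failure.
read_csv_source <- function(src) {
  tryCatch(
    read.csv(src, header = TRUE),
    error = function(e) {
      message("Could not read ", src, ": ", conditionMessage(e))
      NULL
    }
  )
}

# Offline demonstration: a temporary file stands in for the remote CSV.
tmp <- tempfile(fileext = ".csv")
writeLines(c("ID,AGE,LEN.T", "1,39,123", "2,33,25"), tmp)
demo <- read_csv_source(tmp)
print(demo)

# With network access, the same call works on the link used in step 1:
# uis <- read_csv_source("https://raw.github.com/vincentarelbundock/Rdatasets/master/csv/quantreg/uis.csv")
```

`read.csv()` is `read.table()` with `header = TRUE` and `sep = ","` as defaults, so it should read this file the same way as the call in step 1.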