title: "Final Project"
author: "Mikhail Groysman"
date: "July 25, 2018"
output: html_document

# Data: Delay in AIDS Reporting in England and Wales

## Taken from: http://vincentarelbundock.github.io/Rdatasets/csv/datasets/aids.csv

### GitHub Raw file location: https://raw.githubusercontent.com/dvillalobos/CUNY-Bridge/master/aids.csv

# Description: AIDS cases diagnosed in England and Wales from July 1983 to December 1992, with their reporting dates and the time between diagnosis and report. The data are used to estimate delays in reporting AIDS cases to the Communicable Disease Surveillance Centre.


# Data format

## X - integer - counter
## year - integer - year of diagnosis
## quarter - integer - quarter of diagnosis
## delay - integer - delay in months between diagnosis and reporting; longer delays are grouped in 3-month intervals
## dud - integer - censoring indicator (1 = full information is not yet available)
## time - integer - quarter of diagnosis, counted from July 1983 (first quarter = 1)
## y - integer - number of AIDS cases reported

# Source of data

## The data were obtained from De Angelis, D. and Gilks, W.R. (1994) Estimating acquired immune
## deficiency syndrome accounting for reporting delay.
## Journal of the Royal Statistical Society, A, 157, 31-40.

# Question 1

## reading csv file from my desktop

MyData <- read.csv(file="c:/Users/Dell/Desktop/Aids.csv", header=TRUE, sep=",")

head(MyData,n=50)
##     X year quarter delay dud time  y
## 1   1 1983       3     0   0    1  2
## 2   2 1983       3     2   0    1  6
## 3   3 1983       3     5   0    1  0
## 4   4 1983       3     8   0    1  1
## 5   5 1983       3    11   0    1  1
## 6   6 1983       3    14   0    1  0
## 7   7 1983       3    17   0    1  0
## 8   8 1983       3    20   0    1  1
## 9   9 1983       3    23   0    1  0
## 10 10 1983       3    26   0    1  0
## 11 11 1983       3    29   0    1  0
## 12 12 1983       3    32   0    1  0
## 13 13 1983       3    35   0    1  0
## 14 14 1983       3    38   0    1  0
## 15 15 1983       3    41   0    1  1
## 16 16 1983       4     0   0    2  2
## 17 17 1983       4     2   0    2  7
## 18 18 1983       4     5   0    2  1
## 19 19 1983       4     8   0    2  1
## 20 20 1983       4    11   0    2  1
## 21 21 1983       4    14   0    2  0
## 22 22 1983       4    17   0    2  0
## 23 23 1983       4    20   0    2  0
## 24 24 1983       4    23   0    2  0
## 25 25 1983       4    26   0    2  0
## 26 26 1983       4    29   0    2  0
## 27 27 1983       4    32   0    2  0
## 28 28 1983       4    35   0    2  0
## 29 29 1983       4    38   0    2  0
## 30 30 1983       4    41   0    2  0
## 31 31 1984       1     0   0    3  4
## 32 32 1984       1     2   0    3  4
## 33 33 1984       1     5   0    3  0
## 34 34 1984       1     8   0    3  1
## 35 35 1984       1    11   0    3  0
## 36 36 1984       1    14   0    3  2
## 37 37 1984       1    17   0    3  0
## 38 38 1984       1    20   0    3  0
## 39 39 1984       1    23   0    3  0
## 40 40 1984       1    26   0    3  0
## 41 41 1984       1    29   0    3  2
## 42 42 1984       1    32   0    3  1
## 43 43 1984       1    35   0    3  0
## 44 44 1984       1    38   0    3  0
## 45 45 1984       1    41   0    3  0
## 46 46 1984       2     0   0    4  0
## 47 47 1984       2     2   0    4 10
## 48 48 1984       2     5   0    4  0
## 49 49 1984       2     8   0    4  1
## 50 50 1984       2    11   0    4  1
## getting summary of dataframe

summary(MyData)
##        X              year         quarter          delay      
##  Min.   :  1.0   Min.   :1983   Min.   :1.000   Min.   : 0.00  
##  1st Qu.:143.2   1st Qu.:1985   1st Qu.:2.000   1st Qu.: 8.00  
##  Median :285.5   Median :1988   Median :3.000   Median :20.00  
##  Mean   :285.5   Mean   :1988   Mean   :2.553   Mean   :20.07  
##  3rd Qu.:427.8   3rd Qu.:1990   3rd Qu.:4.000   3rd Qu.:32.00  
##  Max.   :570.0   Max.   :1992   Max.   :4.000   Max.   :41.00  
##       dud              time            y         
##  Min.   :0.0000   Min.   : 1.0   Min.   :  0.00  
##  1st Qu.:0.0000   1st Qu.:10.0   1st Qu.:  0.00  
##  Median :0.0000   Median :19.5   Median :  2.00  
##  Mean   :0.1842   Mean   :19.5   Mean   : 11.08  
##  3rd Qu.:0.0000   3rd Qu.:29.0   3rd Qu.:  8.00  
##  Max.   :1.0000   Max.   :38.0   Max.   :181.00
## basic mean and medians of variables

mean(MyData$year)
## [1] 1987.737
mean(MyData$delay)
## [1] 20.06667
median(MyData$dud)
## [1] 0
median(MyData$y)
## [1] 2
## checking data type of variables

typeof(MyData$year)
## [1] "integer"
is.numeric(MyData$quarter)
## [1] TRUE
## not very useful transformation

MyData$yearChar<-as.character(MyData$year)

head(MyData$yearChar,n=50)
##  [1] "1983" "1983" "1983" "1983" "1983" "1983" "1983" "1983" "1983" "1983"
## [11] "1983" "1983" "1983" "1983" "1983" "1983" "1983" "1983" "1983" "1983"
## [21] "1983" "1983" "1983" "1983" "1983" "1983" "1983" "1983" "1983" "1983"
## [31] "1984" "1984" "1984" "1984" "1984" "1984" "1984" "1984" "1984" "1984"
## [41] "1984" "1984" "1984" "1984" "1984" "1984" "1984" "1984" "1984" "1984"
## check dud data type

class(MyData$dud)
## [1] "integer"
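### Some basic conclusions. Records are spread evenly across the full years (1983 covers only two quarters). Quarter values lean slightly toward the second half of the year because the data start in Q3 1983. The mean delay per record (not per case) is 20.07 months, or about 1.7 years; the mean and the median (20) are almost the same. Most records are not censored (about 18% have dud = 1). The mean diagnosis time per record (not per case) is 19.5 quarters, or about 5 years after July 1983, which makes sense as the midpoint of the 38 quarters. Equal mean and median point to a symmetric distribution, not necessarily a normal one; every quarter seems to contribute the same 15 delay categories, so time and delay look evenly spread across records. The number of cases is skewed to the right.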
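The difference between a mean over records and a mean over cases can be made concrete with a small sketch. It uses only the 15 rows for 1983 Q3 shown in the head() output above; the names `delay_months`, `cases`, `record_mean` and `case_mean` are illustrative and not part of the original analysis.

```r
# Delay categories (months) and case counts for 1983 Q3, copied from head(MyData)
delay_months <- c(0, 2, 5, 8, 11, 14, 17, 20, 23, 26, 29, 32, 35, 38, 41)
cases        <- c(2, 6, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1)

# Mean over records: every delay category counts once, whatever its case count
record_mean <- mean(delay_months)

# Mean over cases: each delay category is weighted by its number of cases
case_mean <- weighted.mean(delay_months, w = cases)

# Convert months to years for readability
round(c(record = record_mean, case = case_mean) / 12, 2)
```

The record-level mean is fixed by the delay categories themselves, so the case-weighted mean is the one that says something about actual reporting delay.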


# Question 2

## Creating variable delays in quarters instead of months

MyData$delayQ<-MyData$delay/3

head(MyData$delayQ,n=50)
##  [1]  0.0000000  0.6666667  1.6666667  2.6666667  3.6666667  4.6666667
##  [7]  5.6666667  6.6666667  7.6666667  8.6666667  9.6666667 10.6666667
## [13] 11.6666667 12.6666667 13.6666667  0.0000000  0.6666667  1.6666667
## [19]  2.6666667  3.6666667  4.6666667  5.6666667  6.6666667  7.6666667
## [25]  8.6666667  9.6666667 10.6666667 11.6666667 12.6666667 13.6666667
## [31]  0.0000000  0.6666667  1.6666667  2.6666667  3.6666667  4.6666667
## [37]  5.6666667  6.6666667  7.6666667  8.6666667  9.6666667 10.6666667
## [43] 11.6666667 12.6666667 13.6666667  0.0000000  0.6666667  1.6666667
## [49]  2.6666667  3.6666667
## Creating variable "Censored" FALSE/TRUE

MyData$Censored<-(MyData$dud==1)

head(MyData$Censored,n=40)
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
head(MyData,n=50)
##     X year quarter delay dud time  y yearChar     delayQ Censored
## 1   1 1983       3     0   0    1  2     1983  0.0000000    FALSE
## 2   2 1983       3     2   0    1  6     1983  0.6666667    FALSE
## 3   3 1983       3     5   0    1  0     1983  1.6666667    FALSE
## 4   4 1983       3     8   0    1  1     1983  2.6666667    FALSE
## 5   5 1983       3    11   0    1  1     1983  3.6666667    FALSE
## 6   6 1983       3    14   0    1  0     1983  4.6666667    FALSE
## 7   7 1983       3    17   0    1  0     1983  5.6666667    FALSE
## 8   8 1983       3    20   0    1  1     1983  6.6666667    FALSE
## 9   9 1983       3    23   0    1  0     1983  7.6666667    FALSE
## 10 10 1983       3    26   0    1  0     1983  8.6666667    FALSE
## 11 11 1983       3    29   0    1  0     1983  9.6666667    FALSE
## 12 12 1983       3    32   0    1  0     1983 10.6666667    FALSE
## 13 13 1983       3    35   0    1  0     1983 11.6666667    FALSE
## 14 14 1983       3    38   0    1  0     1983 12.6666667    FALSE
## 15 15 1983       3    41   0    1  1     1983 13.6666667    FALSE
## 16 16 1983       4     0   0    2  2     1983  0.0000000    FALSE
## 17 17 1983       4     2   0    2  7     1983  0.6666667    FALSE
## 18 18 1983       4     5   0    2  1     1983  1.6666667    FALSE
## 19 19 1983       4     8   0    2  1     1983  2.6666667    FALSE
## 20 20 1983       4    11   0    2  1     1983  3.6666667    FALSE
## 21 21 1983       4    14   0    2  0     1983  4.6666667    FALSE
## 22 22 1983       4    17   0    2  0     1983  5.6666667    FALSE
## 23 23 1983       4    20   0    2  0     1983  6.6666667    FALSE
## 24 24 1983       4    23   0    2  0     1983  7.6666667    FALSE
## 25 25 1983       4    26   0    2  0     1983  8.6666667    FALSE
## 26 26 1983       4    29   0    2  0     1983  9.6666667    FALSE
## 27 27 1983       4    32   0    2  0     1983 10.6666667    FALSE
## 28 28 1983       4    35   0    2  0     1983 11.6666667    FALSE
## 29 29 1983       4    38   0    2  0     1983 12.6666667    FALSE
## 30 30 1983       4    41   0    2  0     1983 13.6666667    FALSE
## 31 31 1984       1     0   0    3  4     1984  0.0000000    FALSE
## 32 32 1984       1     2   0    3  4     1984  0.6666667    FALSE
## 33 33 1984       1     5   0    3  0     1984  1.6666667    FALSE
## 34 34 1984       1     8   0    3  1     1984  2.6666667    FALSE
## 35 35 1984       1    11   0    3  0     1984  3.6666667    FALSE
## 36 36 1984       1    14   0    3  2     1984  4.6666667    FALSE
## 37 37 1984       1    17   0    3  0     1984  5.6666667    FALSE
## 38 38 1984       1    20   0    3  0     1984  6.6666667    FALSE
## 39 39 1984       1    23   0    3  0     1984  7.6666667    FALSE
## 40 40 1984       1    26   0    3  0     1984  8.6666667    FALSE
## 41 41 1984       1    29   0    3  2     1984  9.6666667    FALSE
## 42 42 1984       1    32   0    3  1     1984 10.6666667    FALSE
## 43 43 1984       1    35   0    3  0     1984 11.6666667    FALSE
## 44 44 1984       1    38   0    3  0     1984 12.6666667    FALSE
## 45 45 1984       1    41   0    3  0     1984 13.6666667    FALSE
## 46 46 1984       2     0   0    4  0     1984  0.0000000    FALSE
## 47 47 1984       2     2   0    4 10     1984  0.6666667    FALSE
## 48 48 1984       2     5   0    4  0     1984  1.6666667    FALSE
## 49 49 1984       2     8   0    4  1     1984  2.6666667    FALSE
## 50 50 1984       2    11   0    4  1     1984  3.6666667    FALSE
## checking if any year variable is missing

any(is.na(MyData$year))
## [1] FALSE
## subsetting data for year 1984 only; and, separately, for records with at least one AIDS case

MyData1984<-subset(MyData,year==1984)

MyData1984
##     X year quarter delay dud time  y yearChar     delayQ Censored
## 31 31 1984       1     0   0    3  4     1984  0.0000000    FALSE
## 32 32 1984       1     2   0    3  4     1984  0.6666667    FALSE
## 33 33 1984       1     5   0    3  0     1984  1.6666667    FALSE
## 34 34 1984       1     8   0    3  1     1984  2.6666667    FALSE
## 35 35 1984       1    11   0    3  0     1984  3.6666667    FALSE
## 36 36 1984       1    14   0    3  2     1984  4.6666667    FALSE
## 37 37 1984       1    17   0    3  0     1984  5.6666667    FALSE
## 38 38 1984       1    20   0    3  0     1984  6.6666667    FALSE
## 39 39 1984       1    23   0    3  0     1984  7.6666667    FALSE
## 40 40 1984       1    26   0    3  0     1984  8.6666667    FALSE
## 41 41 1984       1    29   0    3  2     1984  9.6666667    FALSE
## 42 42 1984       1    32   0    3  1     1984 10.6666667    FALSE
## 43 43 1984       1    35   0    3  0     1984 11.6666667    FALSE
## 44 44 1984       1    38   0    3  0     1984 12.6666667    FALSE
## 45 45 1984       1    41   0    3  0     1984 13.6666667    FALSE
## 46 46 1984       2     0   0    4  0     1984  0.0000000    FALSE
## 47 47 1984       2     2   0    4 10     1984  0.6666667    FALSE
## 48 48 1984       2     5   0    4  0     1984  1.6666667    FALSE
## 49 49 1984       2     8   0    4  1     1984  2.6666667    FALSE
## 50 50 1984       2    11   0    4  1     1984  3.6666667    FALSE
## 51 51 1984       2    14   0    4  0     1984  4.6666667    FALSE
## 52 52 1984       2    17   0    4  0     1984  5.6666667    FALSE
## 53 53 1984       2    20   0    4  0     1984  6.6666667    FALSE
## 54 54 1984       2    23   0    4  1     1984  7.6666667    FALSE
## 55 55 1984       2    26   0    4  1     1984  8.6666667    FALSE
## 56 56 1984       2    29   0    4  1     1984  9.6666667    FALSE
## 57 57 1984       2    32   0    4  0     1984 10.6666667    FALSE
## 58 58 1984       2    35   0    4  0     1984 11.6666667    FALSE
## 59 59 1984       2    38   0    4  0     1984 12.6666667    FALSE
## 60 60 1984       2    41   0    4  0     1984 13.6666667    FALSE
## 61 61 1984       3     0   0    5  6     1984  0.0000000    FALSE
## 62 62 1984       3     2   0    5 17     1984  0.6666667    FALSE
## 63 63 1984       3     5   0    5  3     1984  1.6666667    FALSE
## 64 64 1984       3     8   0    5  1     1984  2.6666667    FALSE
## 65 65 1984       3    11   0    5  1     1984  3.6666667    FALSE
## 66 66 1984       3    14   0    5  0     1984  4.6666667    FALSE
## 67 67 1984       3    17   0    5  0     1984  5.6666667    FALSE
## 68 68 1984       3    20   0    5  0     1984  6.6666667    FALSE
## 69 69 1984       3    23   0    5  0     1984  7.6666667    FALSE
## 70 70 1984       3    26   0    5  0     1984  8.6666667    FALSE
## 71 71 1984       3    29   0    5  0     1984  9.6666667    FALSE
## 72 72 1984       3    32   0    5  1     1984 10.6666667    FALSE
## 73 73 1984       3    35   0    5  0     1984 11.6666667    FALSE
## 74 74 1984       3    38   0    5  0     1984 12.6666667    FALSE
## 75 75 1984       3    41   0    5  1     1984 13.6666667    FALSE
## 76 76 1984       4     0   0    6  5     1984  0.0000000    FALSE
## 77 77 1984       4     2   0    6 22     1984  0.6666667    FALSE
## 78 78 1984       4     5   0    6  1     1984  1.6666667    FALSE
## 79 79 1984       4     8   0    6  5     1984  2.6666667    FALSE
## 80 80 1984       4    11   0    6  2     1984  3.6666667    FALSE
## 81 81 1984       4    14   0    6  1     1984  4.6666667    FALSE
## 82 82 1984       4    17   0    6  0     1984  5.6666667    FALSE
## 83 83 1984       4    20   0    6  2     1984  6.6666667    FALSE
## 84 84 1984       4    23   0    6  1     1984  7.6666667    FALSE
## 85 85 1984       4    26   0    6  0     1984  8.6666667    FALSE
## 86 86 1984       4    29   0    6  0     1984  9.6666667    FALSE
## 87 87 1984       4    32   0    6  0     1984 10.6666667    FALSE
## 88 88 1984       4    35   0    6  0     1984 11.6666667    FALSE
## 89 89 1984       4    38   0    6  0     1984 12.6666667    FALSE
## 90 90 1984       4    41   0    6  0     1984 13.6666667    FALSE
MyDataNot0<-subset(MyData,y!=0)

head(MyDataNot0,n=25)
##     X year quarter delay dud time  y yearChar     delayQ Censored
## 1   1 1983       3     0   0    1  2     1983  0.0000000    FALSE
## 2   2 1983       3     2   0    1  6     1983  0.6666667    FALSE
## 4   4 1983       3     8   0    1  1     1983  2.6666667    FALSE
## 5   5 1983       3    11   0    1  1     1983  3.6666667    FALSE
## 8   8 1983       3    20   0    1  1     1983  6.6666667    FALSE
## 15 15 1983       3    41   0    1  1     1983 13.6666667    FALSE
## 16 16 1983       4     0   0    2  2     1983  0.0000000    FALSE
## 17 17 1983       4     2   0    2  7     1983  0.6666667    FALSE
## 18 18 1983       4     5   0    2  1     1983  1.6666667    FALSE
## 19 19 1983       4     8   0    2  1     1983  2.6666667    FALSE
## 20 20 1983       4    11   0    2  1     1983  3.6666667    FALSE
## 31 31 1984       1     0   0    3  4     1984  0.0000000    FALSE
## 32 32 1984       1     2   0    3  4     1984  0.6666667    FALSE
## 34 34 1984       1     8   0    3  1     1984  2.6666667    FALSE
## 36 36 1984       1    14   0    3  2     1984  4.6666667    FALSE
## 41 41 1984       1    29   0    3  2     1984  9.6666667    FALSE
## 42 42 1984       1    32   0    3  1     1984 10.6666667    FALSE
## 47 47 1984       2     2   0    4 10     1984  0.6666667    FALSE
## 49 49 1984       2     8   0    4  1     1984  2.6666667    FALSE
## 50 50 1984       2    11   0    4  1     1984  3.6666667    FALSE
## 54 54 1984       2    23   0    4  1     1984  7.6666667    FALSE
## 55 55 1984       2    26   0    4  1     1984  8.6666667    FALSE
## 56 56 1984       2    29   0    4  1     1984  9.6666667    FALSE
## 61 61 1984       3     0   0    5  6     1984  0.0000000    FALSE
## 62 62 1984       3     2   0    5 17     1984  0.6666667    FALSE
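The two subsets above are built separately. If the goal is the 1984 records that also have at least one case, both conditions can go into a single subset() call. A minimal sketch on a few rows copied from the output above; `toy` and `toy1984Not0` are hypothetical names.

```r
# Toy rows in the original column layout (values copied from rows 1, 3, 31, 33, 34)
toy <- data.frame(
  year  = c(1983, 1983, 1984, 1984, 1984),
  delay = c(0, 5, 0, 5, 8),
  y     = c(2, 0, 4, 0, 1)
)

# Both conditions in one call: diagnosed in 1984 AND at least one case reported
toy1984Not0 <- subset(toy, year == 1984 & y != 0)
toy1984Not0
```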
## Assigning names to variables

names(MyData)<-c("RecordCount","DiagYear","DiagQuarter","DelaysMonths","CensorFlag","TimeSpan","PatientCount","YearChar","DelaysQtr","CensorFlag1")

head(MyData,n=25)
##    RecordCount DiagYear DiagQuarter DelaysMonths CensorFlag TimeSpan
## 1            1     1983           3            0          0        1
## 2            2     1983           3            2          0        1
## 3            3     1983           3            5          0        1
## 4            4     1983           3            8          0        1
## 5            5     1983           3           11          0        1
## 6            6     1983           3           14          0        1
## 7            7     1983           3           17          0        1
## 8            8     1983           3           20          0        1
## 9            9     1983           3           23          0        1
## 10          10     1983           3           26          0        1
## 11          11     1983           3           29          0        1
## 12          12     1983           3           32          0        1
## 13          13     1983           3           35          0        1
## 14          14     1983           3           38          0        1
## 15          15     1983           3           41          0        1
## 16          16     1983           4            0          0        2
## 17          17     1983           4            2          0        2
## 18          18     1983           4            5          0        2
## 19          19     1983           4            8          0        2
## 20          20     1983           4           11          0        2
## 21          21     1983           4           14          0        2
## 22          22     1983           4           17          0        2
## 23          23     1983           4           20          0        2
## 24          24     1983           4           23          0        2
## 25          25     1983           4           26          0        2
##    PatientCount YearChar  DelaysQtr CensorFlag1
## 1             2     1983  0.0000000       FALSE
## 2             6     1983  0.6666667       FALSE
## 3             0     1983  1.6666667       FALSE
## 4             1     1983  2.6666667       FALSE
## 5             1     1983  3.6666667       FALSE
## 6             0     1983  4.6666667       FALSE
## 7             0     1983  5.6666667       FALSE
## 8             1     1983  6.6666667       FALSE
## 9             0     1983  7.6666667       FALSE
## 10            0     1983  8.6666667       FALSE
## 11            0     1983  9.6666667       FALSE
## 12            0     1983 10.6666667       FALSE
## 13            0     1983 11.6666667       FALSE
## 14            0     1983 12.6666667       FALSE
## 15            1     1983 13.6666667       FALSE
## 16            2     1983  0.0000000       FALSE
## 17            7     1983  0.6666667       FALSE
## 18            1     1983  1.6666667       FALSE
## 19            1     1983  2.6666667       FALSE
## 20            1     1983  3.6666667       FALSE
## 21            0     1983  4.6666667       FALSE
## 22            0     1983  5.6666667       FALSE
## 23            0     1983  6.6666667       FALSE
## 24            0     1983  7.6666667       FALSE
## 25            0     1983  8.6666667       FALSE
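Assigning all names at once, as above, depends on the column order. A sketch of an alternative that renames by old name instead, so a reordered or extended data frame does not get mislabelled; `d` and `new_names` are illustrative names, and only three of the columns are shown.

```r
# Small stand-in for MyData with the original column names
d <- data.frame(X = 1:2, year = c(1983L, 1983L), y = c(2L, 6L))

# Mapping from old name to new name
new_names <- c(X = "RecordCount", year = "DiagYear", y = "PatientCount")

# Look up where each old name sits and replace only those positions
names(d)[match(names(new_names), names(d))] <- new_names
names(d)
```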
# Question 3

## Basic Histograms

library(ggplot2)

hist(MyData$DiagYear,main="Diag Year Distribution",xlab="Diag Year")

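## The first bar is taller, but this looks like a binning artifact: with default breaks, hist() appears to put 1983 and 1984 into the same bin; 1983 itself has fewer records, since it covers only two quarters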
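How many records each year really has can be checked by counting them directly and by giving hist() one bin per year. The sketch below rebuilds the year column from the structure visible in the output (570 records, 15 per quarter, 1983 starting in Q3); that reconstruction is an assumption, and with the real data `MyData$DiagYear` would be used instead.

```r
# Reconstructed year column: two quarters in 1983, four in each later year,
# 15 records per quarter (assumption based on the printed output)
DiagYear <- c(rep(1983, 30), rep(1984:1992, each = 60))

# Exact record counts per year
table(DiagYear)

# One bin per year: breaks at half-years keep each year in its own bin
h <- hist(DiagYear, breaks = seq(1982.5, 1992.5, by = 1), plot = FALSE)
h$counts
```

With integer-valued data, breaks placed between the integers avoid two neighbouring values falling into one bin.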

## creating new variable YearQtr

MyData$YearQtr<-MyData$DiagYear*100+MyData$DiagQuarter

head(MyData,n=10)
##    RecordCount DiagYear DiagQuarter DelaysMonths CensorFlag TimeSpan
## 1            1     1983           3            0          0        1
## 2            2     1983           3            2          0        1
## 3            3     1983           3            5          0        1
## 4            4     1983           3            8          0        1
## 5            5     1983           3           11          0        1
## 6            6     1983           3           14          0        1
## 7            7     1983           3           17          0        1
## 8            8     1983           3           20          0        1
## 9            9     1983           3           23          0        1
## 10          10     1983           3           26          0        1
##    PatientCount YearChar DelaysQtr CensorFlag1 YearQtr
## 1             2     1983 0.0000000       FALSE  198303
## 2             6     1983 0.6666667       FALSE  198303
## 3             0     1983 1.6666667       FALSE  198303
## 4             1     1983 2.6666667       FALSE  198303
## 5             1     1983 3.6666667       FALSE  198303
## 6             0     1983 4.6666667       FALSE  198303
## 7             0     1983 5.6666667       FALSE  198303
## 8             1     1983 6.6666667       FALSE  198303
## 9             0     1983 7.6666667       FALSE  198303
## 10            0     1983 8.6666667       FALSE  198303
hist(MyData$YearQtr,main="Diag Year/Qtr Distribution",xlab="Diag Year/Qtr")

plot(DiagYear~CensorFlag,data=MyData)

## we can see that data up to 1989 is complete
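The same check can be done numerically by computing the share of censored records per diagnosis year. The values below are made up purely to show the mechanics and say nothing about the real data; `toy` and `censored_share` are hypothetical names.

```r
# Made-up toy values, for illustration only (not the real data)
toy <- data.frame(
  DiagYear   = c(1988, 1988, 1989, 1989, 1990, 1990),
  CensorFlag = c(0, 0, 0, 1, 1, 1)
)

# Share of censored records per diagnosis year (0 means the year is complete)
censored_share <- tapply(toy$CensorFlag, toy$DiagYear, mean)
censored_share

# First year with any censored record
first_censored <- min(toy$DiagYear[toy$CensorFlag == 1])
first_censored
```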

boxplot(MyData$PatientCount)

# Question 4
## doing some basic summarization by year, to look for anomalies

aggregate(PatientCount~DiagYear,MyData,sum)
##    DiagYear PatientCount
## 1      1983           24
## 2      1984           98
## 3      1985          215
## 4      1986          431
## 5      1987          601
## 6      1988          814
## 7      1989          929
## 8      1990         1071
## 9      1991         1098
## 10     1992         1034
## we see that the number of patients grows up to 1990 and then stabilizes, which is consistent with the AIDS epidemic; counts for the most recent years are probably incomplete because of reporting delays

## creating new variable DelayPatient

MyData$DelayPatient<-MyData$DelaysMonths*MyData$PatientCount

MySum<-aggregate(cbind(DelayPatient,PatientCount)~DiagYear,MyData,sum)

MySum
##    DiagYear DelayPatient PatientCount
## 1      1983          130           24
## 2      1984          580           98
## 3      1985         1710          215
## 4      1986         3422          431
## 5      1987         3941          601
## 6      1988         5643          814
## 7      1989         5958          929
## 8      1990         5034         1071
## 9      1991         3852         1098
## 10     1992         1990         1034
## create average

MySum$Ave<-MySum$DelayPatient/MySum$PatientCount

MySum
##    DiagYear DelayPatient PatientCount      Ave
## 1      1983          130           24 5.416667
## 2      1984          580           98 5.918367
## 3      1985         1710          215 7.953488
## 4      1986         3422          431 7.939675
## 5      1987         3941          601 6.557404
## 6      1988         5643          814 6.932432
## 7      1989         5958          929 6.413348
## 8      1990         5034         1071 4.700280
## 9      1991         3852         1098 3.508197
## 10     1992         1990         1034 1.924565
### Average delays (Ave, in months) increased in 1985/1986, possibly because the system was overwhelmed by the number of new cases; after that we see a drop in delays. However, we should take the 1990-1992 figures with a grain of salt, since the data are apparently incomplete.

## The next step could be to do the same by quarter, to see if there is any seasonality, or by year/quarter combination, to see if more data points give a better picture of what takes place; at this point it makes sense to create a function that performs these steps

MyAve<-function(MyVar)
{
  MySum<-aggregate(cbind(DelayPatient,PatientCount)~MyVar,MyData,sum)

  ## create average

  MySum$Ave<-MySum$DelayPatient/MySum$PatientCount

  MySum
}

MyAve(MyData$DiagQuarter)
##   MyVar DelayPatient PatientCount      Ave
## 1     1         8705         1537 5.663630
## 2     2         8197         1572 5.214377
## 3     3         8109         1683 4.818182
## 4     4         7249         1523 4.759685
### It appears that the first half of the year has longer average delays than the second half (5.66 and 5.21 months versus 4.82 and 4.76); the difference is not very big, so we probably should not jump to conclusions
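MyAve() reads the global MyData and labels the grouping column "MyVar" in its output. A possible variant, sketched under the assumption that passing the data frame and the column name as a string is acceptable, keeps the real column name in the result. `avg_delay_by` and `toy` are hypothetical names; the toy rows are chosen so that their sums match the 1983 Q3 and Q4 totals shown in the YearQtr output (92 and 38 delay-months, 12 cases each).

```r
# Hypothetical helper: case-weighted average delay by a grouping column given by name
avg_delay_by <- function(data, group_col) {
  # Build the formula cbind(DelayPatient, PatientCount) ~ <group_col>
  f <- as.formula(paste("cbind(DelayPatient, PatientCount) ~", group_col))
  out <- aggregate(f, data = data, FUN = sum)
  # Average delay per case within each group
  out$Ave <- out$DelayPatient / out$PatientCount
  out
}

# Toy rows whose group sums equal the 1983 Q3 and Q4 totals
toy <- data.frame(
  DiagQuarter  = c(3, 3, 4, 4),
  DelayPatient = c(12, 80, 14, 24),
  PatientCount = c(6, 6, 7, 5)
)

res <- avg_delay_by(toy, "DiagQuarter")
res
```

With the real data the calls would look like `avg_delay_by(MyData, "DiagQuarter")` or `avg_delay_by(MyData, "YearQtr")`.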

MyAve(MyData$YearQtr)
##     MyVar DelayPatient PatientCount       Ave
## 1  198303           92           12 7.6666667
## 2  198304           38           12 3.1666667
## 3  198401          134           14 9.5714286
## 4  198402          117           15 7.8000000
## 5  198403          141           30 4.7000000
## 6  198404          188           39 4.8205128
## 7  198501          350           47 7.4468085
## 8  198502          280           40 7.0000000
## 9  198503          552           63 8.7619048
## 10 198504          528           65 8.1230769
## 11 198601          763           82 9.3048780
## 12 198602         1057          120 8.8083333
## 13 198603          909          109 8.3394495
## 14 198604          693          120 5.7750000
## 15 198701         1072          134 8.0000000
## 16 198702          908          141 6.4397163
## 17 198703          789          153 5.1568627
## 18 198704         1172          173 6.7745665
## 19 198801         1291          174 7.4195402
## 20 198802         1495          211 7.0853081
## 21 198803         1463          224 6.5312500
## 22 198804         1394          205 6.8000000
## 23 198901         1443          224 6.4419643
## 24 198902         1641          219 7.4931507
## 25 198903         1605          253 6.3438735
## 26 198904         1269          233 5.4463519
## 27 199001         1670          281 5.9430605
## 28 199002         1125          245 4.5918367
## 29 199003         1124          260 4.3230769
## 30 199004         1115          285 3.9122807
## 31 199101         1165          271 4.2988930
## 32 199102          975          263 3.7072243
## 33 199103          992          306 3.2418301
## 34 199104          720          258 2.7906977
## 35 199201          817          310 2.6354839
## 36 199202          599          318 1.8836478
## 37 199203          442          273 1.6190476
## 38 199204          132          133 0.9924812
### The result confirms that the worst delays were from the 3rd quarter of '85 to the 1st quarter of '87; the 1st Q of '84 and the 4th Q of '86 appear to be outliers; from the 4th Q of '89 onward, the incompleteness of the data makes conclusions difficult

# Question 5; reading csv file from GitHub

library(RCurl)
## Loading required package: bitops
x <- getURL("https://raw.githubusercontent.com/mgroysman/CSV-File/master/aids.csv")

y <- read.csv(text = x)

head(y,n=10)
##     X year quarter delay dud time y
## 1   1 1983       3     0   0    1 2
## 2   2 1983       3     2   0    1 6
## 3   3 1983       3     5   0    1 0
## 4   4 1983       3     8   0    1 1
## 5   5 1983       3    11   0    1 1
## 6   6 1983       3    14   0    1 0
## 7   7 1983       3    17   0    1 0
## 8   8 1983       3    20   0    1 1
## 9   9 1983       3    23   0    1 0
## 10 10 1983       3    26   0    1 0