title: "Final Project"
author: "Mikhail Groysman"
date: "July 25, 2018"
output: html_document
# Data: Delay in AIDS Reporting in England and Wales
## Taken from: http://vincentarelbundock.github.io/Rdatasets/csv/datasets/aids.csv
### GitHub Raw file location: https://raw.githubusercontent.com/dvillalobos/CUNY-Bridge/master/aids.csv
# Description: AIDS cases diagnosed from July 1983 to December 1992 in England and Wales, with their reporting dates and the time between diagnosis and report. The data are used to estimate delays in reporting AIDS cases to the Communicable Disease Surveillance Centre.
# Data format
## X - integer - counter
## year - integer - year of diagnosis
## quarter - integer - quarter of diagnosis
## delay - integer - months of delay between diagnosis and reporting; longer delays are grouped in 3-month intervals
## dud - integer - indicator of censored data (full information is not yet available)
## time - integer - number of quarters between July 1983 and diagnosis date
## y - integer - number of AIDS cases reported
# Source of data
## The data were obtained from De Angelis, D. and Gilks, W.R. (1994) Estimating acquired immune
## deficiency syndrome accounting for reporting delay.
## Journal of the Royal Statistical Society, A, 157, 31-40.
# Question 1
## reading csv file from my desktop
MyData <- read.csv(file="c:/Users/Dell/Desktop/Aids.csv", header=TRUE, sep=",")
head(MyData,n=50)
## X year quarter delay dud time y
## 1 1 1983 3 0 0 1 2
## 2 2 1983 3 2 0 1 6
## 3 3 1983 3 5 0 1 0
## 4 4 1983 3 8 0 1 1
## 5 5 1983 3 11 0 1 1
## 6 6 1983 3 14 0 1 0
## 7 7 1983 3 17 0 1 0
## 8 8 1983 3 20 0 1 1
## 9 9 1983 3 23 0 1 0
## 10 10 1983 3 26 0 1 0
## 11 11 1983 3 29 0 1 0
## 12 12 1983 3 32 0 1 0
## 13 13 1983 3 35 0 1 0
## 14 14 1983 3 38 0 1 0
## 15 15 1983 3 41 0 1 1
## 16 16 1983 4 0 0 2 2
## 17 17 1983 4 2 0 2 7
## 18 18 1983 4 5 0 2 1
## 19 19 1983 4 8 0 2 1
## 20 20 1983 4 11 0 2 1
## 21 21 1983 4 14 0 2 0
## 22 22 1983 4 17 0 2 0
## 23 23 1983 4 20 0 2 0
## 24 24 1983 4 23 0 2 0
## 25 25 1983 4 26 0 2 0
## 26 26 1983 4 29 0 2 0
## 27 27 1983 4 32 0 2 0
## 28 28 1983 4 35 0 2 0
## 29 29 1983 4 38 0 2 0
## 30 30 1983 4 41 0 2 0
## 31 31 1984 1 0 0 3 4
## 32 32 1984 1 2 0 3 4
## 33 33 1984 1 5 0 3 0
## 34 34 1984 1 8 0 3 1
## 35 35 1984 1 11 0 3 0
## 36 36 1984 1 14 0 3 2
## 37 37 1984 1 17 0 3 0
## 38 38 1984 1 20 0 3 0
## 39 39 1984 1 23 0 3 0
## 40 40 1984 1 26 0 3 0
## 41 41 1984 1 29 0 3 2
## 42 42 1984 1 32 0 3 1
## 43 43 1984 1 35 0 3 0
## 44 44 1984 1 38 0 3 0
## 45 45 1984 1 41 0 3 0
## 46 46 1984 2 0 0 4 0
## 47 47 1984 2 2 0 4 10
## 48 48 1984 2 5 0 4 0
## 49 49 1984 2 8 0 4 1
## 50 50 1984 2 11 0 4 1
## getting summary of dataframe
summary(MyData)
## X year quarter delay
## Min. : 1.0 Min. :1983 Min. :1.000 Min. : 0.00
## 1st Qu.:143.2 1st Qu.:1985 1st Qu.:2.000 1st Qu.: 8.00
## Median :285.5 Median :1988 Median :3.000 Median :20.00
## Mean :285.5 Mean :1988 Mean :2.553 Mean :20.07
## 3rd Qu.:427.8 3rd Qu.:1990 3rd Qu.:4.000 3rd Qu.:32.00
## Max. :570.0 Max. :1992 Max. :4.000 Max. :41.00
## dud time y
## Min. :0.0000 Min. : 1.0 Min. : 0.00
## 1st Qu.:0.0000 1st Qu.:10.0 1st Qu.: 0.00
## Median :0.0000 Median :19.5 Median : 2.00
## Mean :0.1842 Mean :19.5 Mean : 11.08
## 3rd Qu.:0.0000 3rd Qu.:29.0 3rd Qu.: 8.00
## Max. :1.0000 Max. :38.0 Max. :181.00
## basic mean and medians of variables
mean(MyData$year)
## [1] 1987.737
mean(MyData$delay)
## [1] 20.06667
median(MyData$dud)
## [1] 0
median(MyData$y)
## [1] 2
## checking data type of variables
typeof(MyData$year)
## [1] "integer"
is.numeric(MyData$quarter)
## [1] TRUE
## not very useful transformation
MyData$yearChar<-as.character(MyData$year)
head(MyData$yearChar,n=50)
## [1] "1983" "1983" "1983" "1983" "1983" "1983" "1983" "1983" "1983" "1983"
## [11] "1983" "1983" "1983" "1983" "1983" "1983" "1983" "1983" "1983" "1983"
## [21] "1983" "1983" "1983" "1983" "1983" "1983" "1983" "1983" "1983" "1983"
## [31] "1984" "1984" "1984" "1984" "1984" "1984" "1984" "1984" "1984" "1984"
## [41] "1984" "1984" "1984" "1984" "1984" "1984" "1984" "1984" "1984" "1984"
## check dud data type
class(MyData$dud)
## [1] "integer"
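### Some basic conclusions. Each record is one (diagnosis quarter, delay category) cell, not one case, so these statistics mostly describe the layout of the table. Records are spread evenly across years, except that 1983 covers only two quarters, which is why quarter is slightly skewed toward the second half of the year. The mean delay across records is 20.07 months (about 6.7 quarters, or ~1.7 years); it nearly equals the median (20) because every quarter carries the same set of delay categories. Most records are not censored (about 18% are, since the mean of dud is 0.1842). The mean of time across records is 19.5 quarters, or ~5 years, and equals the median; this makes sense because every quarter has the same number of records, so time is spread uniformly rather than normally. The number of cases per record is skewed to the right (mean 11.08 vs. median 2).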
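Because delay is recorded in months, converting the record-level mean to quarters and years is a one-line check. The sketch below assumes the delay categories are exactly the 15 values visible in the head() output above (0, 2, 5, ..., 41); `delay_grid` and `mean_months` are names introduced here for illustration.

```r
# Delay categories as they appear in the printed rows (assumed complete)
delay_grid <- c(0, seq(2, 41, by = 3))

# Every diagnosis quarter carries the same categories, so the
# record-level mean delay is simply the mean of this grid (in months)
mean_months <- mean(delay_grid)

mean_months        # months
mean_months / 3    # quarters
mean_months / 12   # years
```

The first value agrees with mean(MyData$delay) above, which suggests that the record-level mean reflects the table layout rather than actual reporting behaviour.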
# Question 2
## Creating variable delays in quarters instead of months
MyData$delayQ<-MyData$delay/3
head(MyData$delayQ,n=50)
## [1] 0.0000000 0.6666667 1.6666667 2.6666667 3.6666667 4.6666667
## [7] 5.6666667 6.6666667 7.6666667 8.6666667 9.6666667 10.6666667
## [13] 11.6666667 12.6666667 13.6666667 0.0000000 0.6666667 1.6666667
## [19] 2.6666667 3.6666667 4.6666667 5.6666667 6.6666667 7.6666667
## [25] 8.6666667 9.6666667 10.6666667 11.6666667 12.6666667 13.6666667
## [31] 0.0000000 0.6666667 1.6666667 2.6666667 3.6666667 4.6666667
## [37] 5.6666667 6.6666667 7.6666667 8.6666667 9.6666667 10.6666667
## [43] 11.6666667 12.6666667 13.6666667 0.0000000 0.6666667 1.6666667
## [49] 2.6666667 3.6666667
## Creating variable "Censored" FALSE/TRUE
MyData$Censored<-(MyData$dud==1)
head(MyData$Censored,n=40)
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
head(MyData,n=50)
## X year quarter delay dud time y yearChar delayQ Censored
## 1 1 1983 3 0 0 1 2 1983 0.0000000 FALSE
## 2 2 1983 3 2 0 1 6 1983 0.6666667 FALSE
## 3 3 1983 3 5 0 1 0 1983 1.6666667 FALSE
## 4 4 1983 3 8 0 1 1 1983 2.6666667 FALSE
## 5 5 1983 3 11 0 1 1 1983 3.6666667 FALSE
## 6 6 1983 3 14 0 1 0 1983 4.6666667 FALSE
## 7 7 1983 3 17 0 1 0 1983 5.6666667 FALSE
## 8 8 1983 3 20 0 1 1 1983 6.6666667 FALSE
## 9 9 1983 3 23 0 1 0 1983 7.6666667 FALSE
## 10 10 1983 3 26 0 1 0 1983 8.6666667 FALSE
## 11 11 1983 3 29 0 1 0 1983 9.6666667 FALSE
## 12 12 1983 3 32 0 1 0 1983 10.6666667 FALSE
## 13 13 1983 3 35 0 1 0 1983 11.6666667 FALSE
## 14 14 1983 3 38 0 1 0 1983 12.6666667 FALSE
## 15 15 1983 3 41 0 1 1 1983 13.6666667 FALSE
## 16 16 1983 4 0 0 2 2 1983 0.0000000 FALSE
## 17 17 1983 4 2 0 2 7 1983 0.6666667 FALSE
## 18 18 1983 4 5 0 2 1 1983 1.6666667 FALSE
## 19 19 1983 4 8 0 2 1 1983 2.6666667 FALSE
## 20 20 1983 4 11 0 2 1 1983 3.6666667 FALSE
## 21 21 1983 4 14 0 2 0 1983 4.6666667 FALSE
## 22 22 1983 4 17 0 2 0 1983 5.6666667 FALSE
## 23 23 1983 4 20 0 2 0 1983 6.6666667 FALSE
## 24 24 1983 4 23 0 2 0 1983 7.6666667 FALSE
## 25 25 1983 4 26 0 2 0 1983 8.6666667 FALSE
## 26 26 1983 4 29 0 2 0 1983 9.6666667 FALSE
## 27 27 1983 4 32 0 2 0 1983 10.6666667 FALSE
## 28 28 1983 4 35 0 2 0 1983 11.6666667 FALSE
## 29 29 1983 4 38 0 2 0 1983 12.6666667 FALSE
## 30 30 1983 4 41 0 2 0 1983 13.6666667 FALSE
## 31 31 1984 1 0 0 3 4 1984 0.0000000 FALSE
## 32 32 1984 1 2 0 3 4 1984 0.6666667 FALSE
## 33 33 1984 1 5 0 3 0 1984 1.6666667 FALSE
## 34 34 1984 1 8 0 3 1 1984 2.6666667 FALSE
## 35 35 1984 1 11 0 3 0 1984 3.6666667 FALSE
## 36 36 1984 1 14 0 3 2 1984 4.6666667 FALSE
## 37 37 1984 1 17 0 3 0 1984 5.6666667 FALSE
## 38 38 1984 1 20 0 3 0 1984 6.6666667 FALSE
## 39 39 1984 1 23 0 3 0 1984 7.6666667 FALSE
## 40 40 1984 1 26 0 3 0 1984 8.6666667 FALSE
## 41 41 1984 1 29 0 3 2 1984 9.6666667 FALSE
## 42 42 1984 1 32 0 3 1 1984 10.6666667 FALSE
## 43 43 1984 1 35 0 3 0 1984 11.6666667 FALSE
## 44 44 1984 1 38 0 3 0 1984 12.6666667 FALSE
## 45 45 1984 1 41 0 3 0 1984 13.6666667 FALSE
## 46 46 1984 2 0 0 4 0 1984 0.0000000 FALSE
## 47 47 1984 2 2 0 4 10 1984 0.6666667 FALSE
## 48 48 1984 2 5 0 4 0 1984 1.6666667 FALSE
## 49 49 1984 2 8 0 4 1 1984 2.6666667 FALSE
## 50 50 1984 2 11 0 4 1 1984 3.6666667 FALSE
## checking if any year variable is missing
any(is.na(MyData$year))
## [1] FALSE
## subsetting data for year 1984 only, and separately for records where AIDS cases were reported
MyData1984<-subset(MyData,year==1984)
MyData1984
## X year quarter delay dud time y yearChar delayQ Censored
## 31 31 1984 1 0 0 3 4 1984 0.0000000 FALSE
## 32 32 1984 1 2 0 3 4 1984 0.6666667 FALSE
## 33 33 1984 1 5 0 3 0 1984 1.6666667 FALSE
## 34 34 1984 1 8 0 3 1 1984 2.6666667 FALSE
## 35 35 1984 1 11 0 3 0 1984 3.6666667 FALSE
## 36 36 1984 1 14 0 3 2 1984 4.6666667 FALSE
## 37 37 1984 1 17 0 3 0 1984 5.6666667 FALSE
## 38 38 1984 1 20 0 3 0 1984 6.6666667 FALSE
## 39 39 1984 1 23 0 3 0 1984 7.6666667 FALSE
## 40 40 1984 1 26 0 3 0 1984 8.6666667 FALSE
## 41 41 1984 1 29 0 3 2 1984 9.6666667 FALSE
## 42 42 1984 1 32 0 3 1 1984 10.6666667 FALSE
## 43 43 1984 1 35 0 3 0 1984 11.6666667 FALSE
## 44 44 1984 1 38 0 3 0 1984 12.6666667 FALSE
## 45 45 1984 1 41 0 3 0 1984 13.6666667 FALSE
## 46 46 1984 2 0 0 4 0 1984 0.0000000 FALSE
## 47 47 1984 2 2 0 4 10 1984 0.6666667 FALSE
## 48 48 1984 2 5 0 4 0 1984 1.6666667 FALSE
## 49 49 1984 2 8 0 4 1 1984 2.6666667 FALSE
## 50 50 1984 2 11 0 4 1 1984 3.6666667 FALSE
## 51 51 1984 2 14 0 4 0 1984 4.6666667 FALSE
## 52 52 1984 2 17 0 4 0 1984 5.6666667 FALSE
## 53 53 1984 2 20 0 4 0 1984 6.6666667 FALSE
## 54 54 1984 2 23 0 4 1 1984 7.6666667 FALSE
## 55 55 1984 2 26 0 4 1 1984 8.6666667 FALSE
## 56 56 1984 2 29 0 4 1 1984 9.6666667 FALSE
## 57 57 1984 2 32 0 4 0 1984 10.6666667 FALSE
## 58 58 1984 2 35 0 4 0 1984 11.6666667 FALSE
## 59 59 1984 2 38 0 4 0 1984 12.6666667 FALSE
## 60 60 1984 2 41 0 4 0 1984 13.6666667 FALSE
## 61 61 1984 3 0 0 5 6 1984 0.0000000 FALSE
## 62 62 1984 3 2 0 5 17 1984 0.6666667 FALSE
## 63 63 1984 3 5 0 5 3 1984 1.6666667 FALSE
## 64 64 1984 3 8 0 5 1 1984 2.6666667 FALSE
## 65 65 1984 3 11 0 5 1 1984 3.6666667 FALSE
## 66 66 1984 3 14 0 5 0 1984 4.6666667 FALSE
## 67 67 1984 3 17 0 5 0 1984 5.6666667 FALSE
## 68 68 1984 3 20 0 5 0 1984 6.6666667 FALSE
## 69 69 1984 3 23 0 5 0 1984 7.6666667 FALSE
## 70 70 1984 3 26 0 5 0 1984 8.6666667 FALSE
## 71 71 1984 3 29 0 5 0 1984 9.6666667 FALSE
## 72 72 1984 3 32 0 5 1 1984 10.6666667 FALSE
## 73 73 1984 3 35 0 5 0 1984 11.6666667 FALSE
## 74 74 1984 3 38 0 5 0 1984 12.6666667 FALSE
## 75 75 1984 3 41 0 5 1 1984 13.6666667 FALSE
## 76 76 1984 4 0 0 6 5 1984 0.0000000 FALSE
## 77 77 1984 4 2 0 6 22 1984 0.6666667 FALSE
## 78 78 1984 4 5 0 6 1 1984 1.6666667 FALSE
## 79 79 1984 4 8 0 6 5 1984 2.6666667 FALSE
## 80 80 1984 4 11 0 6 2 1984 3.6666667 FALSE
## 81 81 1984 4 14 0 6 1 1984 4.6666667 FALSE
## 82 82 1984 4 17 0 6 0 1984 5.6666667 FALSE
## 83 83 1984 4 20 0 6 2 1984 6.6666667 FALSE
## 84 84 1984 4 23 0 6 1 1984 7.6666667 FALSE
## 85 85 1984 4 26 0 6 0 1984 8.6666667 FALSE
## 86 86 1984 4 29 0 6 0 1984 9.6666667 FALSE
## 87 87 1984 4 32 0 6 0 1984 10.6666667 FALSE
## 88 88 1984 4 35 0 6 0 1984 11.6666667 FALSE
## 89 89 1984 4 38 0 6 0 1984 12.6666667 FALSE
## 90 90 1984 4 41 0 6 0 1984 13.6666667 FALSE
MyDataNot0<-subset(MyData,y!=0)
head(MyDataNot0,n=25)
## X year quarter delay dud time y yearChar delayQ Censored
## 1 1 1983 3 0 0 1 2 1983 0.0000000 FALSE
## 2 2 1983 3 2 0 1 6 1983 0.6666667 FALSE
## 4 4 1983 3 8 0 1 1 1983 2.6666667 FALSE
## 5 5 1983 3 11 0 1 1 1983 3.6666667 FALSE
## 8 8 1983 3 20 0 1 1 1983 6.6666667 FALSE
## 15 15 1983 3 41 0 1 1 1983 13.6666667 FALSE
## 16 16 1983 4 0 0 2 2 1983 0.0000000 FALSE
## 17 17 1983 4 2 0 2 7 1983 0.6666667 FALSE
## 18 18 1983 4 5 0 2 1 1983 1.6666667 FALSE
## 19 19 1983 4 8 0 2 1 1983 2.6666667 FALSE
## 20 20 1983 4 11 0 2 1 1983 3.6666667 FALSE
## 31 31 1984 1 0 0 3 4 1984 0.0000000 FALSE
## 32 32 1984 1 2 0 3 4 1984 0.6666667 FALSE
## 34 34 1984 1 8 0 3 1 1984 2.6666667 FALSE
## 36 36 1984 1 14 0 3 2 1984 4.6666667 FALSE
## 41 41 1984 1 29 0 3 2 1984 9.6666667 FALSE
## 42 42 1984 1 32 0 3 1 1984 10.6666667 FALSE
## 47 47 1984 2 2 0 4 10 1984 0.6666667 FALSE
## 49 49 1984 2 8 0 4 1 1984 2.6666667 FALSE
## 50 50 1984 2 11 0 4 1 1984 3.6666667 FALSE
## 54 54 1984 2 23 0 4 1 1984 7.6666667 FALSE
## 55 55 1984 2 26 0 4 1 1984 8.6666667 FALSE
## 56 56 1984 2 29 0 4 1 1984 9.6666667 FALSE
## 61 61 1984 3 0 0 5 6 1984 0.0000000 FALSE
## 62 62 1984 3 2 0 5 17 1984 0.6666667 FALSE
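The two subsets above are built independently. If the goal is rows that satisfy both conditions at once, the conditions can be combined in one call. This is a minimal sketch on a toy data frame (`toy`, a hypothetical name) holding a few rows copied from the output above.

```r
# A few rows copied from the output above (columns trimmed for brevity)
toy <- data.frame(
  year  = c(1983, 1983, 1984, 1984, 1984),
  delay = c(0, 5, 0, 5, 8),
  y     = c(2, 0, 4, 0, 1)
)

# Both conditions in a single call: diagnosed in 1984 AND at least one case
toy1984Not0 <- subset(toy, year == 1984 & y != 0)
toy1984Not0
```

On the full data the same pattern would read subset(MyData, year==1984 & y!=0).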
## Assigning names to variables
names(MyData)<-c("RecordCount","DiagYear","DiagQuarter","DelaysMonths","CensorFlag","TimeSpan","PatientCount","YearChar","DelaysQtr","CensorFlag1")
head(MyData,n=25)
## RecordCount DiagYear DiagQuarter DelaysMonths CensorFlag TimeSpan
## 1 1 1983 3 0 0 1
## 2 2 1983 3 2 0 1
## 3 3 1983 3 5 0 1
## 4 4 1983 3 8 0 1
## 5 5 1983 3 11 0 1
## 6 6 1983 3 14 0 1
## 7 7 1983 3 17 0 1
## 8 8 1983 3 20 0 1
## 9 9 1983 3 23 0 1
## 10 10 1983 3 26 0 1
## 11 11 1983 3 29 0 1
## 12 12 1983 3 32 0 1
## 13 13 1983 3 35 0 1
## 14 14 1983 3 38 0 1
## 15 15 1983 3 41 0 1
## 16 16 1983 4 0 0 2
## 17 17 1983 4 2 0 2
## 18 18 1983 4 5 0 2
## 19 19 1983 4 8 0 2
## 20 20 1983 4 11 0 2
## 21 21 1983 4 14 0 2
## 22 22 1983 4 17 0 2
## 23 23 1983 4 20 0 2
## 24 24 1983 4 23 0 2
## 25 25 1983 4 26 0 2
## PatientCount YearChar DelaysQtr CensorFlag1
## 1 2 1983 0.0000000 FALSE
## 2 6 1983 0.6666667 FALSE
## 3 0 1983 1.6666667 FALSE
## 4 1 1983 2.6666667 FALSE
## 5 1 1983 3.6666667 FALSE
## 6 0 1983 4.6666667 FALSE
## 7 0 1983 5.6666667 FALSE
## 8 1 1983 6.6666667 FALSE
## 9 0 1983 7.6666667 FALSE
## 10 0 1983 8.6666667 FALSE
## 11 0 1983 9.6666667 FALSE
## 12 0 1983 10.6666667 FALSE
## 13 0 1983 11.6666667 FALSE
## 14 0 1983 12.6666667 FALSE
## 15 1 1983 13.6666667 FALSE
## 16 2 1983 0.0000000 FALSE
## 17 7 1983 0.6666667 FALSE
## 18 1 1983 1.6666667 FALSE
## 19 1 1983 2.6666667 FALSE
## 20 1 1983 3.6666667 FALSE
## 21 0 1983 4.6666667 FALSE
## 22 0 1983 5.6666667 FALSE
## 23 0 1983 6.6666667 FALSE
## 24 0 1983 7.6666667 FALSE
## 25 0 1983 8.6666667 FALSE
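Assigning all names positionally works, but it silently mislabels columns if their order ever changes. One possible alternative is to rename by matching the old names, sketched here on a toy data frame; `toy` and `new_names` are hypothetical names used only for illustration.

```r
toy <- data.frame(X = 1:3, year = c(1983, 1983, 1983), y = c(2, 6, 0))

# Map old names to new ones explicitly instead of relying on column order
new_names <- c(X = "RecordCount", year = "DiagYear", y = "PatientCount")
names(toy)[match(names(new_names), names(toy))] <- unname(new_names)
names(toy)
```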
# Question 3
## Basic Histograms
library(ggplot2)
hist(MyData$DiagYear,main="Diag Year Distribution",xlab="Diag Year")
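## The first bar looks taller, but this is most likely a binning artifact: with default breaks hist() can put 1983 and 1984 in the same bin. 1983 actually has fewer records (30, covering two quarters) than the other years (60 each)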
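A count of records per year is a more direct check than reading bar heights. The sketch below rebuilds the diagnosis year from the table layout seen in the summaries (38 quarters starting in 1983 Q3, 15 delay categories each), which is an assumption about the full file; `time`, `diag_year` and `h` are illustrative names.

```r
# Rebuild the diagnosis year of each record from the assumed layout:
# 38 quarters starting in 1983 Q3, with 15 delay categories per quarter
time <- rep(1:38, each = 15)
diag_year <- 1983 + (time + 1) %/% 4

# Exact number of records per year
table(diag_year)

# One bin per year, centred on the year, so no two years share a bar
h <- hist(diag_year, breaks = seq(1982.5, 1992.5, by = 1), plot = FALSE)
h$counts
```

With the real data, the same breaks argument can be passed to hist(MyData$DiagYear, ...) to get one bar per year.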
## creating new variable YearQtr
MyData$YearQtr<-MyData$DiagYear*100+MyData$DiagQuarter
head(MyData,n=10)
## RecordCount DiagYear DiagQuarter DelaysMonths CensorFlag TimeSpan
## 1 1 1983 3 0 0 1
## 2 2 1983 3 2 0 1
## 3 3 1983 3 5 0 1
## 4 4 1983 3 8 0 1
## 5 5 1983 3 11 0 1
## 6 6 1983 3 14 0 1
## 7 7 1983 3 17 0 1
## 8 8 1983 3 20 0 1
## 9 9 1983 3 23 0 1
## 10 10 1983 3 26 0 1
## PatientCount YearChar DelaysQtr CensorFlag1 YearQtr
## 1 2 1983 0.0000000 FALSE 198303
## 2 6 1983 0.6666667 FALSE 198303
## 3 0 1983 1.6666667 FALSE 198303
## 4 1 1983 2.6666667 FALSE 198303
## 5 1 1983 3.6666667 FALSE 198303
## 6 0 1983 4.6666667 FALSE 198303
## 7 0 1983 5.6666667 FALSE 198303
## 8 1 1983 6.6666667 FALSE 198303
## 9 0 1983 7.6666667 FALSE 198303
## 10 0 1983 8.6666667 FALSE 198303
hist(MyData$YearQtr,main="Diag Year/Qtr Distribution",xlab="Diag Year/Qtr")
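Coding the period as year*100+quarter is convenient as a label, but on a numeric axis consecutive quarters are not equally spaced (198304 is followed by 198401), which distorts histogram bins. A decimal-year encoding is one possible alternative; the sketch uses a few hand-typed quarters, and `year_frac` is a hypothetical name.

```r
diag_year    <- c(1983, 1983, 1984, 1984)
diag_quarter <- c(3, 4, 1, 2)

# Encoding used above: consecutive quarters are not equally spaced
year_qtr <- diag_year * 100 + diag_quarter
diff(year_qtr)    # 1 97 1

# Decimal-year alternative: each quarter advances by exactly 0.25
year_frac <- diag_year + (diag_quarter - 1) / 4
diff(year_frac)   # 0.25 0.25 0.25
```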
plot(DiagYear~CensorFlag,data=MyData)
## the plot suggests that censored records appear only for the later diagnosis years (from about 1989 onward); earlier years look complete
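The same question can be answered numerically by computing the share of censored records per diagnosis year. The values below are made up purely to show the mechanics and are not taken from the data; `toy` and `censored_share` are hypothetical names.

```r
# Hypothetical toy values, for illustration only (not taken from the data)
toy <- data.frame(
  DiagYear   = c(1988, 1988, 1990, 1990, 1992, 1992),
  CensorFlag = c(0, 0, 0, 1, 1, 1)
)

# Share of censored records within each diagnosis year
censored_share <- tapply(toy$CensorFlag, toy$DiagYear, mean)
censored_share
```

Applied to the real columns, this would be tapply(MyData$CensorFlag, MyData$DiagYear, mean).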
boxplot(MyData$PatientCount)
# Question 4
## doing some basic summarization by year, to spot any anomalies
aggregate(PatientCount~DiagYear,MyData,sum)
## DiagYear PatientCount
## 1 1983 24
## 2 1984 98
## 3 1985 215
## 4 1986 431
## 5 1987 601
## 6 1988 814
## 7 1989 929
## 8 1990 1071
## 9 1991 1098
## 10 1992 1034
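## we see the number of reported patients grow up to 1990 and then level off, which is consistent with the AIDS epidemic; note that counts for the most recent years are incomplete, because some reports were still outstanding (censored records)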
## creating new variable DelayPatient
MyData$DelayPatient<-MyData$DelaysMonths*MyData$PatientCount
MySum<-aggregate(cbind(DelayPatient,PatientCount)~DiagYear,MyData,sum)
MySum
## DiagYear DelayPatient PatientCount
## 1 1983 130 24
## 2 1984 580 98
## 3 1985 1710 215
## 4 1986 3422 431
## 5 1987 3941 601
## 6 1988 5643 814
## 7 1989 5958 929
## 8 1990 5034 1071
## 9 1991 3852 1098
## 10 1992 1990 1034
## create average
MySum$Ave<-MySum$DelayPatient/MySum$PatientCount
MySum
## DiagYear DelayPatient PatientCount Ave
## 1 1983 130 24 5.416667
## 2 1984 580 98 5.918367
## 3 1985 1710 215 7.953488
## 4 1986 3422 431 7.939675
## 5 1987 3941 601 6.557404
## 6 1988 5643 814 6.932432
## 7 1989 5958 929 6.413348
## 8 1990 5034 1071 4.700280
## 9 1991 3852 1098 3.508197
## 10 1992 1990 1034 1.924565
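The Ave column is a case-weighted mean of the delay (in months): each delay category is weighted by the number of cases reported with that delay. As a small check of the arithmetic, the sketch below uses the 1983 Q3 records copied from the head() output near the top and compares the manual computation with base R's weighted.mean(); the vector names are illustrative.

```r
# Records for 1983 Q3, copied from the head() output near the top
delay_months <- c(0, 2, 5, 8, 11, 14, 17, 20, 23, 26, 29, 32, 35, 38, 41)
cases        <- c(2, 6, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1)

# Manual version, as in the steps above
manual_ave <- sum(delay_months * cases) / sum(cases)
manual_ave

# Same quantity via the built-in weighted mean
builtin_ave <- weighted.mean(delay_months, w = cases)
builtin_ave
```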
### Average delays (in months) increased in 1985/1986, possibly because the system got overwhelmed by the number of new cases; after that the delays drop. However, the 1990-1992 figures should be taken with a grain of salt: the data for those years are apparently incomplete, so long delays have not yet been recorded
## The next step could be to do the same by quarter, to look for seasonality, or by year/quarter combination, to see whether more data points give a better picture of what takes place; at this point it makes sense to create a function that performs these steps
MyAve<-function(MyVar)
{
  ## sum delay-weighted cases and case counts within each group (uses the global MyData)
  MySum<-aggregate(cbind(DelayPatient,PatientCount)~MyVar,MyData,sum)
  ## create average
  MySum$Ave<-MySum$DelayPatient/MySum$PatientCount
  MySum
}
MyAve(MyData$DiagQuarter)
## MyVar DelayPatient PatientCount Ave
## 1 1 8705 1537 5.663630
## 2 2 8197 1572 5.214377
## 3 3 8109 1683 4.818182
## 4 4 7249 1523 4.759685
### It appears that the beginning of the year has longer delays than the second half (5.66 and 5.21 months in Q1 and Q2 vs. 4.82 and 4.76 in Q3 and Q4); the difference is not very big, so we probably should not jump to conclusions
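MyAve reads MyData from the global environment and labels the grouping column MyVar in its output. One possible variant, sketched below, takes the data frame and the grouping column name as arguments. `mean_delay_by` is a hypothetical name, and the toy rows are the non-zero records for 1983 Q3 and Q4 copied from the output earlier in the document.

```r
# Hypothetical variant of MyAve: the data frame and the grouping column
# name are passed in, so nothing depends on a global object
mean_delay_by <- function(data, group_col) {
  sums <- aggregate(data[c("DelayPatient", "PatientCount")],
                    by = data[group_col], FUN = sum)
  sums$Ave <- sums$DelayPatient / sums$PatientCount
  sums
}

# Non-zero records for 1983 Q3 and Q4, copied from the output above
toy <- data.frame(
  DiagQuarter  = c(3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4),
  DelaysMonths = c(0, 2, 8, 11, 20, 41, 0, 2, 5, 8, 11),
  PatientCount = c(2, 6, 1, 1, 1, 1, 2, 7, 1, 1, 1)
)
toy$DelayPatient <- toy$DelaysMonths * toy$PatientCount

res <- mean_delay_by(toy, "DiagQuarter")
res
```

Passing the column name as a string keeps the real column name in the result instead of MyVar, at the cost of a slightly longer call.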
MyAve(MyData$YearQtr)
## MyVar DelayPatient PatientCount Ave
## 1 198303 92 12 7.6666667
## 2 198304 38 12 3.1666667
## 3 198401 134 14 9.5714286
## 4 198402 117 15 7.8000000
## 5 198403 141 30 4.7000000
## 6 198404 188 39 4.8205128
## 7 198501 350 47 7.4468085
## 8 198502 280 40 7.0000000
## 9 198503 552 63 8.7619048
## 10 198504 528 65 8.1230769
## 11 198601 763 82 9.3048780
## 12 198602 1057 120 8.8083333
## 13 198603 909 109 8.3394495
## 14 198604 693 120 5.7750000
## 15 198701 1072 134 8.0000000
## 16 198702 908 141 6.4397163
## 17 198703 789 153 5.1568627
## 18 198704 1172 173 6.7745665
## 19 198801 1291 174 7.4195402
## 20 198802 1495 211 7.0853081
## 21 198803 1463 224 6.5312500
## 22 198804 1394 205 6.8000000
## 23 198901 1443 224 6.4419643
## 24 198902 1641 219 7.4931507
## 25 198903 1605 253 6.3438735
## 26 198904 1269 233 5.4463519
## 27 199001 1670 281 5.9430605
## 28 199002 1125 245 4.5918367
## 29 199003 1124 260 4.3230769
## 30 199004 1115 285 3.9122807
## 31 199101 1165 271 4.2988930
## 32 199102 975 263 3.7072243
## 33 199103 992 306 3.2418301
## 34 199104 720 258 2.7906977
## 35 199201 817 310 2.6354839
## 36 199202 599 318 1.8836478
## 37 199203 442 273 1.6190476
## 38 199204 132 133 0.9924812
### The result confirms that the worst delays were from the 3rd quarter '85 to the 1st quarter '87; 1st Q '84 and 4th Q '86 appear to be outliers; from 4th Q '89 forward, incompleteness of the data makes conclusions difficult
# Question 5; reading csv file from GitHub
library(RCurl)
## Loading required package: bitops
x <- getURL("https://raw.githubusercontent.com/mgroysman/CSV-File/master/aids.csv")
y <- read.csv(text = x)
head(y,n=10)
## X year quarter delay dud time y
## 1 1 1983 3 0 0 1 2
## 2 2 1983 3 2 0 1 6
## 3 3 1983 3 5 0 1 0
## 4 4 1983 3 8 0 1 1
## 5 5 1983 3 11 0 1 1
## 6 6 1983 3 14 0 1 0
## 7 7 1983 3 17 0 1 0
## 8 8 1983 3 20 0 1 1
## 9 9 1983 3 23 0 1 0
## 10 10 1983 3 26 0 1 0
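The read.csv(text = ...) step parses a CSV held in a character string, which is what getURL() returns. The sketch below shows that step offline, using a short stand-in string typed from the first rows printed above; `csv_text` and `aids_head` are illustrative names, and the header line is an assumption about the raw file.

```r
# Stand-in for the string returned by getURL(): the first rows shown above
csv_text <- "X,year,quarter,delay,dud,time,y
1,1983,3,0,0,1,2
2,1983,3,2,0,1,6
3,1983,3,5,0,1,0"

aids_head <- read.csv(text = csv_text)
str(aids_head)

# In recent R versions read.csv() also accepts a URL as its file argument, e.g.
# aids <- read.csv("https://raw.githubusercontent.com/mgroysman/CSV-File/master/aids.csv")
```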