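# read.csv() is part of base R, so no extra package is needed to load the file.
schoolData <- read.csv("https://raw.githubusercontent.com/juliaDataScience-22/cuny-summer-23/main/Early.csv")
if (interactive()) View(schoolData)  # open the data viewer only in an interactive session
schoolData <- schoolData[,3:5]       # keep only the cog, age, and trt columns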
head(schoolData)
##   cog age trt
## 1 103 1.0   Y
## 2 119 1.5   Y
## 3  96 2.0   Y
## 4 106 1.0   Y
## 5 107 1.5   Y
## 6  96 2.0   Y

Question: Did the infants in the early intervention program perform significantly better on the cognitive test than infants not in the program? If so, did the length of time the infants spent in the program correlate with a higher cognitive score?

# Data Exploration
summary(schoolData)
##       cog             age          trt           
##  Min.   : 57.0   Min.   :1.0   Length:309        
##  1st Qu.: 92.0   1st Qu.:1.0   Class :character  
##  Median :103.0   Median :1.5   Mode  :character  
##  Mean   :102.6   Mean   :1.5                     
##  3rd Qu.:113.0   3rd Qu.:2.0                     
##  Max.   :137.0   Max.   :2.0
mean(schoolData$cog)
## [1] 102.6181
mean(schoolData$age)
## [1] 1.5
median(schoolData$cog)
## [1] 103
median(schoolData$age)
## [1] 1.5
sd(schoolData$cog)
## [1] 15.70517
sd(schoolData$age)
## [1] 0.4089105
var(schoolData$cog)
## [1] 246.6524
var(schoolData$age)
## [1] 0.1672078
library(DescTools)
Mode(schoolData$cog)
## [1] 96
## attr(,"freq")
## [1] 22
Mode(schoolData$age)
## [1] 1.0 1.5 2.0
## attr(,"freq")
## [1] 103
Mode(schoolData$trt)
## [1] "Y"
## attr(,"freq")
## [1] 174
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
schoolData %>% count(cog)
##    cog  n
## 1   57  1
## 2   66  2
## 3   68  1
## 4   70  1
## 5   72  1
## 6   73  2
## 7   76  5
## 8   77  9
## 9   78  1
## 10  79  1
## 11  80  2
## 12  81  6
## 13  83  3
## 14  84  4
## 15  85  9
## 16  86  1
## 17  87  8
## 18  88  7
## 19  89  3
## 20  90  4
## 21  91  2
## 22  92 12
## 23  93  8
## 24  94  1
## 25  96 22
## 26  98  8
## 27  99  2
## 28 100 20
## 29 102  2
## 30 103 11
## 31 104 10
## 32 105 12
## 33 106 10
## 34 107  2
## 35 108  4
## 36 109 10
## 37 110  1
## 38 111  5
## 39 112 17
## 40 113  2
## 41 114  1
## 42 115  6
## 43 116  5
## 44 117 10
## 45 118  1
## 46 119  9
## 47 120  1
## 48 121  4
## 49 122  7
## 50 123  3
## 51 124  2
## 52 126  5
## 53 127  1
## 54 128  7
## 55 130  2
## 56 131  4
## 57 132  2
## 58 134  3
## 59 136  2
## 60 137  2
schoolData %>% count(age)
##   age   n
## 1 1.0 103
## 2 1.5 103
## 3 2.0 103
schoolData %>% count(trt)
##   trt   n
## 1   N 135
## 2   Y 174

Conclusions:

The lowest cognitive score was 57, and the highest was 137. Ages ranged from 1 to 2 years. The average cognitive score was 102.6, and the average age was 1.5 years. The standard deviation of the cognitive score was about 15.7, and its variance was about 246.7, so the scores were widely spread around the mean. The standard deviation (about 0.41) and variance (about 0.17) of age were small, as expected for a variable that takes only three values. The most common cognitive score was 96, which appeared 22 times. All three ages (1, 1.5, and 2) appeared equally often, 103 times each. More infants were in the treatment program (174) than not (135). The total number of infants was 309.
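
The statistics above describe all infants together, while the question compares two groups. As a minimal sketch, the same statistics could be computed separately for each treatment group. The helper name `summarise_by_group` and the toy data frame are illustrative assumptions, not part of the original analysis.

```r
# Hypothetical helper: count, mean, and standard deviation of one numeric
# column, computed separately for each level of a grouping column.
summarise_by_group <- function(data, value_col, group_col) {
  groups <- split(data[[value_col]], data[[group_col]])
  stats <- t(sapply(groups, function(x) c(n = length(x), mean = mean(x), sd = sd(x))))
  as.data.frame(stats)
}

# Made-up values, used only to show the shape of the result (not the study data).
toy_scores <- data.frame(cog = c(90, 100, 110, 95, 105),
                         trt = c("N", "N", "N", "Y", "Y"))
toy_summary <- summarise_by_group(toy_scores, "cog", "trt")
print(toy_summary)
```

On the data loaded above, the call would be `summarise_by_group(schoolData, "cog", "trt")`, run before the columns are renamed.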

# Data Wrangling
colnames(schoolData) <- c("Cognitive_Score", "Age", "Intervention")
schoolData$Time_In_Program <- schoolData$Age - 0.5
schoolData$Intervention <- sub("Y", "Yes", schoolData$Intervention)
schoolData$Intervention <- sub("N", "No", schoolData$Intervention)
schoolData$Time_In_Program <- ifelse(schoolData$Intervention == "No", 0, schoolData$Time_In_Program)
head(schoolData)
##   Cognitive_Score Age Intervention Time_In_Program
## 1             103 1.0          Yes             0.5
## 2             119 1.5          Yes             1.0
## 3              96 2.0          Yes             1.5
## 4             106 1.0          Yes             0.5
## 5             107 1.5          Yes             1.0
## 6              96 2.0          Yes             1.5

In the section above, the columns were renamed, and a new column, Time_In_Program, was added to show how long each infant had been in the intervention program (computed as Age minus 0.5). The intervention values were changed from letters to words, and infants who were not in the program had their Time_In_Program set to 0.
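
The same steps can be collected into one function, which makes them easier to rerun and check. This is only a sketch: the name `wrangle_school_data` and the toy input are assumptions for illustration, and the function uses an exact comparison rather than `sub()` to recode the treatment column.

```r
# Hypothetical helper that repeats the wrangling steps in one place.
# Assumes the input has columns cog, age, and trt, with trt coded "Y" or "N".
wrangle_school_data <- function(raw) {
  out <- data.frame(
    Cognitive_Score = raw$cog,
    Age = raw$age,
    # Exact comparison: "Y" becomes "Yes", any other non-missing value becomes "No".
    Intervention = ifelse(raw$trt == "Y", "Yes", "No"),
    stringsAsFactors = FALSE
  )
  # Same rule as the original code: Age - 0.5 for participants, 0 otherwise.
  out$Time_In_Program <- ifelse(out$Intervention == "Yes", out$Age - 0.5, 0)
  out
}

# Made-up rows, used only to show the result (not the study data).
toy_raw <- data.frame(cog = c(103, 96, 88),
                      age = c(1.0, 2.0, 1.5),
                      trt = c("Y", "Y", "N"))
toy_clean <- wrangle_school_data(toy_raw)
print(toy_clean)
```

Keeping the rule for Time_In_Program in one expression makes it harder for a participant's value and a non-participant's value to get out of sync.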

library(ggplot2)
ggplot(schoolData, aes(x=Intervention, y=Cognitive_Score)) + 
    geom_point()

ggplot(schoolData, aes(x=Time_In_Program, y=Cognitive_Score)) + 
    geom_point()

ggplot(schoolData, aes(x=Age, y=Cognitive_Score)) + 
    geom_point()

ggplot(schoolData, aes(x=Time_In_Program, y=Cognitive_Score, group=Time_In_Program)) + 
    geom_boxplot(outlier.color='blue') # outlier.color belongs to geom_boxplot(), not ggplot()

ggplot(schoolData, aes(x=Cognitive_Score)) + geom_histogram() # The histogram shows a roughly normal distribution of scores.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
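
Reading normality off a histogram is a judgment call that depends on the bin width. One possible supplementary check, sketched here as an assumption and not something the original analysis ran, is the Shapiro-Wilk test from base R. The helper name `check_normality` and the simulated scores are illustrative only.

```r
# Hypothetical helper: Shapiro-Wilk normality test on a numeric vector.
# A small p-value suggests the values depart from a normal distribution.
check_normality <- function(x) {
  x <- x[!is.na(x)]
  result <- shapiro.test(x)  # stats::shapiro.test accepts 3 to 5000 values
  c(W = unname(result$statistic), p_value = result$p.value)
}

# Simulated values, used only to show the call (not the study data).
set.seed(1)
simulated_scores <- rnorm(309, mean = 100, sd = 15)
normality_result <- check_normality(simulated_scores)
print(normality_result)
```

On the study data, the call would be `check_normality(schoolData$Cognitive_Score)`.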

# Recode Intervention as 1/0 so that cor() can include it with the numeric columns.
schoolData$Intervention <- ifelse(schoolData$Intervention == "Yes", 1, 0)

cor(schoolData)
##                 Cognitive_Score        Age Intervention Time_In_Program
## Cognitive_Score      1.00000000 -0.4729575    0.3002091      0.09256145
## Age                 -0.47295753  1.0000000    0.0000000      0.39432978
## Intervention         0.30020913  0.0000000    1.0000000      0.85079997
## Time_In_Program      0.09256145  0.3943298    0.8508000      1.00000000

Analysis and Conclusion:

Based on the original scatterplot of Intervention against Cognitive_Score, the intervention did not appear to be related to the cognitive score. The correlation between the two was about 0.30, a weak positive association that I did not treat as meaningful. I also created a box plot and additional scatterplots to see whether age or time in the program were factors, and none of them showed a clear upward trend in scores. The correlation matrix agreed: Time_In_Program had a correlation of about 0.09 with Cognitive_Score, and Age had a negative correlation of about -0.47, so scores were lower, not higher, at older ages. These conclusions rest on the plots and the correlation coefficients; no formal significance test was run. The program did not seem to have any effect on the outcome of the cognitive tests. Two possible explanations are environmental factors and genetics.
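
Whether a difference or correlation is statistically significant depends on the sample size as well as on the size of the coefficient, so a formal test is the usual way to check. The sketch below shows one way this could be done with base R, using a Welch two-sample t-test for the group comparison and a Pearson correlation test for time in the program. The helper name `test_program_effect` and the toy data are assumptions for illustration, and restricting the second test to participants is a design choice of this sketch, not of the original analysis.

```r
# Hypothetical helper: formal tests for the two study questions.
# Assumes Intervention is coded 1/0 and Time_In_Program is numeric.
test_program_effect <- function(data) {
  # Question 1: Welch two-sample t-test comparing scores between the two groups.
  group_test <- t.test(Cognitive_Score ~ Intervention, data = data)
  # Question 2: among participants only, test the correlation between
  # time in the program and cognitive score.
  participants <- data[data$Intervention == 1, ]
  time_test <- cor.test(participants$Time_In_Program, participants$Cognitive_Score)
  list(
    group_p_value = group_test$p.value,
    time_correlation = unname(time_test$estimate),
    time_p_value = time_test$p.value
  )
}

# Made-up values, used only to show the call (not the study data).
toy_program <- data.frame(
  Cognitive_Score = c(90, 95, 100, 105, 110, 98, 104, 112, 99, 107),
  Intervention    = c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1),
  Time_In_Program = c(0, 0, 0, 0, 0, 0.5, 1.0, 1.5, 0.5, 1.0)
)
toy_tests <- test_program_effect(toy_program)
print(toy_tests)
```

On the study data, the call would be `test_program_effect(schoolData)`, run after Intervention has been recoded to 1/0. Non-participants are left out of the second test because their Time_In_Program is fixed at 0, which would mix the group difference into the correlation.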

Therefore: The infants in the early intervention program did not perform significantly better on the cognitive test than infants who were not in the program. Also, the length of time in the program did not correlate with a higher cognitive score.