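# Load the data with base R's read.csv(); no extra package is needed for this.
schoolData <- read.csv("https://raw.githubusercontent.com/juliaDataScience-22/cuny-summer-23/main/Early.csv")
# View() opens an interactive viewer, so only call it in an interactive session.
if (interactive()) View(schoolData)
# Keep only columns 3 to 5: cog, age, and trt.
schoolData <- schoolData[,3:5]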
head(schoolData)
## cog age trt
## 1 103 1.0 Y
## 2 119 1.5 Y
## 3 96 2.0 Y
## 4 106 1.0 Y
## 5 107 1.5 Y
## 6 96 2.0 Y
Question: Did the infants in the early intervention program perform
significantly better on the cognitive test than infants not in the
program? If so, did the length of time the infants spent in the program
correlate with a higher cognitive score?
#Data exploration
summary(schoolData)
## cog age trt
## Min. : 57.0 Min. :1.0 Length:309
## 1st Qu.: 92.0 1st Qu.:1.0 Class :character
## Median :103.0 Median :1.5 Mode :character
## Mean :102.6 Mean :1.5
## 3rd Qu.:113.0 3rd Qu.:2.0
## Max. :137.0 Max. :2.0
mean(schoolData$cog)
## [1] 102.6181
mean(schoolData$age)
## [1] 1.5
median(schoolData$cog)
## [1] 103
median(schoolData$age)
## [1] 1.5
sd(schoolData$cog)
## [1] 15.70517
sd(schoolData$age)
## [1] 0.4089105
var(schoolData$cog)
## [1] 246.6524
var(schoolData$age)
## [1] 0.1672078
library(DescTools)
Mode(schoolData$cog)
## [1] 96
## attr(,"freq")
## [1] 22
Mode(schoolData$age)
## [1] 1.0 1.5 2.0
## attr(,"freq")
## [1] 103
Mode(schoolData$trt)
## [1] "Y"
## attr(,"freq")
## [1] 174
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
schoolData %>% count(cog)
## cog n
## 1 57 1
## 2 66 2
## 3 68 1
## 4 70 1
## 5 72 1
## 6 73 2
## 7 76 5
## 8 77 9
## 9 78 1
## 10 79 1
## 11 80 2
## 12 81 6
## 13 83 3
## 14 84 4
## 15 85 9
## 16 86 1
## 17 87 8
## 18 88 7
## 19 89 3
## 20 90 4
## 21 91 2
## 22 92 12
## 23 93 8
## 24 94 1
## 25 96 22
## 26 98 8
## 27 99 2
## 28 100 20
## 29 102 2
## 30 103 11
## 31 104 10
## 32 105 12
## 33 106 10
## 34 107 2
## 35 108 4
## 36 109 10
## 37 110 1
## 38 111 5
## 39 112 17
## 40 113 2
## 41 114 1
## 42 115 6
## 43 116 5
## 44 117 10
## 45 118 1
## 46 119 9
## 47 120 1
## 48 121 4
## 49 122 7
## 50 123 3
## 51 124 2
## 52 126 5
## 53 127 1
## 54 128 7
## 55 130 2
## 56 131 4
## 57 132 2
## 58 134 3
## 59 136 2
## 60 137 2
schoolData %>% count(age)
## age n
## 1 1.0 103
## 2 1.5 103
## 3 2.0 103
schoolData %>% count(trt)
## trt n
## 1 N 135
## 2 Y 174
Conclusions:
Cognitive scores ranged from 57 to 137, and ages ranged from 1 to 2
years. The mean cognitive score was 102.6 (median 103), and the mean age
was 1.5 years. The standard deviation of the cognitive score was about
15.7 (variance 246.7), so scores were widely spread around the mean. Age
varied much less, with a standard deviation of about 0.41 (variance
0.17), because it takes only three values. The most common cognitive
score was 96, which occurred 22 times. The three ages (1, 1.5, and 2)
each appeared 103 times. More infants were in the treatment program
(174) than not (135). The total number of infants was 309.
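Because the question compares two groups, it may help to look at the scores per group rather than pooled. The following is a minimal sketch of such a breakdown in base R. The data frame `toy` and its values are made up for illustration and are not taken from Early.csv; only the column names `cog` and `trt` follow the data shown above.

```r
# Illustrative data only: these values are made up and are NOT from Early.csv.
toy <- data.frame(
  cog = c(103, 119, 96, 106, 90, 85, 100, 95),
  trt = c("Y", "Y", "Y", "Y", "N", "N", "N", "N")
)

# tapply() and table() both order groups by sorted level, so the results align.
group_means <- tapply(toy$cog, toy$trt, mean)
group_sds   <- tapply(toy$cog, toy$trt, sd)
group_ns    <- table(toy$trt)

# Collect mean, standard deviation, and count per treatment group.
group_summary <- data.frame(
  trt  = names(group_means),
  mean = as.numeric(group_means),
  sd   = as.numeric(group_sds),
  n    = as.integer(group_ns)
)
print(group_summary)
```

On the full data, the same calls could be applied to `schoolData$cog` and `schoolData$trt`.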
#Data Wrangling
colnames(schoolData) <- c("Cognitive_Score", "Age", "Intervention")
schoolData$Time_In_Program <- schoolData$Age - 0.5
schoolData$Intervention <- sub("Y", "Yes", schoolData$Intervention)
schoolData$Intervention <- sub("N", "No", schoolData$Intervention)
schoolData$Time_In_Program <- ifelse(schoolData$Intervention == "No", 0, schoolData$Time_In_Program)
head(schoolData)
## Cognitive_Score Age Intervention Time_In_Program
## 1 103 1.0 Yes 0.5
## 2 119 1.5 Yes 1.0
## 3 96 2.0 Yes 1.5
## 4 106 1.0 Yes 0.5
## 5 107 1.5 Yes 1.0
## 6 96 2.0 Yes 1.5
In the section above, the columns were renamed, a new column was
added to show how long the infants were in the intervention program, the
values for intervention were changed to words instead of letters, and
infants who were not in the program had their Time_In_Program changed to
0.
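The same wrangling steps can also be written so that each new column is built in one pass. The sketch below is one possible alternative, not the code used for the results in this report. It recodes by exact match with `ifelse()` instead of `sub()`, which substitutes the first matching character anywhere in a string. The names `raw` and `tidy` are hypothetical, and the two "N" rows are invented so that both groups appear.

```r
# The first six rows come from the head() output above; the last two "N" rows
# are hypothetical, added only so both groups appear in the example.
raw <- data.frame(
  cog = c(103, 119, 96, 106, 107, 96, 90, 85),
  age = c(1.0, 1.5, 2.0, 1.0, 1.5, 2.0, 1.0, 2.0),
  trt = c("Y", "Y", "Y", "Y", "Y", "Y", "N", "N")
)

tidy <- data.frame(
  Cognitive_Score = raw$cog,
  Age             = raw$age,
  # Recode by exact match rather than pattern substitution.
  Intervention    = ifelse(raw$trt == "Y", "Yes", "No"),
  # Same rule as the code above: Age - 0.5 for enrolled infants, 0 otherwise.
  Time_In_Program = ifelse(raw$trt == "Y", raw$age - 0.5, 0)
)
print(tidy)
```

Building a new data frame leaves the raw columns untouched, which makes it easier to rerun the step without reloading the file.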
library(ggplot2)
ggplot(schoolData, aes(x=Intervention, y=Cognitive_Score)) +
geom_point()

ggplot(schoolData, aes(x=Time_In_Program, y=Cognitive_Score)) +
geom_point()

ggplot(schoolData, aes(x=Age, y=Cognitive_Score)) +
geom_point()

# outlier.color is an argument of geom_boxplot(), not of ggplot()
ggplot(data=schoolData, mapping=aes(x=Time_In_Program, y=Cognitive_Score, group=Time_In_Program)) +
  geom_boxplot(outlier.color='blue')

ggplot(schoolData, aes(x=Cognitive_Score)) + geom_histogram() # The histogram suggests the scores are approximately normally distributed.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
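
The message above asks for an explicit `binwidth`. One common rule of thumb is the Freedman-Diaconis rule, sketched here using the quartiles and sample size already reported by `summary(schoolData)`. This rule is an assumption brought in for illustration, not a choice made in the original analysis, and `fd_binwidth` is a hypothetical name.

```r
# Quartiles and sample size as reported in the exploration section above.
q1 <- 92
q3 <- 113
n  <- 309

# Freedman-Diaconis rule: binwidth = 2 * IQR / n^(1/3).
fd_binwidth <- 2 * (q3 - q1) / n^(1/3)
print(round(fd_binwidth, 1))
```

A rounded value of this size could then be passed as `geom_histogram(binwidth = ...)`. The exact rounding is a judgment call.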

# Recode Intervention as 1 (Yes) / 0 (No) so every column is numeric for cor()
schoolData$Intervention <- ifelse(schoolData$Intervention == "Yes", 1, 0)
cor(schoolData)
## Cognitive_Score Age Intervention Time_In_Program
## Cognitive_Score 1.00000000 -0.4729575 0.3002091 0.09256145
## Age -0.47295753 1.0000000 0.0000000 0.39432978
## Intervention 0.30020913 0.0000000 1.0000000 0.85079997
## Time_In_Program 0.09256145 0.3943298 0.8508000 1.00000000
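The matrix above reports coefficients only. One way to check whether a coefficient differs from zero is the standard t-test for a Pearson correlation, sketched here with the values copied from the `cor()` output and n = 309. This is an illustrative sketch rather than part of the original analysis. The names `r`, `n`, `t_stat`, and `p_value` are introduced here for the example.

```r
# Correlations with Cognitive_Score, copied from the cor() output above.
r <- c(Age = -0.4729575, Intervention = 0.3002091, Time_In_Program = 0.09256145)
n <- 309  # number of infants, from the exploration section

# t statistic and two-sided p-value for H0: the true correlation is zero.
t_stat  <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

print(round(t_stat, 2))
print(signif(p_value, 2))
```

On the full data, `cor.test(schoolData$Cognitive_Score, schoolData$Intervention)` should report the same test directly.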
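Analysis and Conclusion:
In the scatterplot of Intervention against Cognitive_Score, no clear
relationship between the intervention and the cognitive score was
visible. The correlation between the two was about 0.30, a weak positive
association. Note that the size of a correlation coefficient does not by
itself show whether a result is statistically significant; with 309
infants, that requires a formal test. I also created a box plot and other
scatterplots to check whether age or time in the program were factors,
and none showed an obvious difference. The correlation between
Time_In_Program and Cognitive_Score was about 0.09, which is very weak,
while Age had a moderate negative correlation of about -0.47 with the
score. On this descriptive evidence, the program did not seem to have a
strong effect on the outcome of the cognitive tests. Other influences,
such as environmental factors or genetics, may account for much of the
variation in scores.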
Therefore: Based on the plots and correlations above, the infants in the
early intervention program did not clearly perform better on the
cognitive test than infants who were not in the program. Also, a longer
time in the program did not correlate with a higher cognitive score.
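
A direct way to compare the two groups is a two-sample t-test on the scores. The following is a minimal sketch using Welch's t-test via `t.test()`. The data frame `toy` holds hypothetical scores for illustration only and does not reproduce Early.csv, so its output says nothing about the real data.

```r
# Hypothetical scores for illustration only; NOT taken from Early.csv.
toy <- data.frame(
  Cognitive_Score = c(103, 119, 96, 106, 107, 96, 90, 85, 100, 95, 88, 102),
  Intervention    = rep(c("Yes", "No"), each = 6)
)

# Welch two-sample t-test: does the mean score differ between the groups?
# Groups are ordered alphabetically, so "No" comes first in the output.
welch <- t.test(Cognitive_Score ~ Intervention, data = toy)
print(welch)
```

The same formula could be applied to `schoolData`, where Intervention is coded as 0 and 1 at this point in the analysis. Welch's version is used here because it does not assume equal variances in the two groups.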