DA606 Final Project: An Exploring into Early Childhood Education

Yun Mai
May 15, 2017

knitr::opts_chunk$set(echo = TRUE)

install.packages(c('devtools','openintro'))
devtools::install_github('jbryer/DATA606',force = TRUE)
install.packages("tibble", repos=c("http://rstudio.org/_packages", "http://cran.rstudio.com"))

Load packages

library(RCurl)

## Loading required package: bitops

library(tibble)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyr)

## 
## Attaching package: 'tidyr'

## The following object is masked from 'package:RCurl':
## 
##     complete

library(stringr)
library(knitr)
library(ggplot2)
library(DATA606)

## 
## Welcome to CUNY DATA606 Statistics and Probability for Data Analytics 
## This package is designed to support this course. The text book used 
## is OpenIntro Statistics, 3rd Edition. You can read this by typing 
## vignette('os3') or visit www.OpenIntro.org. 
##  
## The getLabs() function will return a list of the labs available. 
##  
## The demo(package='DATA606') will list the demos that are available.

## 
## Attaching package: 'DATA606'

## The following object is masked from 'package:utils':
## 
##     demo

Part 1 - Introduction:

A lot of studies have documented the importance of parent involvement for early childhood education. Parent involvement in a child's education is found to be positively associated with a child's academic performance and social functioning. The goal of this project is to find out whether there is statistically significant association between parent involvement and a child's academic performance from the data collected from Early Childhood Longitudinal Study (ECLS) program conducted by National Center for Educational Statistics (NCES) from 2010 through 2012.

Parent involvement will be categorized into three portions:

parent involvement in preschool education
parent involvement in school activities
parent involvement in summer learning

Research question

Q1.1. Do parent involvement in preschool education influence child early academy outcomes?

Taking the first set of explanatory variables, I will study contemporaneous association between parent involvement in school activities and reading achievement for second grade.

Q1.2. Do parent involvement in school activities influence child early academy outcomes?

Taking the second set of explanatory variables, I will take the advantage of longitudinal follow-up data to investigate the associations between parent involvement in preschool and kindergarten and reading achievement in later grade.

Q1.3. Do parent involvement in summer learning influence child early academy outcomes?

Taking the third set of explanatory variables, I will study contemporaneous association between parent involvement in summer learning and reading achievement for second grade.

Q2. Does school have a rule of that student enter kindergarten must turn 5 before September 1st have effects on children's academic performance in the early childhood?

Q3. Does parents' education level have effects on children's academic performance in the early childhood?

Part 2 - Data:

Data collection

The ECLS-K:2011 is the third and latest study in the Early Childhood Longitudinal Study (ECLS) program collected by the National Center for Educational Statistics (NCES) from 2010 fall to 2013 spring. Survy were conduct every semester and data from the 2010 fall, 2011 spring, 2012 spring, and 2013 spring are more completed.

Data is collected by NCES and is available online here: https://nces.ed.gov/ecls/dataproducts.asp. For this project, data were download and loaded into SPSS as the website suggests, cleaned up (remove most of the repressed variables) and converted to CSV files.

The manual "ECLS-K:2011 Kindergarten User's Manual, Public Version PDF File" and "ECLS-K:2011 Kindergarten-Second Grade User's Manual, Public Version PDF File" is available here: https://nces.ed.gov/ecls/dataproducts.asp. The Eclectronic Code Book could be downloaded from the website too.

Cases

The data is from Eearly Childhood Longitutinal Study following a cohort of children from their kindergarten year (the 2010-11 school year, referred to as the base year) through the 2011-12 school year. The sample includes both children who were in kindergarten for the first time and those who were repeating kindergarten during 2010-11. Students from about 1,310 schools and their parents, teachers, school administrators, and before- and after-school care providers participated in the study. Each case represents children who participated, or whose parent participated, in at least one of the two kindergarten data collections (Fall 2010 or Spring 2011).

There are currently 18174 cases in the earlychildhood dataset under consideration. However after data cleaning and considering only complete cases , the number of cases may reduce a bit.

variables

Response variables

For both of the questions, the response variables are the scores for four academic assessments: reading, math, science, Dimensional Change Card Sort(DCCS, for kindergarten and first grade), and DCCS computed(adjusted DCCS for second grade). They are numerical variables. This porject will foncus on the second grade results.

Explanatory variables

For Q1.1, the explanatory variable is September 1st cutoff. It is categorical variable. For Q1.2, the explanatory variable is parents' highest education level. I would take this variable as numerical variable. American education can be convert as following: a high school level of education is equivalent to 12 years'; an Associate's Degree is equivalent to 14 years', a B.S./B.A. is equivalent to 16 years', etc. a Master's Degree is equivalent to 17-18 years', and Master beyondis equivalent to >= 20 years'.

For Q2, explanatory variables are parent involvment related to early literacy: parent volunteers regularly(Prthlp),(PrtW)

The explanatory variables in the dataset are listed in the table below :

X	Name	Lables	Units	Levels
1	CHILDID	CHILD IDENTIFICATION NUMBER		NA
2	K1_AGE	CHILD ASSESSMENT AGE: Kindergarten_1	years	NA
3	K2_AGE	CHILD ASSESSMENT AGE: Kindergarten_2	years	NA
4	G1_2_AGE	CHILD ASSESSMENT AGE: Grade1_2		NA
5	G2_2_AGE	CHILD ASSESSMENT AGE: Grade2_2		NA
6	CHSEX	CHILD COMPOSITE SEX		NA
7	K1_READ	READING: Kindergarten_1		NA
8	K2_READ	READING: Kindergarten_2		NA
9	G1_2_READ	READING: Grade1_2		NA
10	G2_2_READ	READING: Grade2_2		NA
11	K1_Math	MATH: Kindergarten_1		NA
12	K2_Math	MATH: Kindergarten_2		NA
13	G1_2_Math	MATH: Grade1_2		NA
14	G2_2_Math	MATH: Grade2_2		NA
15	K2_SCI	SCIENCE: Kindergarten_2		NA
16	G1_2_SCI	SCIENCE: Grade1_2		NA
17	G2_2_SCI	SCIENCE: Grade2_2		NA
18	K1_DCCSTOT	Dimensional Change Card Sort: Kindergarten_1		NA
19	K2_DCCSTOT	Dimensional Change Card Sort: Grade1_2		NA
20	G1_2_DCCSTOT	Dimensional Change Card Sort: Grade2_2		NA
21	G2_2_DCCSSCR	Dimensional Change Card Sort computed: Grade2_2		NA
22	K2_GIFK	GIFTED-TALENT NOT OFFERED IN K		2
23	G1_2_GIFK	GIFTED-TALENT NOT OFFERED IN G1		2
24	K2_GIFS	GIFTED-TALENT NOT OFFERED AT SCHOOL		2
25	G1_2_GIFS	GIFTED-TALENT NOT OFFERED AT SCHOOL		2
26	G2_2_Sep1Cut	A CUTOFF DATE FOR CHILD TO TURN FIVE TO ENTER KINDERGARTEN		2
27	G2_2_Sep1Cut_t	convert G2_2_Sep1Cut value to 1 or 0		2
28	K2_CLSI	CLASS SIZES DECREASED: Kindergarten_2		2
29	G1_2_CLSI	CLASS SIZES DECREASED: Grade1_2		2
30	G2_2_CLSI	CLASS SIZES DECREASED:Grade2_2		2
31	PreK	PRESCHOOL GOOD FOR KINDERGARTEN		5
32	PreKt	convert PreK value to number		5
33	K1_LOC	LOCATION TYPE OF SCHOOL: Kindergarten_1		4
34	K2_LOC	LOCATION TYPE OF SCHOOL: Kindergarten_2		4
35	G1_2_LOC	LOCATION TYPE OF SCHOOL: Grade1_2		4
36	G2_2_LOC	LOCATION TYPE OF SCHOOL: Grade1_2		4
37	K1_LOCt	convert K1_LOC value to number: 1=CITY, 2=SUBURB, 3=TOWN, 4=RURAL		4
38	K2_LOCt	convert K2_LOC value to number: 1=CITY, 2=SUBURB, 3=TOWN, 4=RURAL		4
39	G1_2_LOCt	convert G1_2_LOCC value to number: 1=CITY, 2=SUBURB, 3=TOWN, 4=RURAL		4
40	G2_2_LOCt	convert G2_2_LOC value to number: 1=CITY, 2=SUBURB, 3=TOWN, 4=RURAL		4
41	K2_Prthlp	PARENT VOLUNTEERS REGULARLY: Kindergarten_2		6
42	G1_2_Prthlp	PARENT VOLUNTEERS REGULARLY: Grade1_2		6
43	G2_2_Prthlp	PARENT VOLUNTEERS REGULARLY: Grade2_2		6
44	K1_PrtW	PARENTS WORK HOURS PER WEEK: Kindergarten_1		2
45	G1_2_PrtW	PARENTS WORK HOURS PER WEEK: Grade1_2		2
46	G2_2_PrtW	PARENTS WORK HOURS PER WEEK: Grade2_2		2
47	PrtEDU	HIGHEST EDUCATION LEVEL PARENTS ACHIEVED		7
48	PrtEDUt	convert PrtEDU to number: higher number means higher education level		7
49	K2_Prtconf	PARENT ATTEND OPEN HOUSE/PARTY: Kindergarten_2		5
50	G1_2_Prtconf	PARENT ATTEND CONFERENCES: Grade1_2		5
51	G2_2_Prtconf	PARENT ATTEND OPEN HOUSE/PARTY:Grade2_2		5
52	K2_Prtoph	PARENT ATTEND OPEN HOUSE/PARTY: Kindergarten_2		5
53	G1_2_Prtoph	PARENT ATTEND OPEN HOUSE/PARTY: Grade1_2		5
54	G2_2_Prtoph	PARENT ATTEND OPEN HOUSE/PARTY:Grade2_2		5
55	G1_2_Prtevt	PARENT ATTEND ART/MUSIC EVENT: Grade1_2		5
56	G2_2_Prtevt	PARENT ATTEND ART/MUSIC EVENT: Grade2_2		5
57	G1_2_Prtpsk	PARENT OPINOION ON HAVING CHILDREN IN PRESCHOOL		2
58	G2_2_Prtpsk	PARENT OPINOION ON HAVING CHILDREN IN PRESCHOOL		2
59	Prtpremth	PARENT OPINOION ON HAVING PRESCH RD/MATH GOOD FOR SCHOOL		5
60	Prtprelitr	PARENT OPINOION ON HAVING CHILD KNOW ALPHABET BEFORE K		5
61	Prthw	PARENT PROVIDES HOMEWORK TIME		5
62	Prtrc	PARENT SHD READ/COUNT WITH CHILD		5
63	Sumsch	CHILD ATTENDED SUMMER SCHOOL		2
64	Summth	DO MATH ACTVTY WITH CHILD IN SUMMER		5
65	Sumwrt	DO WRITING ACTVTY WITH CHILD IN SUMMER		5

Type of study

The data used in this project comes from a longitudinal study in which data is gathered for the same subjects repeatedly over a period of time. The subjects were interviewed without any assignment process or experimental design. The data collection process does not interfere with how the data arise, thus it can be classified as an observational study.

Scope of inference

The population of interest for ECLS pragram consists of all US students from kindergarten through the fifth grade (*only kindergarten through the second grade data from ECLS pragram were sucessful achieved and used in this project.). Since the respondents where randomly selected then the findings can be generalized.

According to DATA QUALITY AND COMPARABILITY of ECLS pragram on NSES website, the potential errors and bias of the study includ: respondent bias, coverage errors and bias, and nonresponse errors and bias.

One potential source of respondent bias in the ECLS surveys is social desirability bias, when respondents systematically misreport (intentionally or unintentionally) information in a study.

To evaluate and minimize the error and bias, the researchers use the following methods:

"In order to minimize bias, all items were subjected to multiple cognitive interviews and field tests, and actual teachers were involved in the design of the cognitive assessment battery and questionnaires. NCES also followed the criteria recommended in a working paper on the accuracy of teachers' judgments of students' academic performances (see Perry and Meisels 1996)."
By designing the child assessments to be both individually administered and untimed, both coverage error and bias were reduced.
Three methods had been used to to determine if substantial bias was introduced into the data from the kindergarten collections as a result of nonresponse. Findings from these analyses suggest that there is not a substantial bias in the kindergarten year due to nonresponse after adjusting for that nonresponse.

The ECLS program design minimizes potential bias so for the purpose of this project the ECLS results will be considered fully generalizable.

Scope of inference

Since it is observation data, and not the experiment, there can be no causality, but only correlation.

The data

# Load data from Github.
url <- getURL("https://raw.githubusercontent.com/YunMai-SPS/DA606/master/final_project/ECLS_2011_K2.csv")
 
library(data.table)

## -------------------------------------------------------------------------

## data.table + dplyr code now lives in dtplyr.
## Please library(dtplyr)!

## -------------------------------------------------------------------------

## 
## Attaching package: 'data.table'

## The following objects are masked from 'package:dplyr':
## 
##     between, first, last

earlyedu <- fread(url, header = T, sep = ',')
kable(head(earlyedu))

CHILDID	X6AGE	X5AGE	X4AGE	X3AGE	X2KAGE_R	X1KAGE_R	X_CHSEX_R	X1RTHETK2	X2RTHETK2	X3RTHETK2	X4RTHETK2	X5RTHETK2	X6RTHETK2	X1MTHETK2	X2MTHETK2	X3MTHETK2	X4MTHETK2	X5MTHETK2	X6MTHETK2	X2STHETK2	X3STHETK2	X4STHETK2	X5STHETK2	X6STHETK2	X1DCCSTOT	X2DCCSTOT	X3DCCSTOT	X4DCCSTOT	X5DCCSSCR	X6DCCSSCR	X6REGION	X5REGION	X4REGION	X3REGION	X2REGION	X1REGION	X6LOCALE	X5LOCALE	X4LOCALE	X3LOCALE	X2LOCALE	X1LOCALE	X6PAR1EMP_I	X4PAR1EMP_I	X1PAR1EMP	X4PAR1ED_I	X6PAR1OCC_I	X4PAR1OCC_I	X1PAR1OCC_I	X6PAR2OCC_I	X4PAR2OCC_I	X1PAR2OCC_I	X6PAR1SCR_I	X4PAR1SCR_I	X1PAR1SCR_I	X6DISTPOV	X4DISTPOV	X_DISTPOV	S2GIFNOG	S2GIFNO	S4GIFNO	S4GIFNOG	S6GIFNOG	S6GIFNO	A1ATNDPR	S6TTLPRE	S4TTLPRE	A1INKNDR	A1VSTK	A1SHRTN	A1STAGGR	A1PRNTOR	A1HMEVST	A1COMM	A1IDCOLO	A1FOLWDR	A1ALPHBT	A1SITSTI	A1SENSTI	A1ENGLAN	A1NOTDSR	A1PENCIL	A1PRBLMS	A1SHARE	A1CNT20	A1FNSHT	S6CCLSDE	S6TT1CLA	S4TT1CLA	S2TT1CLA	A1FRMLIN	A1ALPHBF	A1LRNREA	A1TCHPRN	A1PRCTWR	A1HMWRK	A1READAT	A1CNTRLC	A1ENJOY	A1MKDIFF	A1TEACH	A1CLSSIZ	A1NATEXM	A1EARLY	A1ELEM	A1DEVLP	A1MTHDRD	A1MTHDMA	A1MTHDSC	A1RSPINT	A1INTSRV	A1STATCT	A1HIGHQL	A1YRBORN	A1HGHSTD	A1HGHPAR	A1YRSCH	P3DOMATH	P3DOWRIT	P3RDBKTC	P3HWLGRD	P3RDALON	P3COMEDU	P3OUTACT	P3TVHR	P3TVMIN	P3VIDHR	P3VIDMIN	P3VISLIB	P3STHLIB	P3SUMBK	P3SUMRD	P3ARTMUS	P3ZOOS	P3AMUSPK	P3BEACHS	P3PLYCRT	P3LRGCTY	P3SUMSCH	P3NDYPRM	P3NHRPRM	P3SMREAD	P3SMMATH	P3SMSCI	P3SMART	P3SMMUSI	P3SMCMPT	P3SMREQ	P3DONCMP	P3NUMCMP	P3NMDCMP	P3NMHCMP	P3NMWCMP	P3CMPSPT	P3CMPART	P3CMPCPT	P3CMPACA	P3CMPMPA	P3CMPSUP	P3TUTOR	A2REGHLP	A4REGHLP	A6REGHLP	A2TPCONF	A4TPCONF	A6TPCONF	A2ATTOPN	A4ATTOPN	A6ATTOPN	VAR00001	A4ATTART	A6ATTART
10000001	103.66	NA	91.66	NA	79.76	72.39	1	0.1930	1.7614	NA	3.1523	NA	3.3637	0.5827	1.5384	NA	2.7286	NA	3.2950	1.3738	NA	3.3354	NA	2.5071	16	17	NA	16	NA	7.3280	-2	-2	-2	-2	-2	-2	4	NA	4	NA	4	4	4	4	4	5	-1	-1	-1	19	19	19	-1.00	-1.00	-1.00	38	39	45	-9	-9	1	0	0	0	3	1	2	1	1	1	2	1	2	3	2	3	2	3	3	2	3	3	2	3	2	3	2	1	1	2	3	2	4	4	3	2	4	4	5	5	4	5	1	1	1	1	1	1	1	1	2	1	3	7	5	6	5	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	3	2	2	5	5	5	3	5	3	3	5	5
10000002	NA	NA	89.56	NA	77.52	71.41	2	-0.7870	0.7452	NA	2.2023	NA	NA	-0.3473	0.8800	NA	2.0825	NA	NA	0.8811	NA	1.3122	NA	NA	17	16	NA	16	NA	NA	-2	-2	-2	-2	-2	-2	NA	NA	2	NA	2	2	NA	4	4	5	NA	-1	-1	NA	1	1	NA	-1.00	-1.00	NA	7	7	1	1	1	0	NA	NA	3	NA	-1	1	1	2	2	1	2	4	3	4	3	3	3	4	4	2	4	5	3	3	NA	NA	-1	2	2	3	4	2	2	1	5	5	5	5	5	2	1	1	1	1	1	1	1	2	2	1	1	2	5	5	27	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	2	4	NA	5	5	NA	5	4	NA	NA	4	NA
10000003	96.26	NA	84.33	NA	73.51	NA	1	NA	0.3323	NA	1.1861	NA	2.0689	NA	1.0112	NA	2.2080	NA	2.8384	0.4244	NA	2.2479	NA	2.3529	NA	14	NA	16	NA	8.2395	-2	-2	-2	-2	-2	-2	2	NA	2	NA	2	NA	1	1	NA	9	7	7	NA	-1	-1	NA	77.50	77.50	NA	12	11	9	2	1	1	0	0	1	NA	-1	-1	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	1	-1	-1	-1	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	2	2	4	5	5	5	5	5	5	5	5	4
10000004	101.19	NA	89.23	NA	78.84	72.39	1	-1.7087	-0.2922	NA	0.8632	NA	1.5117	-2.1245	-0.2834	NA	1.1082	NA	2.8392	0.6509	NA	0.2933	NA	1.7845	15	14	NA	17	NA	7.2868	-2	-2	-2	-2	-2	-2	4	NA	4	NA	4	4	NA	1	1	5	NA	13	13	NA	16	16	NA	38.18	38.18	20	20	17	-9	-9	0	0	0	0	5	1	1	2	1	2	2	1	2	5	4	4	5	5	3	-9	5	5	3	4	4	4	1	1	2	2	5	5	5	4	4	2	5	5	5	5	5	5	2	2	2	1	2	2	2	2	2	1	1	5	5	5	6	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	2	2	3	3	3	3	3	4	3	3	5	3
10000005	NA	NA	NA	NA	65.06	59.51	2	-0.0734	0.8013	NA	NA	NA	NA	-0.5545	0.6163	NA	NA	NA	NA	0.3996	NA	NA	NA	NA	14	17	NA	NA	NA	NA	-2	-2	-2	-2	-2	-2	NA	NA	NA	NA	2	2	NA	NA	1	NA	NA	NA	4	NA	NA	1	NA	NA	59.00	NA	NA	-1	NA	NA	NA	NA	NA	NA	5	NA	NA	1	1	2	2	1	2	5	4	5	5	5	4	4	5	4	3	5	4	4	NA	NA	NA	NA	5	5	5	5	5	5	5	4	5	5	5	5	1	1	2	1	1	1	1	2	2	1	2	5	4	3	7	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	2	NA	NA	5	NA	NA	5	NA	NA	NA	NA	NA
10000006	107.51	NA	94.82	NA	82.62	75.78	2	-1.4601	-0.6792	NA	1.0132	NA	1.9750	-1.0616	-0.5379	NA	0.4484	NA	1.0869	-0.6320	NA	-0.7748	NA	-0.2110	16	14	NA	18	NA	5.3872	-2	-2	-2	-2	-2	-2	4	NA	4	NA	4	4	3	1	1	3	20	19	19	-1	-1	-1	35.92	33.42	33.42	10	12	11	-9	-9	0	0	0	0	4	2	1	2	1	1	1	1	2	4	4	4	3	4	4	3	3	4	3	4	3	4	1	1	1	2	3	3	4	4	3	2	4	4	5	4	4	5	2	2	1	1	1	1	1	2	2	1	1	8	5	2	8	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	2	2	2	5	5	5	2	5	2	2	5	5

# Load value lables from Github.
lable <- read.csv("https://raw.githubusercontent.com/YunMai-SPS/DA606/master/final_project/variablelable.csv")
kable(head(lable,n=10))

Variable	Position	Label	Measurement.Level	Role	Column.Width	Alignment	Print.Format	Write.Format
CHILDID	1	CHILD IDENTIFICATION NUMBER	Nominal	Input	10	Left	A8	A8
X6AGE	2	X6 CHILD ASSESSMENT AGE(MNTHS)	Scale	Input	9	Right	F7.2	F7.2
X5AGE	3	X5 CHILD ASSESSMENT AGE(MNTHS)	Scale	Input	9	Right	F7.2	F7.2
X4AGE	4	X4 CHILD ASSESSMENT AGE(MNTHS)	Scale	Input	8	Right	F6.2	F6.2
X3AGE	5	X3 CHILD ASSESSMENT AGE(MNTHS)	Scale	Input	8	Right	F6.2	F6.2
X2KAGE_R	6	X2 CHILD ASSESSMENT AGE(MNTHS)-REV	Scale	Input	10	Right	F6.2	F6.2
X1KAGE_R	7	X1 CHILD ASSESSMENT AGE(MNTHS)-REV	Scale	Input	10	Right	F6.2	F6.2
X_CHSEX_R	8	CHILD COMPOSITE SEX - REVISED	Scale	Input	11	Right	F2	F2
X1RTHETK2	9	X1 READING THETA-K2 DATA FILE	Scale	Input	11	Right	F8.4	F8.4
X2RTHETK2	10	X2 READING THETA-K2 DATA FILE	Scale	Input	11	Right	F8.4	F8.4

value <- read.csv("https://raw.githubusercontent.com/YunMai-SPS/DA606/master/final_project/variablevalue.csv")
kable(head(value, n=10))

Value	Label
X6AGE	-9: NOT ASCERTAINED
X5AGE	-9: NOT ASCERTAINED
X4AGE	-9: NOT ASCERTAINED
X3AGE	-9: NOT ASCERTAINED
X2KAGE_R	-9: NOT ASCERTAINED
X1KAGE_R	-9: NOT ASCERTAINED
X_CHSEX_R	-9: NOT ASCERTAINED
X_CHSEX_R	1: MALE
X_CHSEX_R	2: FEMALE
X1RTHETK2	-9: NOT ASCERTAINED

Select the semester for anlysis.The prefix of the variables name could be infered from the table below. The prefix are 1:2010 fall(K1): 2011 spring(K2); 3, 2011 fall(11f); 4: 2012 sprinsg(G1_2); 5: 2012 fall(12f): 6: 2013 spring(G2_2).

Prefix of the variables name

# There are a lot of columns in these data frames. I will subset the dataframes to get the variables relevant to the analysis and also rename the columns for reading convenience.

# The data were collected through 6 consecutive semesters. In some semesters the survay were only done in one thirds of students. 

# The data from the survay with same sample size will be selected for this project. I will subset the data to get the survey results from 4 semesters: Kindergarten 1st semester(1), Kindergarten 2nd semester(2), first grade 2nd semester(4), second grade 2nd semester(6). 

# reading scores
myreading <- c("CHILDID","X1RTHETK2",   "X2RTHETK2",    "X4RTHETK2",    "X6RTHETK2") 
reading <- subset(earlyedu,,myreading)
reading$X1RTHETK2 <- as.character(reading$X1RTHETK2)
reading$X2RTHETK2 <- as.character(reading$X2RTHETK2)
reading$X4RTHETK2 <- as.character(reading$X4RTHETK2)
reading$X6RTHETK2 <- as.character(reading$X6RTHETK2)
reading$X1RTHETK2 <- str_replace_all(reading$X1RTHETK2,"(^\\-9)","")
reading$X2RTHETK2 <- str_replace_all(reading$X2RTHETK2,"(^\\-9)","")
reading$X4RTHETK2 <- str_replace_all(reading$X4RTHETK2,"(^\\-9)","")
reading$X6RTHETK2 <- str_replace_all(reading$X6RTHETK2,"(^\\-9)","")
reading$X1RTHETK2 <- as.numeric(reading$X1RTHETK2)
reading$X2RTHETK2 <- as.numeric(reading$X2RTHETK2)
reading$X4RTHETK2 <- as.numeric(reading$X4RTHETK2)
reading$X6RTHETK2 <- as.numeric(reading$X6RTHETK2)
colnames(reading) <- c("CHILDID", "K1_READ", "K2_READ", "G1_2_READ", "G2_2_READ")

# math scores
mymath <- c("CHILDID","X1MTHETK2",  "X2MTHETK2",    "X4MTHETK2",    "X6MTHETK2") 
math <- subset(earlyedu,,mymath)
math$X1MTHETK2 <- as.character(math$X1MTHETK2)
math$X2MTHETK2 <- as.character(math$X2MTHETK2)
math$X4MTHETK2 <- as.character(math$X4MTHETK2)
math$X6MTHETK2 <- as.character(math$X6MTHETK2)
math$X1MTHETK2 <- str_replace_all(math$X1MTHETK2,"(^\\-9)","")
math$X2MTHETK2 <- str_replace_all(math$X2MTHETK2,"(^\\-9)","")
math$X4MTHETK2 <- str_replace_all(math$X4MTHETK2,"(^\\-9)","")
math$X6MTHETK2 <- str_replace_all(math$X6MTHETK2,"(^\\-9)","")
math$X1MTHETK2 <- as.numeric(math$X1MTHETK2)
math$X2MTHETK2 <- as.numeric(math$X2MTHETK2)

## Warning: NAs introduced by coercion

math$X4MTHETK2 <- as.numeric(math$X4MTHETK2)
math$X6MTHETK2 <- as.numeric(math$X6MTHETK2)
colnames(math) <- c("CHILDID", "K1_Math", "K2_Math", "G1_2_Math", "G2_2_Math")

# science scores
myscience <- c("CHILDID", "X2STHETK2",  "X4STHETK2",    "X6STHETK2") 
science <- subset(earlyedu,,myscience)
science$X2STHETK2 <- as.character(science$X2STHETK2)
science$X4STHETK2 <- as.character(science$X4STHETK2)
science$X6STHETK2 <- as.character(science$X6STHETK2)
science$X2STHETK2 <- str_replace_all(science$X2STHETK2,"(^\\-9)","")
science$X4STHETK2 <- str_replace_all(science$X4STHETK2,"(^\\-9)","")
science$X6STHETK2 <- str_replace_all(science$X6STHETK2,"(^\\-9)","")
science$X2STHETK2 <- as.numeric(science$X2STHETK2)
science$X4STHETK2 <- as.numeric(science$X4STHETK2)

## Warning: NAs introduced by coercion

science$X6STHETK2 <- as.numeric(science$X6STHETK2)
colnames(science) <- c("CHILDID", "K2_SCI", "G1_2_SCI", "G2_2_SCI")

# The Dimensional Change Card Sort (DCCS): A Method of Assessing Executive Function in Children
myDCCSTOT <- c("CHILDID", "X1DCCSTOT",  "X2DCCSTOT",    "X4DCCSTOT") 
DCCSTOT <- subset(earlyedu,,myDCCSTOT)
DCCSTOT$X1DCCSTOT <- as.character(DCCSTOT$X1DCCSTOT)
DCCSTOT$X2DCCSTOT <- as.character(DCCSTOT$X2DCCSTOT)
DCCSTOT$X4DCCSTOT <- as.character(DCCSTOT$X4DCCSTOT)
DCCSTOT$X1DCCSTOT <- str_replace_all(DCCSTOT$X1DCCSTOT,"(^\\-9)","")
DCCSTOT$X2DCCSTOT <- str_replace_all(DCCSTOT$X2DCCSTOT,"(^\\-9)","")
DCCSTOT$X4DCCSTOT <- str_replace_all(DCCSTOT$X4DCCSTOT,"(^\\-9)","")
DCCSTOT$X1DCCSTOT <- as.numeric(DCCSTOT$X1DCCSTOT)
DCCSTOT$X2DCCSTOT <- as.numeric(DCCSTOT$X2DCCSTOT)
DCCSTOT$X4DCCSTOT <- as.numeric(DCCSTOT$X4DCCSTOT)
colnames(DCCSTOT) <- c("CHILDID", "K1_DCCSTOT", "K2_DCCSTOT", "G1_2_DCCSTOT")

# DCCS composite score by summing the post-switch score and the Border Game score. Relative completed survay results are only available at 6th semester.
myDCCSSCR <- c("CHILDID", "X6DCCSSCR") 
DCCSSCR <- subset(earlyedu,,myDCCSSCR)
DCCSSCR$X6DCCSSCR <- as.character(DCCSSCR$X6DCCSSCR)
DCCSSCR$X6DCCSSCR <- str_replace_all(DCCSSCR$X6DCCSSCR,"(^\\-9)","")
DCCSSCR$X6DCCSSCR <- as.numeric(DCCSSCR$X6DCCSSCR)
colnames(DCCSSCR) <- c("CHILDID", "G2_2_DCCSSCR")

# GIFTED-TALENT not offered in K. Survay results are only available at 2nd and 4th semester.
GTKlevels <- c("yes", "no", "NA")
myGTK <- c("CHILDID", "S2GIFNO","S4GIFNO") 
GTK <- subset(earlyedu,,myGTK)
GTK$S2GIFNO <- str_replace_all(GTK$S2GIFNO,"(^\\-9)|(^\\-1)","NA") %>% 
  str_replace_all("1","yes") %>% 
  str_replace_all("2","no")
GTK$S4GIFNO <- str_replace_all(GTK$S4GIFNO,"(^\\-9)|(^\\-1)","NA") %>% 
  str_replace_all("0","yes") %>%
  str_replace_all("1","no")
colnames(GTK) <- c("CHILDID", "K2_GIFK", "G1_2_GIFK")

# GIFTED-TALENT offered at some grades or not offered at the school.Survay results are only available at 2nd and 4th semester.
GTSlevels <- c("yes", "no", "NA")
myGTS <- c("CHILDID", "S2GIFNOG","S4GIFNOG") 
GTS <- subset(earlyedu,,myGTS)
GTS$S2GIFNOG <- str_replace_all(GTS$S2GIFNOG,"(^\\-9)|(^\\-1)","NA") %>% 
   str_replace_all("1","yes") %>% 
  str_replace_all("2","no")
GTS$S4GIFNOG <- str_replace_all(GTS$S4GIFNOG,"(^\\-9)|(^\\-1)","NA") %>%
  str_replace_all("0","yes") %>% 
  str_replace_all("1","no")
colnames(GTS) <- c("CHILDID", "K2_GIFS", "G1_2_GIFS")

# A CUTOFF DATE FOR CHILD TO TURN FIVE TO ENTER KINDERGARTEN
Sep1cutofflevels <- c(1,2,NA)
mycutoff <- c("CHILDID", "S6GIFNO") 
Sep1cutoff <- subset(earlyedu,,mycutoff)
Sep1cutoff$S6GIFNOt <- str_replace_all(Sep1cutoff$S6GIFNO,"0","yes") %>% 
  str_replace_all("1","no")
colnames(Sep1cutoff) <- c("CHILDID", "G2_2_Sep1Cut", "G2_2_Sep1Cut_t")

# Whether PreK is helpful for prepare children for Kindergarten.
PreKlevles <- c(1, 2, 3, 4, 5)
myprek <- c("CHILDID", "A1ATNDPR") 
PreK <- subset(earlyedu,,myprek)
PreK$A1ATNDPR  <- str_replace_all(PreK$A1ATNDPR,"^\\-9","NA") 
PreK$A1ATNDPRt  <- str_replace_all(PreK$A1ATNDPR, "1","STRONGLY DISAGREE") %>% 
  str_replace_all("2","DISAGREE") %>% 
  str_replace_all("3","NEITHER AGREE NOR DISAGREE") %>% 
  str_replace_all("4","AGREE") %>% 
  str_replace_all("5","STRONGLY AGREE") 
colnames(PreK) <- c("CHILDID", "PreK", "PreKt")

agelevels <- c(4, 5, 6, 7, 8)
myage <- c("CHILDID", "X1KAGE_R", "X2KAGE_R", "X4AGE", "X6AGE") 
Age <- subset(earlyedu,,myage)
Age[,2]<- Age[,2]/12
Age[,3]<- Age[,3]/12
Age[,4]<- Age[,4]/12
Age[,5]<- Age[,5]/12
Age$X1KAGE_R <-as.character(Age$X1KAGE_R)
Age$X2KAGE_R <-as.character(Age$X2KAGE_R)
Age$X4AGE <-as.character(Age$X4AGE)
Age$X6AGE <-as.character(Age$X6AGE)
Age$X1KAGE_R <-str_replace_all(Age$X1KAGE_R,"^\\-9","NA")
Age$X2KAGE_R <-str_replace_all(Age$X2KAGE_R,"^\\-9","NA")
Age$X4AGE <-str_replace_all(Age$X4AGE,"^\\-9","NA")
Age$X6AGE <-str_replace_all(Age$X6AGE,"^\\-9","NA")
Age$X1KAGE_R <-as.numeric(Age$X1KAGE_R)
Age$X2KAGE_R <-as.numeric(Age$X2KAGE_R)
Age$X4AGE <-as.numeric(Age$X4AGE)
Age$X6AGE <-as.numeric(Age$X6AGE)
colnames(Age) <- c("CHILDID", "K1_AGE", "K2_AGE", "G1_2_AGE", "G2_2_AGE")

genderLevels <- c('Male', 'Female')
mygender <- c("CHILDID", "X_CHSEX_R") 
Gender <- subset(earlyedu,,mygender)
Gender$X_CHSEX_R <- str_replace_all(Gender$X_CHSEX_R,"1","Male") %>% 
  str_replace_all("2","Female")
Gender$X_CHSEX_R <- str_replace_all(Gender$X_CHSEX_R,"^\\-9","NA")
colnames(Gender) <- c("CHILDID", "CHSEX")

#class size decrease?
decreasesizelevel <- c("yes", "no", "NA")
mydecreasesize <- c("CHILDID", "S2TT1CLA", "S4TT1CLA", "S6TT1CLA") 
Classsize <- subset(earlyedu,,mydecreasesize)
Classsize$S2TT1CLA <- str_replace_all(Classsize$S2TT1CLA,"(^\\-9)|(^\\-1)","NA") %>% 
  str_replace_all("1","yes") %>% 
  str_replace_all("2","no")
Classsize$S4TT1CLA <- str_replace_all(Classsize$S4TT1CLA,"(^\\-9)|(^\\-1)","NA") %>% 
  str_replace_all("1","yes") %>%
  str_replace_all("2","no")
Classsize$S6TT1CLA <- str_replace_all(Classsize$S6TT1CLA,"(^\\-9)|(^\\-1)","NA") %>% 
  str_replace_all("1","yes") %>% 
  str_replace_all("2","no")
colnames(Classsize) <- c("CHILDID", "K2_CLSI", "G1_2_CLSI", "G2_2_CLSI")
Classsize$K2_CLSIt <- str_replace_all(Classsize$K2_CLSI, "yes", "1") %>%
  str_replace_all("no", "2")
Classsize$G1_2_CLSIt <- str_replace_all(Classsize$G1_2_CLSI, "yes", "1") %>%
  str_replace_all("no", "2") 
Classsize$G2_2_CLSIt <- str_replace_all(Classsize$G2_2_CLSI, "yes", "1") %>%
  str_replace_all("no", "2") 

# region of the school located
regionlevels <-c(1,2,3,4,NA)
myregion <- c("CHILDID", "X1LOCALE", "X2LOCALE", "X4LOCALE", "X6LOCALE") 
Region <- subset(earlyedu,,myregion)
Region$X1LOCALE <- str_replace_all(Region$X1LOCALE,"(^\\-9)|(^\\-1)","NA") 
Region$X1LOCALt <- str_replace_all(Region$X1LOCALE, "1", "CITY") %>%
  str_replace_all("2", "SUBURB") %>% 
  str_replace_all("3", "TOWN") %>% 
  str_replace_all("4", "RURAL") 
Region$X2LOCALE <- str_replace_all(Region$X2LOCALE,"(^\\-9)|(^\\-1)","NA") 
Region$X2LOCALt <- str_replace_all(Region$X2LOCALE, "1", "CITY") %>%
  str_replace_all("2", "SUBURB") %>% 
  str_replace_all("3", "TOWN") %>% 
  str_replace_all("4", "RURAL") 
Region$X4LOCALE <- str_replace_all(Region$X4LOCALE,"(^\\-9)|(^\\-1)","NA")
Region$X4LOCALt <- str_replace_all(Region$X4LOCALE, "1", "CITY") %>%
  str_replace_all("2", "SUBURB") %>% 
  str_replace_all("3", "TOWN") %>% 
  str_replace_all("4", "RURAL")
Region$X6LOCALE <- str_replace_all(Region$X6LOCALE,"(^\\-9)|(^\\-1)","NA")
Region$X6LOCALt <- str_replace_all(Region$X6LOCALE, "1", "CITY") %>%
  str_replace_all("2", "SUBURB") %>% 
  str_replace_all("3", "TOWN") %>% 
  str_replace_all("4", "RURAL")
colnames(Region) <- c("CHILDID", "K1_LOC","K2_LOC", "G1_2_LOC", "G2_2_LOC", "K1_LOCt","K2_LOCt", "G1_2_LOCt", "G2_2_LOCt")

# Parents getting involve in school volunteer 
parentschoolhlplevels <- c(0, 1, 2, 3, 4,5,NA)
myhlp <- c("CHILDID", "A2REGHLP", "A4REGHLP", "A6REGHLP") 
Prthlp <- subset(earlyedu,,myhlp)
Prthlp$A2REGHLP  <- str_replace_all(Prthlp$A2REGHLP,"^\\-9","NA")
Prthlp$A4REGHLP  <- str_replace_all(Prthlp$A4REGHLP,"^\\-9","NA")
Prthlp$A6REGHLP  <- str_replace_all(Prthlp$A6REGHLP,"^\\-9","NA")
colnames(Prthlp) <- c("CHILDID", "K2_Prthlp", "G1_2_Prthlp", "G2_2_Prthlp")

# Parents work hours per week 
parentworklevels <- c("b35h", "a35h", "NA")
myprtwork <- c("CHILDID", "X1PAR1EMP","X4PAR1EMP_I", "X6PAR1EMP_I") 
Prtwork <- subset(earlyedu,,myprtwork)
Prtwork$X1PAR1EMP <- str_replace_all(Prtwork$X1PAR1EMP,"(^\\-9)|(^3)|(^4)","NA") %>% 
  str_replace_all("1","a35h") %>% 
  str_replace_all("2","blw35h")
Prtwork$X4PAR1EMP_I <- str_replace_all(Prtwork$X4PAR1EMP_I,"(^\\-9)|(^3)|(^4)","NA") %>% 
  str_replace_all("1","a35h") %>% 
  str_replace_all("2","blw35h")
Prtwork$X6PAR1EMP_I <- str_replace_all(Prtwork$X6PAR1EMP_I,"(^\\-9)|(^3)|(^4)","NA") %>% 
  str_replace_all("1","a35h") %>% 
  str_replace_all("2","blw35h")
colnames(Prtwork) <- c("CHILDID", "K1_PrtW", "G1_2_PrtW", "G2_2_PrtW")
Prtwork$K1_PrtWt <- str_replace_all(Prtwork$K1_PrtW, "a35h", "1") %>%
  str_replace_all("blw35h", "2")
Prtwork$G1_2_PrtWt <- str_replace_all(Prtwork$G1_2_PrtW, "a35h", "1") %>%
  str_replace_all("blw35h", "2")
Prtwork$G2_2_PrtWt <- str_replace_all(Prtwork$G2_2_PrtW, "a35h", "1") %>%
  str_replace_all("blw35h", "2")

# HIGHEST EDUCATION LEVEL PARENTS ACHIEVED
Parentedulevles <- c(1, 2, 3, 4, 5, 6, 7) 
myedu <- c("CHILDID", "A1HGHPAR") 
Parentedu <- subset(earlyedu,,myedu)
Parentedu$A1HGHPAR <- str_replace_all(Parentedu$A1HGHPAR, "(\\-9)|(8)", "NA")
Parentedu$A1HGHPARt <- str_replace_all(Parentedu$A1HGHPAR,"1","DID NOT COMPLETE HIGH SCHOOL") %>%
  str_replace_all("2","HIGH SCHOOL") %>% 
  str_replace_all("3","SOME COLLEGE") %>% 
  str_replace_all("4","ASSOCIATE'S DEGREE") %>% 
  str_replace_all("5","BACHELOR") %>% 
  str_replace_all("6","MASTER") %>% 
  str_replace_all("7","BEYOND A MASTER")
colnames(Parentedu) <- c("CHILDID", "PrtEDU", "PrtEDUt")

# Parent attend conferences
parentconflevels <- c(0, 1, 2, 3, 4,5,NA)
myprtconf <- c("CHILDID", "A2TPCONF","A4TPCONF", "A6TPCONF") 
Prtconf <- subset(earlyedu,,myprtconf)
Prtconf$A2TPCONF  <- str_replace_all(Prtconf$A2TPCONF,"^\\-9","NA")
Prtconf$A4TPCONF  <- str_replace_all(Prtconf$A4TPCONF,"^\\-9","NA")
Prtconf$A6TPCONF  <- str_replace_all(Prtconf$A6TPCONF,"^\\-9","NA")
colnames(Prtconf) <- c("CHILDID", "K2_Prtconf", "G1_2_Prtconf", "G2_2_Prtconf")

# PARENT ATTEND OPEN HOUSE/PARTY
parentophlevels <- c(0, 1, 2, 3, 4,5,NA)
myparentoph  <- c("CHILDID", "A2ATTOPN","A4ATTOPN", "A6ATTOPN") 
Prtoph <- subset(earlyedu,,myparentoph)
Prtoph$A2ATTOPN  <- str_replace_all(Prtoph$A2ATTOPN,"^\\-9","NA")
Prtoph$A4ATTOPN  <- str_replace_all(Prtoph$A4ATTOPN,"^\\-9","NA")
Prtoph$A6ATTOPN  <- str_replace_all(Prtoph$A6ATTOPN,"^\\-9","NA")
colnames(Prtoph) <- c("CHILDID", "K2_Prtoph", "G1_2_Prtoph", "G2_2_Prtoph")

# PARENT ATTEND ART/MUSIC EVENT
parentevtlevels <- c(0, 1, 2, 3, 4,5,NA)
myparentevt  <- c("CHILDID", "A4ATTART","A6ATTART") 
Prtevt <- subset(earlyedu,,myparentevt)
Prtevt$A4ATTART <-as.character(Prtevt$A4ATTART )
Prtevt$A6ATTART <-as.character(Prtevt$A6ATTART )
Prtevt$A4ATTART <- str_replace_all(Prtevt$A4ATTART,"^\\-9","NA")
Prtevt$A6ATTART  <- str_replace_all(Prtevt$A6ATTART,"^\\-9","NA")
Prtevt$A4ATTART <-as.numeric(Prtevt$A4ATTART )

## Warning: NAs introduced by coercion

Prtevt$A6ATTART <-as.numeric(Prtevt$A6ATTART )

## Warning: NAs introduced by coercion

colnames(Prtevt) <- c("CHILDID", "G1_2_Prtevt", "G2_2_Prtevt")

#CHILDREN IN PRESCHOOL
parentpsklevels <- c(1, 2, NA)
myparentpsk  <- c("CHILDID", "S4TTLPRE","S6TTLPRE") 
Prtpsk <- subset(earlyedu,,myparentpsk)
Prtpsk$S4TTLPRE <- as.character(Prtpsk$S4TTLPRE)
Prtpsk$S6TTLPRE <- as.character(Prtpsk$S6TTLPRE)
Prtpsk$S4TTLPRE  <- str_replace_all(Prtpsk$S4TTLPRE,"^\\-9","NA")
Prtpsk$S4TTLPRE  <- str_replace_all(Prtpsk$S4TTLPRE,"^\\-1","NA")
Prtpsk$S6TTLPRE  <- str_replace_all(Prtpsk$S6TTLPRE,"^\\-9","NA")
Prtpsk$S6TTLPRE  <- str_replace_all(Prtpsk$S6TTLPRE,"^\\-1","NA")
Prtpsk$S4TTLPRE <- as.numeric(Prtpsk$S4TTLPRE)

## Warning: NAs introduced by coercion

Prtpsk$S6TTLPRE <- as.numeric(Prtpsk$S6TTLPRE)

## Warning: NAs introduced by coercion

colnames(Prtpsk) <- c("CHILDID", "G1_2_Prtpsk", "G2_2_Prtpsk")

#PRESCH RD/MATH GOOD FOR SCHOOL
parentpremthlevels <- c(1, 2, 3, 4,5,NA)
myparentpremth  <- c("CHILDID", "A1FRMLIN") 
Prtpremth <- subset(earlyedu,,myparentpremth)
Prtpremth$A1FRMLIN  <- str_replace_all(Prtpremth$A1FRMLIN,"^\\-9","NA")
colnames(Prtpremth) <- c("CHILDID", "Premath")

#Parent HAVE CHILD KNOW ALPHABET BEFORE K
parentlitrlevels <- c(1, 2, 3, 4,5,NA)
myparentprelitr  <- c("CHILDID", "A1ALPHBF") 
Prtprelitr <- subset(earlyedu,,myparentprelitr)
Prtprelitr$A1ALPHBF  <- str_replace_all(Prtprelitr$A1ALPHBF,"^\\-9","NA")
colnames(Prtprelitr) <- c("CHILDID", "Prelitr")

# PARENT PROVIDES HOMEWORK TIME
parenthwlevels <- c(1, 2, 3, 4,5,NA)
myparenthw  <- c("CHILDID", "A1PRCTWR") 
Prthw <- subset(earlyedu,,myparenthw)
Prthw$A1PRCTWR  <- str_replace_all(Prthw$A1PRCTWR,"^\\-9","NA")
colnames(Prthw) <- c("CHILDID", "Prthw")

# PARENT SHD READ/COUNT WITH CHILD
parentrclevels <- c( 1, 2, 3, 4,5,NA)
myparentrc  <- c("CHILDID", "A1READAT") 
Prtrc <- subset(earlyedu,,myparentrc)
Prtrc$A1READAT  <- str_replace_all(Prtrc$A1READAT,"^\\-9","NA")
colnames(Prtrc) <- c("CHILDID", "Prerc")

# CHILD ATTENDED SUMMER SCHOOL
sumschlevels <- c(1, 2,NA)
mysumsch  <- c("CHILDID", "P3SUMSCH") 
Sumsch <- subset(earlyedu,,mysumsch)
Sumsch$P3SUMSCH  <- str_replace_all(Sumsch$P3SUMSCH,"^\\-9","NA")
Sumsch$P3SUMSCH  <- str_replace_all(Sumsch$P3SUMSCH,"^\\-8","NA")
Sumsch$P3SUMSCH  <- str_replace_all(Sumsch$P3SUMSCH,"^\\-7","NA")
colnames(Sumsch) <- c("CHILDID", "Sumsch")

# DO MATH ACTVTY WITH CHILD IN SUMMER
summthlevels <- c(1,2,3,4,5,NA)
mysummth <- c("CHILDID", "P3DOMATH") 
Summth <- subset(earlyedu,,mysummth)
Summth$P3DOMATH  <- str_replace_all(Summth$P3DOMATH,"^\\-9","NA")
Summth$P3DOMATH  <- str_replace_all(Summth$P3DOMATH,"^\\-8","NA")
Summth$P3DOMATH  <- str_replace_all(Summth$P3DOMATH,"^\\-7","NA")
colnames(Summth) <- c("CHILDID", "Summth")

#DO WRITING ACTVTY WITH CHILD IN SUMMER
sumwrtlevels <- c(1,2,3,4,5,NA)
mysumwrt <- c("CHILDID", "P3DOWRIT") 
Sumwrt <- subset(earlyedu,,mysumwrt)
Sumwrt$P3DOWRIT  <- str_replace_all(Sumwrt$P3DOWRIT,"^\\-9","NA")
Sumwrt$P3DOWRIT  <- str_replace_all(Sumwrt$P3DOWRIT,"^\\-8","NA")
Sumwrt$P3DOWRIT  <- str_replace_all(Sumwrt$P3DOWRIT,"^\\-7","NA")
Sumwrt$P3DOWRIT  <- str_replace_all(Sumwrt$P3DOWRIT,"^\\-1","NA")
colnames(Sumwrt) <- c("CHILDID", "Sumwrt")

# READ BOOKS TO CHILD IN SUMMER
sumrdlevels <- c(1,2,3,4,5,NA)
mysumrd <- c("CHILDID", "P3RDBKTC") 
Sumrd <- subset(earlyedu,,mysumrd)
Sumrd$P3RDBKTC  <- str_replace_all(Sumrd$P3RDBKTC,"^\\-9","NA")
Sumrd$P3RDBKTC  <- str_replace_all(Sumrd$P3RDBKTC,"^\\-8","NA")
Sumrd$P3RDBKTC  <- str_replace_all(Sumrd$P3RDBKTC,"^\\-7","NA")
Sumrd$P3RDBKTC  <- str_replace_all(Sumrd$P3RDBKTC,"^\\-1","NA")
colnames(Sumrd) <- c("CHILDID", "Sumrd")

# HOW LONG READ TO CHILD
sumreadtimelevels <- c(0, 1, 2, 3, 4,NA)
mysumreadtime  <- c("CHILDID", "P3HWLGRD") 
Sumreadtime <- subset(earlyedu,,mysumreadtime)
Sumreadtime$P3HWLGRD  <- str_replace_all(Sumreadtime$P3HWLGRD,"^\\-9","NA")
Sumreadtime$P3HWLGRD  <- str_replace_all(Sumreadtime$P3HWLGRD,"^\\-8","NA")
Sumreadtime$P3HWLGRD  <- str_replace_all(Sumreadtime$P3HWLGRD,"^\\-7","NA")
Sumreadtime$P3HWLGRD  <- str_replace_all(Sumreadtime$P3HWLGRD,"^\\-1","NA")
colnames(Sumreadtime) <- c("CHILDID", "Sumreadtime")

#HIGHEST ED LEVEL TEACHER ACHIEVED
myteachedu <- c("CHILDID", "A1HGHSTD") 
teachedu <- subset(earlyedu,,myteachedu)
teachedu$A1HGHSTD <- as.character(teachedu$A1HGHSTD)
teachedu$A1HGHSTD  <- str_replace_all(teachedu$A1HGHSTD,"^\\-9","NA")
teachedu$A1HGHSTD <- as.numeric(teachedu$A1HGHSTD)

## Warning: NAs introduced by coercion

colnames(teachedu) <- c("CHILDID", "teachedu")

# teacher enjpy current job
myteachjoy <- c("CHILDID", "A1ENJOY") 
teachjoy <- subset(earlyedu,,myteachjoy)
teachjoy$A1ENJOY <- as.character(teachjoy$A1ENJOY)
teachjoy$A1ENJOY  <- str_replace_all(teachjoy$A1ENJOY,"^\\-9","NA")
teachjoy$A1ENJOY <- as.numeric(teachjoy$A1ENJOY)

## Warning: NAs introduced by coercion

colnames(teachjoy) <- c("CHILDID", "teachjoy")

#teacher make difference in children's life
myteachdiff <- c("CHILDID", "A1MKDIFF") 
teachdiff <- subset(earlyedu,,myteachdiff)
teachdiff$A1MKDIFF <- as.character(teachdiff$A1MKDIFF)
teachdiff$A1MKDIFF <- str_replace_all(teachdiff$A1MKDIFF,"^\\-9","NA")
teachdiff$A1MKDIFF <- as.numeric(teachdiff$A1MKDIFF)

## Warning: NAs introduced by coercion

colnames(teachdiff) <- c("CHILDID", "teachdiff")

#teacher take exam on national board
myteachboard <- c("CHILDID", "A1NATEXM") 
teachboard <- subset(earlyedu,,myteachboard)
teachboard$A1NATEXM <- as.character(teachboard$A1NATEXM)
teachboard$A1NATEXM <- str_replace_all(teachboard$A1NATEXM,"^\\-9","NA")
teachboard$A1NATEXM <- as.numeric(teachboard$A1NATEXM)

## Warning: NAs introduced by coercion

colnames(teachboard) <- c("CHILDID", "teachboard")

#
#A1EARLY

earlychildhood <- inner_join(Age, Gender, by = "CHILDID") %>%
  inner_join(reading, by = "CHILDID") %>% 
  inner_join(math, by = "CHILDID") %>% 
  inner_join(science, by = "CHILDID") %>% 
  inner_join(DCCSTOT, by = "CHILDID") %>% 
  inner_join(DCCSSCR, by = "CHILDID") %>%  
  inner_join(GTK, by = "CHILDID") %>% 
  inner_join(GTS, by = "CHILDID") %>% 
  inner_join(Sep1cutoff, by = "CHILDID") %>%
  inner_join(Classsize, by = "CHILDID") %>%
  inner_join(PreK, by = "CHILDID") %>% 
  inner_join(Region, by = "CHILDID") %>% 
  inner_join(Prthlp, by = "CHILDID") %>% 
  inner_join(Prtwork, by = "CHILDID") %>% 
  inner_join(Parentedu, by = "CHILDID") %>% 
  inner_join(Prtconf, by = "CHILDID") %>% 
  inner_join(Prtoph, by = "CHILDID") %>% 
  inner_join(Prtevt, by = "CHILDID") %>% 
  inner_join(Prtpsk, by = "CHILDID") %>% 
  inner_join(Prtpremth, by = "CHILDID") %>% 
  inner_join(Prtprelitr, by = "CHILDID") %>% 
  inner_join(Prthw, by = "CHILDID") %>% 
  inner_join(Prtrc, by = "CHILDID") %>% 
  inner_join(Sumsch, by = "CHILDID") %>% 
  inner_join(Summth, by = "CHILDID") %>% 
  inner_join(Sumwrt, by = "CHILDID") %>% 
  inner_join(Sumrd, by = "CHILDID") %>% 
  inner_join(Sumreadtime, by = "CHILDID") %>% 
  inner_join(teachedu, by = "CHILDID") %>% 
  inner_join(teachjoy, by = "CHILDID") %>% 
  inner_join(teachdiff, by = "CHILDID") %>% 
  inner_join(teachboard, by = "CHILDID")
  
kable(head(earlychildhood,n=10))

CHILDID	K1_AGE	K2_AGE	G1_2_AGE	G2_2_AGE	CHSEX	K1_READ	K2_READ	G1_2_READ	G2_2_READ	K1_Math	K2_Math	G1_2_Math	G2_2_Math	K2_SCI	G1_2_SCI	G2_2_SCI	K1_DCCSTOT	K2_DCCSTOT	G1_2_DCCSTOT	G2_2_DCCSSCR	K2_GIFK	G1_2_GIFK	K2_GIFS	G1_2_GIFS	G2_2_Sep1Cut	G2_2_Sep1Cut_t	K2_CLSI	G1_2_CLSI	G2_2_CLSI	K2_CLSIt	G1_2_CLSIt	G2_2_CLSIt	PreK	PreKt	K1_LOC	K2_LOC	G1_2_LOC	G2_2_LOC	K1_LOCt	K2_LOCt	G1_2_LOCt	G2_2_LOCt	K2_Prthlp	G1_2_Prthlp	G2_2_Prthlp	K1_PrtW	G1_2_PrtW	G2_2_PrtW	K1_PrtWt	G1_2_PrtWt	G2_2_PrtWt	PrtEDU	PrtEDUt	K2_Prtconf	G1_2_Prtconf	G2_2_Prtconf	K2_Prtoph	G1_2_Prtoph	G2_2_Prtoph	G1_2_Prtevt	G2_2_Prtevt	G1_2_Prtpsk	G2_2_Prtpsk	Premath	Prelitr	Prthw	Prerc	Sumsch	Summth	Sumwrt	Sumrd	Sumreadtime	teachedu	teachjoy	teachdiff	teachboard
10000001	6.032500	6.646667	7.638333	8.638333	Male	0.1930	1.7614	3.1523	3.3637	0.5827	1.5384	2.7286	3.2950	1.3738	3.3354	2.5071	16	17	16	7.3280	NA	no	NA	yes	0	yes	no	yes	yes	2	1	1	3	NEITHER AGREE NOR DISAGREE	4	4	4	4	RURAL	RURAL	RURAL	RURAL	3	2	2	NA	NA	NA	NA	NA	NA	6	MASTER	5	5	5	3	5	3	5	5	2	1	3	2	3	4	NA	NA	NA	NA	NA	5	5	5	1
10000002	5.950833	6.460000	7.463333	NA	Female	-0.7870	0.7452	2.2023	NA	-0.3473	0.8800	2.0825	NA	0.8811	1.3122	NA	17	16	16	NA	yes	no	yes	yes	NA	NA	no	NA	NA	2	NA	NA	3	NEITHER AGREE NOR DISAGREE	2	2	2	NA	SUBURB	SUBURB	SUBURB	NA	2	4	NA	NA	NA	NA	NA	NA	NA	5	BACHELOR	5	5	NA	5	4	NA	4	NA	NA	NA	2	3	2	5	NA	NA	NA	NA	NA	5	5	5	1
10000003	NA	6.125833	7.027500	8.021667	Male	NA	0.3323	1.1861	2.0689	NA	1.0112	2.2080	2.8384	0.4244	2.2479	2.3529	NA	14	16	8.2395	yes	no	no	yes	1	no	NA	NA	NA	NA	NA	NA	NA	NA	NA	2	2	2	NA	SUBURB	SUBURB	SUBURB	2	2	4	NA	a35h	a35h	NA	1	1	NA	NA	5	5	5	5	5	5	5	4	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
10000004	6.032500	6.570000	7.435833	8.432500	Male	-1.7087	-0.2922	0.8632	1.5117	-2.1245	-0.2834	1.1082	2.8392	0.6509	0.2933	1.7845	15	14	17	7.2868	NA	yes	NA	yes	0	yes	no	no	yes	2	2	1	5	STRONGLY AGREE	4	4	4	4	RURAL	RURAL	RURAL	RURAL	2	2	3	a35h	a35h	NA	1	1	NA	5	BACHELOR	3	3	3	3	4	3	5	3	1	1	5	5	4	5	NA	NA	NA	NA	NA	5	5	5	2
10000005	4.959167	5.421667	NA	NA	Female	-0.0734	0.8013	NA	NA	-0.5545	0.6163	NA	NA	0.3996	NA	NA	14	17	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	5	STRONGLY AGREE	2	2	NA	NA	SUBURB	SUBURB	NA	NA	2	NA	NA	a35h	NA	NA	1	NA	NA	3	SOME COLLEGE	5	NA	NA	5	NA	NA	NA	NA	NA	NA	5	5	5	5	NA	NA	NA	NA	NA	4	5	5	1
10000006	6.315000	6.885000	7.901667	8.959167	Female	-1.4601	-0.6792	1.0132	1.9750	-1.0616	-0.5379	0.4484	1.0869	-0.6320	-0.7748	-0.2110	16	14	18	5.3872	NA	yes	NA	yes	0	yes	no	yes	yes	2	1	1	4	AGREE	4	4	4	4	RURAL	RURAL	RURAL	RURAL	2	2	2	a35h	a35h	NA	1	1	NA	2	HIGH SCHOOL	5	5	5	2	5	2	5	5	1	2	3	3	3	4	NA	NA	NA	NA	NA	5	5	4	2
10000007	5.329167	5.825000	6.748333	7.797500	Female	-0.3363	0.8570	1.3456	2.4449	-0.6902	0.1032	1.4144	2.4619	0.0669	0.2036	1.0675	14	17	18	6.9906	yes	no	no	no	0	yes	no	no	NA	2	2	NA	4	AGREE	3	3	3	3	TOWN	TOWN	TOWN	TOWN	2	2	2	blw35h	blw35h	NA	2	2	NA	2	HIGH SCHOOL	5	5	5	4	5	4	5	4	2	2	4	3	4	5	NA	NA	NA	NA	NA	6	5	5	1
10000008	5.345000	5.994167	6.967500	8.005833	Female	-0.4149	0.6023	2.3156	2.2707	-0.2921	1.1411	2.7230	3.1506	0.7350	1.1363	2.5198	17	17	18	6.5526	NA	yes	NA	yes	0	yes	NA	NA	NA	NA	NA	NA	5	STRONGLY AGREE	2	2	2	2	SUBURB	SUBURB	SUBURB	SUBURB	3	2	2	NA	NA	NA	NA	NA	NA	4	ASSOCIATE'S DEGREE	5	5	4	5	5	3	5	2	NA	NA	5	5	4	5	NA	NA	NA	NA	NA	6	5	5	1
10000009	5.693333	6.197500	NA	NA	Female	-1.7711	-0.6955	NA	NA	-1.9759	-0.3111	NA	NA	0.4266	NA	NA	15	15	NA	NA	NA	NA	NA	NA	NA	NA	no	NA	NA	2	NA	NA	4	AGREE	1	1	NA	NA	CITY	CITY	NA	NA	2	NA	NA	blw35h	NA	NA	2	NA	NA	5	BACHELOR	4	NA	NA	3	NA	NA	NA	NA	NA	NA	4	3	4	5	NA	NA	NA	NA	NA	6	5	4	NA
10000010	5.808333	6.265833	7.378333	8.285000	Male	-1.7137	-1.2014	0.2007	1.0282	-2.0032	-0.6192	0.7475	1.4483	-1.7469	-0.7688	-1.0706	15	16	13	2.8750	yes	yes	yes	yes	1	no	yes	yes	NA	1	1	NA	3	NEITHER AGREE NOR DISAGREE	1	1	1	1	CITY	CITY	CITY	CITY	1	1	1	NA	NA	NA	NA	NA	NA	3	SOME COLLEGE	5	5	5	5	5	4	2	1	2	NA	3	3	4	4	2	3	3	2	2	6	5	5	1

Part 3 - Exploratory data analysis:

Distribution of reading score(G2_2_Math)

summary(reading$G2_2_READ)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  -0.221   1.824   2.231   2.189   2.612   3.829    4337

Histogram

hist(reading$G2_2_READ)

Box plot.

boxplot(reading$G2_2_READ)

Normal probability plot.

qqnorm(reading$G2_2_READ)
qqline(reading$G2_2_READ)

The distribution of the second grade reading scores is nearly normal and slghtly skewed to the right. The lower score deviated from the line in QQ plot. But the sample size is big so the sightly skewness will not be a concern.

Distribution of math score(G2_2_Math)

summary(math$G2_2_Math)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  -2.845   1.940   2.539   2.453   3.055   6.537    4344

Histogram

hist(math$G2_2_Math)

qqnorm(math$G2_2_Math)
qqline(math$G2_2_Math)

The distribution of the second grade math scores is nearly normal and slghtly skewed to the right. There are deviations on both lower- and higher- end from the line in QQ plot. But the sample size is big so the deviations will not be a concern.

3.Distribution of science score(G2_2_SCI)

summary(science$G2_2_SCI)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  -4.490   1.096   1.719   1.587   2.236   5.370    4355

hist(science$G2_2_SCI)

qqnorm(science$G2_2_SCI)
qqline(science$G2_2_SCI)

The distribution of the second grade science scores is nearly normal and slghtly skewed to the right. There are deviations on both lower- and higher- end from the line in QQ plot. But the sample size is big so the deviations will not be a concern.

Distribution of Dimensional Change Card Sort (DCCSSCR) socore.

summary(DCCSSCR$G2_2_DCCSSCR)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   1.625   6.329   6.983   6.689   7.526  10.000    4400

hist(DCCSSCR$G2_2_DCCSSCR)

qqnorm(DCCSSCR$G2_2_DCCSSCR)
qqline(DCCSSCR$G2_2_DCCSSCR)

The distribution of the second grade Dimensional Change Card Sort (DCCSSCR) scores is bimodal. There are strong deviations on lower-end from the line in QQ plot. But the sample size is big so the deviations will not be a concern.

Distribution of unadjusted Dimensional Change Card Sort (DCCSTOT) socore for first grade.

The distribution of second grade DCCS score is not so ideal, I looked into the DCCS score in first grades, in which the game design and the score calculation was different from second grade.

summary(DCCSTOT$G1_2_DCCSTOT)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00   15.00   17.00   16.05   18.00   18.00    3065

hist(DCCSTOT$G1_2_DCCSTOT)

qqnorm(DCCSTOT$G1_2_DCCSTOT)
qqline(DCCSTOT$G1_2_DCCSTOT)

The distribution of the second grade Dimensional Change Card Sort (DCCSTOT) scores is strongly skewd to the right. The distribution is not normal according to the QQ plot. Similar pattern has been seen to kingdergarten DCCSTOT score. So it is not appropriate to use these socores as response variablesare.

The relationship between reading score and math /science/DCCSSCR score.

plot(earlychildhood$G2_2_READ,earlychildhood$G2_2_Math)

cor(earlychildhood$G2_2_READ,earlychildhood$G2_2_Math,use = "complete.obs")

## [1] 0.7328452

The math score increases with the increase of reading score. There is a strong (Pearson's r > 0.6) positive linear realtionship between the math sore and  the reading score.

plot(earlychildhood$G2_2_READ,earlychildhood$G2_2_SCI)

cor(earlychildhood$G2_2_READ,earlychildhood$G2_2_SCI,use = "complete.obs")

## [1] 0.6957673

The science score increases with the increase of reading score. There is a strong (Pearson's r > 0.6) positive linear realtionship between the math sore and  the reading score.

plot(earlychildhood$G2_2_READ,earlychildhood$G2_2_DCCSSCR)

cor(earlychildhood$G2_2_READ,earlychildhood$G2_2_DCCSSCR,use = "complete.obs")

## [1] 0.3928881

The DCCSSCR score slightly increases with the increase of reading score. There is a weak (Pearson's r < 0.4) positive linear realtionship between the math sore and  the reading score.

6.Is there difference between the academy performance between boys and girls?

sub_1 <- earlychildhood[which(earlychildhood$CHSEX != "NA"),]
boxplot(G2_2_READ ~ CHSEX, data=sub_1)

girl <- earlychildhood[which(earlychildhood$CHSEX == 'Female'),]
boy <-  earlychildhood[which(earlychildhood$CHSEX == 'Male'),]
t.test(girl$G2_2_READ,boy$G2_2_READ)

## 
##  Welch Two Sample t-test
## 
## data:  girl$G2_2_READ and boy$G2_2_READ
## t = 13.056, df = 13791, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.1201849 0.1626479
## sample estimates:
## mean of x mean of y 
##  2.262143  2.120727

The medium of the reading score of girls is slightly higher than that of boys. T-test results suggest that there is significant difference between the average reading score of girls and that of boys as p-value < 2.2e-16.

boxplot(G2_2_Math ~ CHSEX, data=sub_1)

t.test(girl$G2_2_Math,boy$G2_2_Math)

## 
##  Welch Two Sample t-test
## 
## data:  girl$G2_2_Math and boy$G2_2_Math
## t = -5.7726, df = 13694, p-value = 7.976e-09
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.1075887 -0.0530442
## sample estimates:
## mean of x mean of y 
##  2.413426  2.493743

T-test results suggest that there is siglthy but significant difference between the average math score of girls and that of boys. But I think the difference is so small that we could say that the average math score of girls and boys are about the same.

boxplot(G2_2_SCI ~ CHSEX, data=sub_1)

t.test(girl$G2_2_SCI,boy$G2_2_SCI)

## 
##  Welch Two Sample t-test
## 
## data:  girl$G2_2_SCI and boy$G2_2_SCI
## t = -4.124, df = 13793, p-value = 3.745e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.09743361 -0.03465322
## sample estimates:
## mean of x mean of y 
##  1.553854  1.619897

T-test results suggest that there is siglthy but significant difference between the average science score of girls and that of boys. But I think the difference is so small that we could say that the average science score of girls and boys are about the same.

boxplot(G2_2_DCCSSCR ~ CHSEX, data=sub_1)

t.test(girl$G2_2_DCCSSCR,boy$G2_2_DCCSSCR)

## 
##  Welch Two Sample t-test
## 
## data:  girl$G2_2_DCCSSCR and boy$G2_2_DCCSSCR
## t = 8.0041, df = 13618, p-value = 1.301e-15
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.1387685 0.2287779
## sample estimates:
## mean of x mean of y 
##  6.783344  6.599571

The medium of the DCCSSCR score of girls is slightly higher than that of boys. T-test results suggest that there is significant difference between the average DCCSSCR score of girls and that of boys.

Part 4 - Inference:

Simple linear regression

Parent involvment related to early literacy is believed to have positive effects on children's academy performance.

1.1 First we could create a scatterplot to see if there is positive correlation between the parents' opinion on whether parents help in children's homework to the reading score:

sub_2 <- earlychildhood[which(earlychildhood$Prthw != "NA"),]
sub_2$Prthw <- as.numeric(sub_2$Prthw)
plot(sub_2$G2_2_READ ~ sub_2$Prthw)

m_Prthw <- lm(sub_2$G2_2_READ ~ sub_2$Prthw)
m_Prthw

## 
## Call:
## lm(formula = sub_2$G2_2_READ ~ sub_2$Prthw)
## 
## Coefficients:
## (Intercept)  sub_2$Prthw  
##     2.47035     -0.06722

the equation for the linear model:
$$\widehat{score_{reading}} = 2.470354-0.06722\times Prthw $$

plot(jitter(sub_2$G2_2_READ, factor= 1.2) ~ jitter(sub_2$Prthw, factor=1.2))+
abline(m_Prthw)

## numeric(0)

cor(sub_2$G2_2_READ, sub_2$Prthw, use = "complete.obs")

## [1] -0.1001445

summary(m_Prthw)

## 
## Call:
## lm(formula = sub_2$G2_2_READ ~ sub_2$Prthw)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.42257 -0.36007  0.03647  0.41743  1.69505 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.470347   0.026311   93.89   <2e-16 ***
## sub_2$Prthw -0.067219   0.006213  -10.82   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6326 on 11555 degrees of freedom
##   (3650 observations deleted due to missingness)
## Multiple R-squared:  0.01003,    Adjusted R-squared:  0.009943 
## F-statistic: 117.1 on 1 and 11555 DF,  p-value: < 2.2e-16

The average reading score will decrease 0.067219 for every one point increase in the parents' attitute on whether parents should help children's homework. 

p-value is < 2.2e-16, which is less than 0.05 ,suggesting the parents' attitute on whether parents should help children' homework is a statistically significant predictor. It may not be a practically significant predictor because there is very weak correlation between the average reading score and the parents' attitute on whether parents should help children' homework with Multiple R-squared equals to 0.01003. For every one point increase in the parents' attitute on whether parents should help children' homework, the model only predicts an decrease of  0.067219 in the average reading score, which barely changes the score.

1.2 Use residual plots to evaluate whether the conditions of least squares regression are reasonable.

yx.res <- resid(m_Prthw, na.rm=T)

par(mfrow = c(1,2))
hist(yx.res, xlab="Residuals", breaks = 10)

sub_2 <- sub_2[which(sub_2$G2_2_READ != "NA"), ]
sub_2$Prthw <- as.numeric(sub_2$Prthw)
a <- (2.470354-0.06722*sub_2$Prthw)
plot((2.470354-0.06722*sub_2$Prthw), yx.res, ylab="Residuals", xlab="fitted values", main="Parents homework help") 

abline(0, 0)

Following conditions were checked to evaluate whether the conditions of least squares regression are reasonable:

Linearity:the variable parent should help homework is linearly related to the reading score.

Nearly normal residuals:The distribution of residuals normal.

Constant variabilities:The variance around the line is constant.

Independent observation:Each student's parent's opinion on whether parent should help children's homework is indenpendent to each other.

2.1 Then let's see if there is positive correlation between the summer reading time to the reading score:

sub_4 <- earlychildhood[which(earlychildhood$Sumrd != "NA"),]
sub_4$Sumrd <- as.numeric (sub_4$Sumrd)
m_sumrd <- lm(sub_4$G2_2_READ ~ sub_4$Sumrd)
m_sumrd

## 
## Call:
## lm(formula = sub_4$G2_2_READ ~ sub_4$Sumrd)
## 
## Coefficients:
## (Intercept)  sub_4$Sumrd  
##       1.805        0.117

the equation for the linear model:
$$\widehat{score_{reading}} = 1.805 + 0.117\times Sumrd $$

plot(jitter(sub_4$G2_2_READ, factor= 1.2) ~ jitter(sub_4$Sumrd, factor=1.2))+
abline(m_sumrd)

## numeric(0)

cor(sub_4$G2_2_READ, sub_4$Sumrd,use = "complete.obs")

## [1] 0.1482586

summary(m_sumrd)

## 
## Call:
## lm(formula = sub_4$G2_2_READ ~ sub_4$Sumrd)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.47357 -0.37566  0.02997  0.43454  1.67382 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.80453    0.03851  46.855   <2e-16 ***
## sub_4$Sumrd  0.11699    0.01176   9.947   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6592 on 4402 degrees of freedom
##   (548 observations deleted due to missingness)
## Multiple R-squared:  0.02198,    Adjusted R-squared:  0.02176 
## F-statistic: 98.93 on 1 and 4402 DF,  p-value: < 2.2e-16

The average reading score will increase 0.11699 for every one point increase in summer reading. 

p-value is < 2.2e-16, which is less than 0.05 ,suggesting the summer reading is a statistically significant predictor. It may not be a practically significant predictor because there is very weak correlation between the average reading score and the summer reading with Multiple R-squared equals to 0.02198. For every one point increase in summer reading, the model only predicts an increase of  0.11699 in the average reading score, which barely changes the score.

2.2 Use residual plots to evaluate whether the conditions of least squares regression are reasonable.

yx.res <- resid(m_sumrd, na.rm=T)

par(mfrow = c(1,2))
hist(yx.res, xlab="Residuals", breaks = 10)

sub_4 <- sub_4[which(sub_4$G2_2_READ != "NA"), ]
sub_4$Sumrd <- as.numeric(sub_4$Sumrd)
a <- (1.805 + 0.117*sub_4$Sumrd)
plot((1.805 + 0.117*sub_4$Sumrd), yx.res, ylab="Residuals", xlab="fitted values", main="Summer reading") 

abline(0, 0)

Following conditions were checked to evaluate whether the conditions of least squares regression are reasonable:

Linearity:the variable parent should help homework is linearly related to the reading score.

Nearly normal residuals:The distribution of residuals normal.

Constant variabilities:The variance around the line is constant.

Independent observation:Each student's parent's opinion on whether parent should help children's homework is indenpendent to each other.

3.Is there positive correlation between the summer reading time to the reading score?

sub_5 <- earlychildhood[which(earlychildhood$Sumreadtime != "NA"),]
sub_5$Sumreadtime <- as.numeric (sub_5$Sumreadtime)
m_sumreadtime <- lm(sub_5$G2_2_READ ~ sub_5$Sumreadtime)
m_sumreadtime

## 
## Call:
## lm(formula = sub_5$G2_2_READ ~ sub_5$Sumreadtime)
## 
## Coefficients:
##       (Intercept)  sub_5$Sumreadtime  
##           2.13244            0.02566

the equation for the linear model:
$$\widehat{score_{readingtime}} = 2.13244 + 0.02566\times Sumreadtime $$

plot(jitter(sub_5$G2_2_READ, factor= 1.2) ~ jitter(sub_5$Sumreadtime, factor=1.2))+
abline(m_sumreadtime)

## numeric(0)

cor(sub_5$G2_2_READ, sub_5$Sumreadtime,use = "complete.obs")

## [1] 0.03000113

summary(m_sumreadtime)

## 
## Call:
## lm(formula = sub_5$G2_2_READ ~ sub_5$Sumreadtime)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.38487 -0.39157  0.02553  0.43288  1.67119 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        2.13244    0.02522  84.563   <2e-16 ***
## sub_5$Sumreadtime  0.02566    0.01306   1.965   0.0495 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6636 on 4284 degrees of freedom
##   (533 observations deleted due to missingness)
## Multiple R-squared:  0.0009001,  Adjusted R-squared:  0.0006669 
## F-statistic: 3.859 on 1 and 4284 DF,  p-value: 0.04953

The average reading score will increase 0.02566 for every one point increase in summer reading time. 

p-value is 0.04953, which is less than 0.05 ,suggesting the summer reading time is a statistically significant predictor. It may not be a practically significant predictor because there is very weak correlation between the average reading score and the summer reading time with Multiple R-squared equals to 0.0009001. For every one point increase in summer reading time, the model only predicts an increase of 0.02566 in the average reading score, which barely changes the score.

3.2 Use residual plots to evaluate whether the conditions of least squares regression are reasonable.

yx.res <- resid(m_sumreadtime, na.rm=T)

par(mfrow = c(1,2))
hist(yx.res, xlab="Residuals", breaks = 10)

sub_5 <- sub_5[which(sub_5$G2_2_READ != "NA"), ]
sub_5$Sumreadtime <- as.numeric(sub_5$Sumreadtime)
a <- (2.13244 + 0.02566*sub_5$Sumreadtime)
plot((2.13244 + 0.02566*sub_5$Sumreadtime), yx.res, ylab="Residuals", xlab="fitted values", main="Summer reading time") 

abline(0, 0)

Following conditions were checked to evaluate whether the conditions of least squares regression are reasonable:

Linearity:the variable parent should help homework is linearly related to the reading score.

Nearly normal residuals:The distribution of residuals normal.

Constant variabilities:The variance around the line is constant.

Independent observation:Each student's parent's opinion on whether parent should help children's homework is indenpendent to each other.

4.1 Is there positive correlation between the parent volenteer time at school to the reading score?

sub_6 <- earlychildhood[which(earlychildhood$G2_2_Prthlp != "NA"),]
sub_6$G2_2_Prthlp <- as.numeric (sub_6$G2_2_Prthlp)
m_G2_2_Prthlp <- lm(sub_6$G2_2_READ ~ sub_6$G2_2_Prthlp)
m_G2_2_Prthlp

## 
## Call:
## lm(formula = sub_6$G2_2_READ ~ sub_6$G2_2_Prthlp)
## 
## Coefficients:
##       (Intercept)  sub_6$G2_2_Prthlp  
##            1.9003             0.1356

the equation for the linear model:
$$\widehat{score_{prthlp}} = 1.9003 + 0.1356\times Prthlp $$

plot(jitter(sub_6$G2_2_READ, factor= 1.2) ~ jitter(sub_6$G2_2_Prthlp, factor=1.2))+
abline(m_G2_2_Prthlp)

## numeric(0)

cor(sub_6$G2_2_Prthlp, sub_6$G2_2_READ,use = "complete.obs")

## [1] 0.2006117

summary(m_G2_2_Prthlp)

## 
## Call:
## lm(formula = sub_6$G2_2_READ ~ sub_6$G2_2_Prthlp)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.66368 -0.35602  0.03544  0.41270  1.70060 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1.90034    0.01436  132.36   <2e-16 ***
## sub_6$G2_2_Prthlp  0.13556    0.00598   22.67   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6245 on 12255 degrees of freedom
##   (145 observations deleted due to missingness)
## Multiple R-squared:  0.04025,    Adjusted R-squared:  0.04017 
## F-statistic: 513.9 on 1 and 12255 DF,  p-value: < 2.2e-16

The average reading score will increase 0.13556 for every one point increase in parents's volunteer time at school. 

p-value is < 2.2e-16, which is less than 0.05 ,suggesting the parents's volunteer time at school is a statistically significant predictor. It may not be a practically significant predictor because there is very weak correlation between the average reading score and the parents's volunteer time at school with Multiple R-squared equals to 0.04025. For every one point increase in parents's volunteer time at school, the model only predicts an increase of 0.13556 in the average reading score, which barely changes the score.

4.2 Use residual plots to evaluate whether the conditions of least squares regression are reasonable.

yx.res <- resid(m_G2_2_Prthlp, na.rm=T)

par(mfrow = c(1,2))
hist(yx.res, xlab="Residuals", breaks = 10)

sub_6 <- sub_6[which(sub_6$G2_2_READ != "NA"), ]
sub_6$G2_2_Prthlp <- as.numeric(sub_6$G2_2_Prthlp)
a <- (1.9003 + 0.1356*sub_6$G2_2_Prthlp)
plot((1.9003 + 0.1356*sub_6$G2_2_Prthlp), yx.res, ylab="Residuals", xlab="fitted values", main="Parent school helping") 

abline(0, 0)

Following conditions were checked to evaluate whether the conditions of least squares regression are reasonable:

Linearity:the variable parent should help homework is linearly related to the reading score.

Nearly normal residuals:The distribution of residuals normal.

Constant variabilities:The variance around the line is constant.

Independent observation:Each student's parent's opinion on whether parent should help children's homework is indenpendent to each other.

Multiple linear regression

For the convenience of analysis, I extract the second grade academic outcomes and related variables.

# subset second grade and related vaviables
g2read <- c("CHILDID","G2_2_AGE","CHSEX","G2_2_READ","G2_2_Math","G2_2_SCI", "G2_2_DCCSSCR","G2_2_Sep1Cut","G2_2_CLSIt", "PreK", "G2_2_LOC",  "G2_2_Prthlp", "G2_2_PrtWt","PrtEDU", "G2_2_Prtconf",  "G2_2_Prtoph", "G2_2_Prtevt", "G2_2_Prtpsk", "Premath"  ,"Prelitr" , "Prthw", "Prerc" ,"Sumsch","Summth", "Sumwrt", "Sumrd", "Sumreadtime","teachedu","teachjoy","teachdiff","teachboard")
sub_g2read <- subset(earlychildhood,,g2read)

sub_g2read$G2_2_CLSIt <- as.numeric(sub_g2read$G2_2_CLSIt)

## Warning: NAs introduced by coercion

sub_g2read$PreK <- as.numeric(sub_g2read$PreK)

## Warning: NAs introduced by coercion

sub_g2read$G2_2_LOC <- as.numeric(sub_g2read$G2_2_LOC)

## Warning: NAs introduced by coercion

sub_g2read$G2_2_Prthlp <- as.numeric(sub_g2read$G2_2_Prthlp)

## Warning: NAs introduced by coercion

sub_g2read$G2_2_PrtWt <- as.numeric(sub_g2read$G2_2_PrtWt)

## Warning: NAs introduced by coercion

sub_g2read$PrtEDU <-as.numeric(sub_g2read$PrtEDU)

## Warning: NAs introduced by coercion

sub_g2read$G2_2_Prtconf <-as.numeric(sub_g2read$G2_2_Prtconf)

## Warning: NAs introduced by coercion

sub_g2read$G2_2_Prtoph <- as.numeric(sub_g2read$G2_2_Prtoph)

## Warning: NAs introduced by coercion

sub_g2read$G2_2_Prtevt <-as.numeric(sub_g2read$G2_2_Prtevt)
sub_g2read$G2_2_Prtpsk <- as.numeric(sub_g2read$G2_2_Prtpsk)
sub_g2read$Premath <- as.numeric(sub_g2read$Premath)

## Warning: NAs introduced by coercion

sub_g2read$Prelitr <- as.numeric(sub_g2read$Prelitr)

## Warning: NAs introduced by coercion

sub_g2read$Prthw <- as.numeric(sub_g2read$Prthw)

## Warning: NAs introduced by coercion

sub_g2read$Prerc <- as.numeric(sub_g2read$Prerc)

## Warning: NAs introduced by coercion

sub_g2read$Sumsch <- as.numeric(sub_g2read$Prerc)
sub_g2read$Summth <- as.numeric(sub_g2read$Summth)

## Warning: NAs introduced by coercion

sub_g2read$Sumwrt <- as.numeric(sub_g2read$Sumwrt)

## Warning: NAs introduced by coercion

sub_g2read$Sumrd <- as.numeric(sub_g2read$Sumrd)

## Warning: NAs introduced by coercion

sub_g2read$Sumreadtime <- as.numeric(sub_g2read$Sumreadtime)

## Warning: NAs introduced by coercion

Q1.1. Do parent involvement in preschool education influence child early academy outcomes?

Miedel and Reynolds (1999) detected positive associations between parent involvement in preschool and kindergarten and reading achievement in kindergarten and in eighth grade. I want to see whether parent involvement in child's preschool literacy education associates to second grade academic outcomes.

The data set contains several variables on the parent involvement related to early literacy: children go to preschool before K (G2_2_Prtpsk), preschool read/math good for school (Premath), have child know alphabet before K (Prelitr), parent provide homework time (Prthw), parent should read and count with child(Prerc).

Q1.1.1.First look at the relationship between one of these variables and reading score.

plot(sub_g2read$G2_2_READ ~ sub_g2read$G2_2_Prtpsk)

cor(sub_g2read$G2_2_READ, sub_g2read$G2_2_Prtpsk,use = "complete.obs")

## [1] 0.09137424

There is very weak relationship between children go to preschool before K and reading score.

cor(sub_g2read$G2_2_READ, sub_g2read$Premath,use = "complete.obs")

## [1] -0.09509362

cor(sub_g2read$G2_2_READ, sub_g2read$Prelitr,use = "complete.obs")

## [1] -0.04224958

cor(sub_g2read$G2_2_READ, sub_g2read$Prthw,use = "complete.obs")

## [1] -0.1001445

cor(sub_g2read$G2_2_READ, sub_g2read$Prerc,use = "complete.obs")

## [1] 0.003368523

There are very weak relationship (positive or negative) between all varaibles on parent involvement related to preschool literacy. All relationships between all varaibles on parent involvement related to early literacy could be seen by using the following command:

g2read_prt <- c("G2_2_READ","G2_2_Prtpsk", "Premath"  ,"Prelitr" , "Prthw", "Prerc")
sub_g2read_prt <- subset(sub_g2read,,g2read_prt)
plot(sub_g2read_prt[,1:6])

Q1.1.2.Search for the best model

I will start with a full model that predicts reading score based on children go to preschool before K (G2_2_Prtpsk), preschool read/math good for school (Premath), have child know alphabet before K (Prelitr), parent provide homework time (Prthw), and parent should read and count with child(Prerc).

sub_q1 <- sub_g2read[!is.na(sub_g2read$G2_2_Prtpsk),]  
sub_q1 <- sub_q1[!is.na(sub_q1$Premath),]  
sub_q1 <- sub_q1[!is.na(sub_q1$Prelitr),] 
sub_q1 <- sub_q1[!is.na(sub_q1$Prthw),]
sub_q1 <- sub_q1[!is.na(sub_q1$Prerc),] 
sub_q1 <- sub_q1[!is.na(sub_q1$G2_2_READ),]

m_full <- lm(G2_2_READ ~ G2_2_Prtpsk + Premath + Prelitr + Prthw + Prerc, data = sub_q1)
summary(m_full)

## 
## Call:
## lm(formula = G2_2_READ ~ G2_2_Prtpsk + Premath + Prelitr + Prthw + 
##     Prerc, data = sub_q1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.35048 -0.35064  0.04447  0.40997  1.78065 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.018191   0.101270  19.929  < 2e-16 ***
## G2_2_Prtpsk  0.137998   0.019098   7.226 5.54e-13 ***
## Premath     -0.031712   0.009440  -3.359 0.000786 ***
## Prelitr      0.002319   0.009812   0.236 0.813161    
## Prthw       -0.043327   0.009802  -4.420 1.00e-05 ***
## Prerc        0.028466   0.019591   1.453 0.146272    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6277 on 6639 degrees of freedom
## Multiple R-squared:  0.01746,    Adjusted R-squared:  0.01672 
## F-statistic:  23.6 on 5 and 6639 DF,  p-value: < 2.2e-16

Using backward-selection and p-value as the selection criterion, determine the best model. Drop the variable with the highest p-value, which are PreLitr and Prerc, and re-fit the model.

m_backward <- lm(G2_2_READ ~ G2_2_Prtpsk + Premath + Prthw, data = sub_q1)
summary(m_backward)

## 
## Call:
## lm(formula = G2_2_READ ~ G2_2_Prtpsk + Premath + Prthw, data = sub_q1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.35153 -0.35053  0.04417  0.41251  1.78889 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.143248   0.055338  38.730  < 2e-16 ***
## G2_2_Prtpsk  0.140231   0.019032   7.368 1.94e-13 ***
## Premath     -0.030383   0.008587  -3.538 0.000405 ***
## Prthw       -0.040514   0.009435  -4.294 1.78e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6277 on 6641 degrees of freedom
## Multiple R-squared:  0.01714,    Adjusted R-squared:  0.01669 
## F-statistic:  38.6 on 3 and 6641 DF,  p-value: < 2.2e-16

After dropping PreLitr and Prerc, the coefficients and significance of the other variables were unchanged, suggesting that the dropped variable was not collinear with other variables.

linear model:
$$\widehat{Score}_{reading} = 2.143248 + 0.140231\times G2_2_Prtpsk - 0.030383\times Premath - 0.040514\times Prthw $$
3.Verify that the conditions for this model are reasonable using diagnostic plots.

3.1 Normal probability plot.

qqnorm(m_backward$residuals)
qqline(m_backward$residuals)

The residuals of the model are nearly normal as shown in the QQ plot. While there are a few observations that deviate noticeably from the line, they are not particularly extreme.

Q1.1.3.2 Absolute values of residuals against fitted values ($\hat{y_i}$).

fitted_backward <- 2.137331 + 0.139253* sub_q1$G2_2_Prtpsk -0.029061 * sub_q1$Premath - 0.040143 * sub_q1$Prthw

plot(round(fitted_backward,1),m_backward$residuals,ylab="Absolute value of residuals", xlab="Fitted values")

The plot shows that the variance of the residuals is approximately constant.

Residuals in order of their data collection is not applicable in this data set because we second grade scores are collected at the same time.

Q1.1.3.3 Residuals against each predictor variable.

boxplot(m_backward$residuals~G2_2_Prtpsk,data=sub_q1,ylab="Residuals", main="G2_2_Prtpsk")

plot(m_backward$residuals~Premath,data=sub_q1,ylab="Residuals", main="Premath")

plot(m_backward$residuals~Prthw,data=sub_q1,ylab="Residuals", main="Prthw")

The plot shows that the variance of the residuals is approximately constant except there are some deviations at low fitted values.

Based on the multiple linear model, the student who went to preschool before K and whose parent believe that preschool read/math is good for school and provide homework time to children will have higher reading score. But overall the correlation is weak (Multiple R-squared:  0.01714).

Q1.2. Do parent involvement in school activities influence child early academy outcomes?

School encourage parent involvement by inviting parents to participate in activities at school such as school open house, general school meetingand, regularly scheduled parent teacher meetings, and parent volunteering in the classroom etc. and also encourage facilitating parent-teacher communication. In this project, I will study contemporaneous association between parent involvement in school activities and reading achievement.

Q1.2.1.Search for the best model

full model: predicts reading score based on parent volenteering at school(G2_2_Prthlp), parent attending school conference(G2_2_Prtconf), parent attending school open house(G2_2_Prtoph), and parent attending school art/music events(G2_2_Prtevt).

sub_q1 <- sub_g2read[!is.na(sub_g2read$G2_2_Prthlp),]  
sub_q1 <- sub_q1[!is.na(sub_q1$G2_2_Prtconf),]  
sub_q1 <- sub_q1[!is.na(sub_q1$G2_2_Prtoph),] 
sub_q1 <- sub_q1[!is.na(sub_q1$G2_2_Prtevt),]
sub_q1 <- sub_q1[!is.na(sub_q1$G2_2_READ),]

m_full <- lm(G2_2_READ ~ G2_2_Prthlp + G2_2_Prtconf + G2_2_Prtoph + G2_2_Prtevt, data = sub_q1)
summary(m_full)

## 
## Call:
## lm(formula = G2_2_READ ~ G2_2_Prthlp + G2_2_Prtconf + G2_2_Prtoph + 
##     G2_2_Prtevt, data = sub_q1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.51444 -0.35236  0.03851  0.41254  1.72053 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.526989   0.028077  54.386  < 2e-16 ***
## G2_2_Prthlp  0.082554   0.006725  12.276  < 2e-16 ***
## G2_2_Prtconf 0.051295   0.006656   7.706 1.40e-14 ***
## G2_2_Prtoph  0.031835   0.006369   4.998 5.87e-07 ***
## G2_2_Prtevt  0.042076   0.005493   7.660 2.00e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6171 on 12103 degrees of freedom
## Multiple R-squared:  0.06441,    Adjusted R-squared:  0.0641 
## F-statistic: 208.3 on 4 and 12103 DF,  p-value: < 2.2e-16

All variables are significant and there is no need to do backward-selection.

linear model:
$$\widehat{Score}_{reading} = 1.526989 + 0.082554\times G2_2_Prthlp + 0.051295\times G2_2_Prtconf + 0.031835\times G2_2_Prtoph + 0.042076\times G2_2_Prtevt$$

Q1.2.2.Verify that the conditions for this model are reasonable using diagnostic plots.

Q1.2.2.1 Normal probability plot.

qqnorm(m_full$residuals)
qqline(m_full$residuals)

The residuals of the model are nearly normal as shown in the QQ plot. While there are a few observations that deviate noticeably from the line, they are not particularly extreme.

Q1.2.2.2 Absolute values of residuals against fitted values ($\hat{y_i}$).

fitted_full <- 1.526989 + 0.082554*sub_q1$G2_2_Prthlp + 0.051295*sub_q1$G2_2_Prtconf + 0.0318358 *sub_q1$G2_2_Prtoph + 0.042076*sub_q1$G2_2_Prtevt

plot(round(fitted_full,1),m_full$residuals,ylab="Absolute value of residuals", xlab="Fitted values")

Q1.2.2.3 Residuals against each predictor variable.

plot(m_full$residuals~G2_2_Prthlp,data=sub_q1,ylab="Residuals", main="G2_2_Prthlp")

plot(m_full$residuals~G2_2_Prtconf,data=sub_q1,ylab="Residuals", main="G2_2_Prtconf")

plot(m_full$residuals~G2_2_Prtoph,data=sub_q1,ylab="Residuals", main="G2_2_Prtoph")

plot(m_full$residuals~G2_2_Prtevt,data=sub_q1,ylab="Residuals", main="G2_2_Prtevt")

The plots show that the variance of the residuals is approximately constant while there are some deviations at high (G2_2_Prthlp) and low (G2_2_Prtconf,G2_2_Prtoph) fitted values.

Based on the multiple linear model, the student whose parent actively volunteering at shcool, attending school conference, attending school open house, and attending school art/music events will have higher reading score. But overall the correlation is weak (Multiple R-squared:  0.06441).

Q1.3. Do parent involvement in summer learning influence child early academy outcomes?

Schacter and Jo demonstrated that summer learning improve the achievement of economically disadvantaged first graders.

Q1.3.1.Search for the best model

full model: predicts reading score based on child attending summer school (Sumsch), summer school math (Summth), doing wrting activity with child in summer(Sumwrt), summer school reading (Sumrd), how long read to child in summer (Sumreadtime).

sub_q1 <- sub_g2read[!is.na(sub_g2read$Sumsch),]  
sub_q1 <- sub_q1[!is.na(sub_q1$Summth),]  
sub_q1 <- sub_q1[!is.na(sub_q1$Sumwrt),] 
sub_q1 <- sub_q1[!is.na(sub_q1$Sumrd),]
sub_q1 <- sub_q1[!is.na(sub_q1$Sumreadtime),]
sub_q1 <- sub_q1[!is.na(sub_q1$G2_2_READ),]

m_full <- lm(G2_2_READ ~ Sumsch + Summth + Sumwrt + Sumrd +Sumreadtime, data = sub_q1)
summary(m_full)

## 
## Call:
## lm(formula = G2_2_READ ~ Sumsch + Summth + Sumwrt + Sumrd + Sumreadtime, 
##     data = sub_q1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.25695 -0.36256  0.03253  0.42795  1.63264 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.97223    0.14247  13.843  < 2e-16 ***
## Sumsch      -0.01305    0.02718  -0.480  0.63123    
## Summth      -0.02394    0.01583  -1.512  0.13061    
## Sumwrt      -0.04695    0.01483  -3.166  0.00156 ** 
## Sumrd        0.13134    0.01485   8.846  < 2e-16 ***
## Sumreadtime  0.02392    0.01444   1.657  0.09767 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6468 on 3554 degrees of freedom
## Multiple R-squared:  0.024,  Adjusted R-squared:  0.02262 
## F-statistic: 17.48 on 5 and 3554 DF,  p-value: < 2.2e-16

Using backward-selection and p-value as the selection criterion, determine the best model. Drop the variable with the highest p-value, which are Sumsch, Summth and Sumreadtime, and re-fit the model.

m_backward <- lm(G2_2_READ ~ Sumwrt + Sumrd, data = sub_q1)
summary(m_backward)

## 
## Call:
## lm(formula = G2_2_READ ~ Sumwrt + Sumrd, data = sub_q1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.26199 -0.35991  0.03373  0.42663  1.63298 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.91269    0.05125  37.324  < 2e-16 ***
## Sumwrt      -0.05484    0.01325  -4.140 3.55e-05 ***
## Sumrd        0.13111    0.01466   8.941  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6469 on 3557 degrees of freedom
## Multiple R-squared:  0.02267,    Adjusted R-squared:  0.02212 
## F-statistic: 41.26 on 2 and 3557 DF,  p-value: < 2.2e-16

After dropping PreLitr and Prerc, the coefficients and significance of Sumrd was unchanged but Sumwrt became much smaller, suggesting that the dropped variable was collinear with Sumwrt.

linear model:
$$\widehat{Score}_{reading} = 1.91269 -0.05484\times Sumwrt +0.13111\times Sumrd $$
Q1.3.3.Verify that the conditions for this model are reasonable using diagnostic plots.

Q1.3.3.1 Normal probability plot.

qqnorm(m_backward$residuals)
qqline(m_backward$residuals)

The residuals of the model are nearly normal as shown in the QQ plot. While there are a few observations that deviate noticeably from the line, they are not particularly extreme.

Q1.3.3.2 Absolute values of residuals against fitted values ($\hat{y_i}$).

fitted_backward <- 1.91269 - 0.05484* sub_q1$Sumwrt + 0.13111* sub_q1$Sumrd

plot(round(fitted_backward,1),m_backward$residuals,ylab="Absolute value of residuals", xlab="Fitted values")

The plot shows that the variance of the residuals is approximately constant.

Q1.3.3.3 Residuals against each predictor variable.

plot(m_backward$residuals~Sumwrt,data=sub_q1,ylab="Residuals", main="Sumwrt")

plot(m_backward$residuals~Sumrd,data=sub_q1,ylab="Residuals", main="Sumrd")

The plot shows that the variance of the residuals is approximately constant.

Based on the multiple linear model, the student who do reading and writing activities will have higher reading score. But overall the correlation is weak (Multiple R-squared:  0.13111).

Q2. Does school have a rule of that student enter kindergarten must turn 5 before September 1st have effects on children's academic performance in the early childhood?

# In order to investigate the correlation between age and reading skill. the reading score data from first semester(K1), 2nd semester(K2), 4th semester(grade 1_2), 6th seester(grade2_2), and the data about September cutoff will be further cleaned. filter cases that don't have "NA".

# remove rows with NA in reading score
earlychildhood1 <- earlychildhood[!is.na(earlychildhood$K1_READ),]
earlychildhood2 <- earlychildhood[!is.na(earlychildhood$K2_READ),]
earlychildhood3 <- earlychildhood[!is.na(earlychildhood$G1_2_READ),]
earlychildhood4 <- earlychildhood[!is.na(earlychildhood$G2_2_READ),]

# remove rows with NA in September 1st cutoff  
earlychildhood5 <- earlychildhood1[!is.na(earlychildhood1$G2_2_Sep1Cut_t),]
earlychildhood6 <- earlychildhood2[!is.na(earlychildhood2$G2_2_Sep1Cut_t),]
earlychildhood7 <- earlychildhood3[!is.na(earlychildhood3$G2_2_Sep1Cut_t),]
earlychildhood8 <- earlychildhood4[!is.na(earlychildhood4$G2_2_Sep1Cut_t),]

# remove rows with NA in PrtEDU
earlychildhood9 <- filter(earlychildhood1, PrtEDU != "NA")
earlychildhood10 <- filter(earlychildhood2, PrtEDU != "NA")
earlychildhood11 <- filter(earlychildhood3, PrtEDU != "NA")
earlychildhood12 <- filter(earlychildhood4, PrtEDU != "NA")

# statistic for explanatory variables
summary(earlychildhood5$G2_2_Sep1Cut)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.3836  1.0000  1.0000

summary(earlychildhood6$G2_2_Sep1Cut)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.3825  1.0000  1.0000

summary(earlychildhood7$G2_2_Sep1Cut)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.3821  1.0000  1.0000

summary(earlychildhood8$G2_2_Sep1Cut)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.3812  1.0000  1.0000

summary(earlychildhood9$PrtEDU)

##    Length     Class      Mode 
##     14509 character character

summary(earlychildhood10$PrtEDU)

##    Length     Class      Mode 
##     14389 character character

summary(earlychildhood11$PrtEDU)

##    Length     Class      Mode 
##     12580 character character

summary(earlychildhood12$PrtEDU)

##    Length     Class      Mode 
##     11529 character character

# statistic for response variable
summary(earlychildhood2$K2_READ)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -3.0000  0.0005  0.4950  0.4473  0.9352  2.9780

summary(earlychildhood2$K2_Math)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
## -5.86600  0.02135  0.52520  0.44730  0.95060  2.88200       43

summary(earlychildhood2$K2_SCI)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
## -2.40900 -0.56000  0.11450 -0.00694  0.65920  1.89000      249

summary(earlychildhood2$K2_DCCSTOT)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00   15.00   15.00   15.15   17.00   18.00      43

summary(earlychildhood1$K1_READ)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -3.0850 -1.1360 -0.5695 -0.5422 -0.0060  2.9780

summary(earlychildhood1$K1_Math)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
## -5.8600 -1.0430 -0.3897 -0.4910  0.1369  5.3200     110

summary(earlychildhood1$K1_DCCSTOT)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00   14.00   15.00   14.23   16.00   18.00     120

# example for summary of 1st semester reading score

# subest 2nd semester reading score and september 1st cutoff for statistic analysis
k2read <- data.frame("CHILDID"=earlychildhood5$CHILDID,"K2_READ"=earlychildhood5$K2_READ,"G2_2_Sep1Cut_t"=earlychildhood5$G2_2_Sep1Cut_t)

k2read <- k2read %>% mutate_if(is.factor, as.character)  

# do the statistic on 2nd semester science score 
stat <- k2read %>%  group_by(G2_2_Sep1Cut_t) %>% summarise (mean=mean(K2_READ,na.rm=T),sd(K2_READ,na.rm=T), median=median(K2_READ,na.rm=T), min=min(K2_READ,na.rm=T),max=max(K2_READ,na.rm=T))

ggplot(data=subset(k2read,!is.na(k2read$K2_READ)),aes(factor(G2_2_Sep1Cut_t),K2_READ))+
  geom_boxplot()+
  xlab("September 1st Cut Off for Kindergarten") + 
  ylab("2011 Spring Kindergarten Reading Score")

Sep1cutyes <- earlychildhood[which(earlychildhood$G2_2_Sep1Cut_t == "yes"),]
Sep1cutno <-  earlychildhood[which(earlychildhood$G2_2_Sep1Cut_t == 'no'),]
t.test(Sep1cutyes$G2_2_READ,Sep1cutyes$G2_2_READ)

## 
##  Welch Two Sample t-test
## 
## data:  Sep1cutyes$G2_2_READ and Sep1cutyes$G2_2_READ
## t = 0, df = 15414, p-value = 1
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.02029484  0.02029484
## sample estimates:
## mean of x mean of y 
##  2.168926  2.168926

Sep1cutyes <- earlychildhood[which(earlychildhood$G2_2_Sep1Cut_t == "yes"),]
Sep1cutno <-  earlychildhood[which(earlychildhood$G2_2_Sep1Cut_t == 'no'),]
t.test(Sep1cutyes$G2_2_Math,Sep1cutyes$G2_2_Math)

## 
##  Welch Two Sample t-test
## 
## data:  Sep1cutyes$G2_2_Math and Sep1cutyes$G2_2_Math
## t = 0, df = 15408, p-value = 1
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.02602507  0.02602507
## sample estimates:
## mean of x mean of y 
##  2.425733  2.425733

Sep1cutyes <- earlychildhood[which(earlychildhood$G2_2_Sep1Cut_t == "yes"),]
Sep1cutno <-  earlychildhood[which(earlychildhood$G2_2_Sep1Cut_t == 'no'),]
t.test(Sep1cutyes$G2_2_SCI,Sep1cutyes$G2_2_SCI)

## 
##  Welch Two Sample t-test
## 
## data:  Sep1cutyes$G2_2_SCI and Sep1cutyes$G2_2_SCI
## t = 0, df = 15400, p-value = 1
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.03009313  0.03009313
## sample estimates:
## mean of x mean of y 
##  1.569671  1.569671

Sep1cutyes <- earlychildhood[which(earlychildhood$G2_2_Sep1Cut_t == "yes"),]
Sep1cutno <-  earlychildhood[which(earlychildhood$G2_2_Sep1Cut_t == 'no'),]
t.test(Sep1cutyes$G2_2_DCCSSCR,Sep1cutyes$G2_2_DCCSSCR)

## 
##  Welch Two Sample t-test
## 
## data:  Sep1cutyes$G2_2_DCCSSCR and Sep1cutyes$G2_2_DCCSSCR
## t = 0, df = 15342, p-value = 1
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.04300482  0.04300482
## sample estimates:
## mean of x mean of y 
##  6.666743  6.666743

Comparing the academic outcomes of children from schools with September 1st cut off rule and from schools not having September 1st cut off rule, there is no difference in reading, math, science, and DCCSSCR score as p-value is > 0.05.

Q3. Does parents' education level have effects on children's academic performance in the early childhood?

# subest 6th semester reading score and parents eduction level for statistic analysis
g2edu <- data.frame("CHILDID"=earlychildhood12$CHILDID,"G2_2_READ"=earlychildhood12$G2_2_READ,"PrtEDU"=earlychildhood12$PrtEDU)

g2edu  %>% mutate_if(is.factor, as.character) -> g2edu

ggplot(data=subset(g2edu,!is.na(g2edu$G2_2_READ)),aes(factor(PrtEDU),G2_2_READ))+
  geom_boxplot(aes(fill=factor(PrtEDU)))+
  xlab("Parent Education Level")+
  ylab("2013 Spring 2nd Grade Reading Score")

# subest 6th semester academic scores and parents eduction level for statistic analysis
g2edu <- data.frame("CHILDID"=earlychildhood12$CHILDID,"G2_2_READ"=earlychildhood12$G2_2_READ, "G2_2_Math"=earlychildhood12$G2_2_Math, "G2_2_SCI"=earlychildhood12$G2_2_SCI,"G2_2_DCCSSCR"=earlychildhood12$G2_2_DCCSSCR,"PrtEDU"=earlychildhood12$PrtEDU)

g2edu  %>% mutate_if(is.factor, as.character) -> g2edu 

g2edu$PrtEDU <- str_replace_all(g2edu$PrtEDU,"2|3|4|5|6|7","above high school")
g2edu$PrtEDU <- str_replace_all(g2edu$PrtEDU,"1","below high school")

ggplot(data=subset(g2edu,!is.na(g2edu$G2_2_READ)),aes(factor(PrtEDU),G2_2_READ))+
  geom_boxplot(aes(fill=factor(PrtEDU)))+
  xlab("Parent Education Level")+
  ylab("2013 Spring 2nd Grade Reading Score")+
  labs(title = "Reading")

PrtblwHisch <- g2edu[which(g2edu$PrtEDU == "below high school"),]
PrtabvHisch <-  g2edu[which(g2edu$PrtEDU == 'above high school'),]
t.test(PrtblwHisch$G2_2_READ,PrtabvHisch$G2_2_READ)

## 
##  Welch Two Sample t-test
## 
## data:  PrtblwHisch$G2_2_READ and PrtabvHisch$G2_2_READ
## t = -6.8754, df = 1063.7, p-value = 1.052e-11
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.1972397 -0.1096544
## sample estimates:
## mean of x mean of y 
##  2.052743  2.206190

ggplot(data=subset(g2edu,!is.na(g2edu$G2_2_Math)),aes(factor(PrtEDU),G2_2_Math))+
  geom_boxplot(aes(fill=factor(PrtEDU)))+
  xlab("Parent Education Level")+
  ylab("2013 Spring 2nd Grade Math Score")+
  labs(title = "Math")

PrtblwHisch <- g2edu[which(g2edu$PrtEDU == "below high school"),]
PrtabvHisch <-  g2edu[which(g2edu$PrtEDU == 'above high school'),]
t.test(PrtabvHisch$G2_2_Math,PrtblwHisch$G2_2_Math)

## 
##  Welch Two Sample t-test
## 
## data:  PrtabvHisch$G2_2_Math and PrtblwHisch$G2_2_Math
## t = 5.4003, df = 1076.1, p-value = 8.184e-08
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.0950871 0.2036204
## sample estimates:
## mean of x mean of y 
##  2.464817  2.315463

ggplot(data=subset(g2edu,!is.na(g2edu$G2_2_SCI)),aes(factor(PrtEDU),G2_2_SCI))+
  geom_boxplot(aes(fill=factor(PrtEDU)))+
  xlab("Parent Education Level")+
  ylab("2013 Spring 2nd Grade Reading Score")+
  labs(title = "Science")

PrtblwHisch <- g2edu[which(g2edu$PrtEDU == "below high school"),]
PrtabvHisch <-  g2edu[which(g2edu$PrtEDU == 'above high school'),]
t.test(PrtblwHisch$G2_2_SCI,PrtabvHisch$G2_2_SCI)

## 
##  Welch Two Sample t-test
## 
## data:  PrtblwHisch$G2_2_SCI and PrtabvHisch$G2_2_SCI
## t = -8.8582, df = 1029, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.3900706 -0.2485934
## sample estimates:
## mean of x mean of y 
##  1.304297  1.623629

ggplot(data=subset(g2edu,!is.na(g2edu$G2_2_DCCSSCR)),aes(factor(PrtEDU),G2_2_DCCSSCR))+
  geom_boxplot(aes(fill=factor(PrtEDU)))+
  xlab("Parent Education Level")+
  ylab("2013 Spring 2nd Grade Reading Score")+
  labs(title = "Dimentional Card Sort")

PrtblwHisch <- g2edu[which(g2edu$PrtEDU == "below high school"),]
PrtabvHisch <-  g2edu[which(g2edu$PrtEDU == 'above high school'),]
t.test(PrtblwHisch$G2_2_DCCSSCR,PrtabvHisch$G2_2_DCCSSCR)

## 
##  Welch Two Sample t-test
## 
## data:  PrtblwHisch$G2_2_DCCSSCR and PrtabvHisch$G2_2_DCCSSCR
## t = -1.5261, df = 1063.5, p-value = 0.1273
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.16343219  0.02043077
## sample estimates:
## mean of x mean of y 
##  6.638675  6.710176

There is significant difference in reading, math, and science scores between the students whose parents highest education level is below high school or beyond high school. However, it is very interesting to see that there is no significant difference in dimentional card sort  scores between these two categories.

Part 5 - Conclusion:

1.Correlations between parent involvement in preschool literacy, scholl activities, and summer learnings and the reading score are weak. So variables on parent involvement related to early literacy are not appropriate predictors to early childhoold academy outcomes(reading sscore).

2.Whether schools has September 1st cut off rule for kindergarten entry or not did not influence students academic outcomes.

3.When looking at the influence of parents' highest education level on the academic outcomes, there was a clear difference in reading, math, and science scores between students whose parents finished high school and those whose parents did not finish high school. But the Dimentional Card Sort score of the students of these two caterogies are the same. As math and science both strongly and positively associates with reading score, reading, math, and science are more litercy-related. However, Dimentional Card Sort has weak correlation to reading score and it may largly reflects IQ. These observations suggested that parents' education level will have positive influence on literacy but not IQ. Literacy intervention in early childhood may improve the achievement of the students who are not performing well in reading, math and science but with no problem in sloving problems like Dimentional Card Sorting.

References:

1.Early Childhood Longitudinal Study (ECLS) program collected by the National Center for Educational Statistics(NCES)(http://nces.ed.gov//)

2.ECLS-K:2011 Kindergarten User's Manual, Public Version PDF File. (https://nces.ed.gov/ecls/dataproducts.asp)

3.ECLS-K:2011 Kindergarten-Second Grade User's Manual, Public Version PDF File.(https://nces.ed.gov/ecls/dataproducts.asp)

Miedel WT, Reynolds AJ. Parent involvement in early intervention for disadvantaged children: Does it matter? Journal of School Psychology. 1999 ; 37: p379-402.
John Schacter and Booli Jo, Learning when school is not in session; a reading summer day-camp intervention to improve the achievement of exiting First-Grade students who are economically disadvantaged. Journal of Research in Reading. 2005; 28(2), p158-169.