Analytics Edge: Unit 2 - Reading Test Scores

Reading Test Scores

Background Information on the Dataset

The Programme for International Student Assessment (PISA) is a test given every three years to 15-year-old students from around the world to evaluate their performance in mathematics, reading, and science. This test provides a quantitative way to compare the performance of students from different parts of the world. In this homework assignment, we will predict the reading scores of students from the United States of America on the 2009 PISA exam.

The datasets pisa2009train.csv and pisa2009test.csv contain information about the demographics and schools for American students taking the exam, derived from 2009 PISA Public-Use Data Files distributed by the United States National Center for Education Statistics (NCES). While the datasets are not supposed to contain identifying information about students taking the test, by using the data you are bound by the NCES data use agreement, which prohibits any attempt to determine the identity of any student in the datasets.

Each row in the datasets pisa2009train.csv and pisa2009test.csv represents one student taking the exam. The datasets have the following variables:

grade: The grade in school of the student (most 15-year-olds in America are in 10th grade)

male: Whether the student is male (1/0)
raceeth: The race/ethnicity composite of the student
preschool: Whether the student attended preschool (1/0)
expectBachelors: Whether the student expects to obtain a bachelor’s degree (1/0)
motherHS: Whether the student’s mother completed high school (1/0)
motherBachelors: Whether the student’s mother obtained a bachelor’s degree (1/0)
motherWork: Whether the student’s mother has part-time or full-time work (1/0)
fatherHS: Whether the student’s father completed high school (1/0)
fatherBachelors: Whether the student’s father obtained a bachelor’s degree (1/0)
fatherWork: Whether the student’s father has part-time or full-time work (1/0)
selfBornUS: Whether the student was born in the United States of America (1/0)
motherBornUS: Whether the student’s mother was born in the United States of America (1/0)
fatherBornUS: Whether the student’s father was born in the United States of America (1/0)
englishAtHome: Whether the student speaks English at home (1/0)
computerForSchoolwork: Whether the student has access to a computer for schoolwork (1/0)
read30MinsADay: Whether the student reads for pleasure for 30 minutes/day (1/0)
minutesPerWeekEnglish: The number of minutes per week the student spend in English class
studentsInEnglish: The number of students in this student’s English class at school
schoolHasLibrary: Whether this student’s school has a library (1/0)
publicSchool: Whether this student attends a public school (1/0)
urban: Whether this student’s school is in an urban area (1/0)
schoolSize: The number of students in this student’s school
readingScore: The student’s reading score, on a 1000-point scale

Dataset Analysis

Load the training and testing sets using the read.csv() function, and save them as variables with the names pisaTrain and pisaTest.

# Load in the training and testing sets
pisaTrain = read.csv("pisa2009train.csv")
pisaTest = read.csv("pisa2009test.csv")

How many students are there in the training set?

# Calculate the number of rows
nrow(pisaTrain)
## [1] 3663

3663 students.

Using tapply() on pisaTrain, what is the average reading test score of males?

# Compare two groups using a statistical measure
z = tapply(pisaTrain$readingScore, pisaTrain$male, mean) 
kable(z)

	x
0	512.9406
1	483.5325

483.5325 is the average reading test score of males

Using tapply() on pisaTrain, what is the average reading test score of females?

# Compare two groups using a statistical measure
tapply(pisaTrain$readingScore, pisaTrain$male, mean)

512.9406 is the average reading test score of females

Which variables are missing data in at least one observation in the training set?

# Outputs a summary of the dataset
z = summary(pisaTrain)
kable(z)

grade	male	raceeth	preschool	expectBachelors	motherHS	motherBachelors	motherWork	fatherHS	fatherBachelors	fatherWork	selfBornUS	motherBornUS	fatherBornUS	englishAtHome	computerForSchoolwork	read30MinsADay	minutesPerWeekEnglish	studentsInEnglish	schoolHasLibrary	publicSchool	urban	schoolSize	readingScore
Min. : 8.00	Min. :0.0000	White :2015	Min. :0.0000	Min. :0.0000	Min. :0.00	Min. :0.0000	Min. :0.0000	Min. :0.0000	Min. :0.0000	Min. :0.0000	Min. :0.0000	Min. :0.0000	Min. :0.0000	Min. :0.0000	Min. :0.0000	Min. :0.0000	Min. : 0.0	Min. : 1.0	Min. :0.0000	Min. :0.0000	Min. :0.0000	Min. : 100	Min. :168.6
1st Qu.:10.00	1st Qu.:0.0000	Hispanic : 834	1st Qu.:0.0000	1st Qu.:1.0000	1st Qu.:1.00	1st Qu.:0.0000	1st Qu.:0.0000	1st Qu.:1.0000	1st Qu.:0.0000	1st Qu.:1.0000	1st Qu.:1.0000	1st Qu.:1.0000	1st Qu.:1.0000	1st Qu.:1.0000	1st Qu.:1.0000	1st Qu.:0.0000	1st Qu.: 225.0	1st Qu.:20.0	1st Qu.:1.0000	1st Qu.:1.0000	1st Qu.:0.0000	1st Qu.: 712	1st Qu.:431.7
Median :10.00	Median :1.0000	Black : 444	Median :1.0000	Median :1.0000	Median :1.00	Median :0.0000	Median :1.0000	Median :1.0000	Median :0.0000	Median :1.0000	Median :1.0000	Median :1.0000	Median :1.0000	Median :1.0000	Median :1.0000	Median :0.0000	Median : 250.0	Median :25.0	Median :1.0000	Median :1.0000	Median :0.0000	Median :1212	Median :499.7
Mean :10.09	Mean :0.5111	Asian : 143	Mean :0.7228	Mean :0.7859	Mean :0.88	Mean :0.3481	Mean :0.7345	Mean :0.8593	Mean :0.3319	Mean :0.8531	Mean :0.9313	Mean :0.7725	Mean :0.7668	Mean :0.8717	Mean :0.8994	Mean :0.2899	Mean : 266.2	Mean :24.5	Mean :0.9676	Mean :0.9339	Mean :0.3849	Mean :1369	Mean :497.9
3rd Qu.:10.00	3rd Qu.:1.0000	More than one race: 124	3rd Qu.:1.0000	3rd Qu.:1.0000	3rd Qu.:1.00	3rd Qu.:1.0000	3rd Qu.:1.0000	3rd Qu.:1.0000	3rd Qu.:1.0000	3rd Qu.:1.0000	3rd Qu.:1.0000	3rd Qu.:1.0000	3rd Qu.:1.0000	3rd Qu.:1.0000	3rd Qu.:1.0000	3rd Qu.:1.0000	3rd Qu.: 300.0	3rd Qu.:30.0	3rd Qu.:1.0000	3rd Qu.:1.0000	3rd Qu.:1.0000	3rd Qu.:1900	3rd Qu.:566.2
Max. :12.00	Max. :1.0000	(Other) : 68	Max. :1.0000	Max. :1.0000	Max. :1.00	Max. :1.0000	Max. :1.0000	Max. :1.0000	Max. :1.0000	Max. :1.0000	Max. :1.0000	Max. :1.0000	Max. :1.0000	Max. :1.0000	Max. :1.0000	Max. :1.0000	Max. :2400.0	Max. :75.0	Max. :1.0000	Max. :1.0000	Max. :1.0000	Max. :6694	Max. :746.0
NA	NA	NA’s : 35	NA’s :56	NA’s :62	NA’s :97	NA’s :397	NA’s :93	NA’s :245	NA’s :569	NA’s :233	NA’s :69	NA’s :71	NA’s :113	NA’s :71	NA’s :65	NA’s :34	NA’s :186	NA’s :249	NA’s :143	NA	NA	NA’s :162	NA

Because most variables are collected from study participants via survey, it is expected that most questions will have at least one missing value.

Remove the missing values

Linear regression discards observations with missing data, so we will remove all such observations from the training and testing sets. Later in the course, we will learn about imputation, which deals with missing data by filling in missing values with plausible information.

Type the following commands into your R console to remove observations with any missing value from pisaTrain and pisaTest:

# Discards observations with missing data
pisaTrain = na.omit(pisaTrain)
pisaTest = na.omit(pisaTest)

How many observations are now in the training set?

# Calculates the number of observations
pisaTrain = na.omit(pisaTrain)
nrow(pisaTrain)
## [1] 2414

2414 observations

How many observations are now in the testing set?

# Calculates the number of observations
pisaTest = na.omit(pisaTest)
nrow(pisaTest)
## [1] 990

990 observations

Factor Variables

Factor variables are variables that take on a discrete set of values, like the “Region” variable in the WHO dataset from the second lecture of Unit 1. This is an unordered factor because there isn’t any natural ordering between the levels. An ordered factor has a natural ordering between the levels (an example would be the classifications “large,” “medium,” and “small”).

Which of the following variables is an unordered factor with at least 3 levels? Which of the following variables is an ordered factor with at least 3 levels?

Male only has 2 levels (1 and 0). There is no natural ordering between the different values of raceeth, so it is an unordered factor. Meanwhile, we can order grades (8, 9, 10, 11, 12), so it is an ordered factor.

Unordered factors in regression models

To include unordered factors in a linear regression model, we define one level as the “reference level” and add a binary variable for each of the remaining levels. In this way, a factor with n levels is replaced by n-1 binary variables. The reference level is typically selected to be the most frequently occurring level in the dataset.

As an example, consider the unordered factor variable “color”, with levels “red”, “green”, and “blue”. If “green” were the reference level, then we would add binary variables “colorred” and “colorblue” to a linear regression problem. All red examples would have colorred=1 and colorblue=0. All blue examples would have colorred=0 and colorblue=1. All green examples would have colorred=0 and colorblue=0.

Now, consider the variable “raceeth” in our problem, which has levels “American Indian/Alaska Native”, “Asian”, “Black”, “Hispanic”, “More than one race”, “Native Hawaiian/Other Pacific Islander”, and “White”. Because it is the most common in our population, we will select White as the reference level.

Which binary variables will be included in the regression model?

We create a binary variable for each level except the reference level, so we would create all these variables except for raceethWhite.

Example unordered factors

Consider again adding our unordered factor race to the regression model with reference level “White”.

For a student who is Asian, which binary variables would be set to 0? All remaining variables will be set to 1. For a student who is white, which binary variables would be set to 0? All remaining variables will be set to 1.

An Asian student will have raceethAsian set to 1 and all other raceeth binary variables set to 0. Because “White” is the reference level, a white student will have all raceeth binary variables set to 0.

Building a model

Because the race variable takes on text values, it was loaded as a factor variable when we read in the dataset with read.csv() – you can see this when you run str(pisaTrain) or str(pisaTest). However, by default R selects the first level alphabetically (“American Indian/Alaska Native”) as the reference level of our factor instead of the most common level (“White”). Set the reference level of the factor by typing the following two lines in your R console:

# Set the reference level of the factor
pisaTrain$raceeth = relevel(pisaTrain$raceeth, "White")
pisaTest$raceeth = relevel(pisaTest$raceeth, "White")

Now, build a linear regression model (call it lmScore) using the training set to predict readingScore using all the remaining variables.

It would be time-consuming to type all the variables, but R provides the shorthand notation “readingScore ~ .” to mean “predict readingScore using all the other variables in the data frame.” The period is used to replace listing out all of the independent variables. As an example, if your dependent variable is called “Y”, your independent variables are called “X1”, “X2”, and “X3”, and your training data set is called “Train”, instead of the regular notation:

LinReg = lm(Y ~ X1 + X2 + X3, data = Train)

You would use the following command to build your model:

LinReg = lm(Y ~ ., data = Train)

What is the Multiple R² value?

# Create linear regression model
lmScore = lm(readingScore~., data=pisaTrain)
# Output the summary
summary(lmScore)
## 
## Call:
## lm(formula = readingScore ~ ., data = pisaTrain)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -247.44  -48.86    1.86   49.77  217.18 
## 
## Coefficients:
##                                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                                   143.766333  33.841226   4.248 2.24e-05 ***
## grade                                          29.542707   2.937399  10.057  < 2e-16 ***
## male                                          -14.521653   3.155926  -4.601 4.42e-06 ***
## raceethAmerican Indian/Alaska Native          -67.277327  16.786935  -4.008 6.32e-05 ***
## raceethAsian                                   -4.110325   9.220071  -0.446  0.65578    
## raceethBlack                                  -67.012347   5.460883 -12.271  < 2e-16 ***
## raceethHispanic                               -38.975486   5.177743  -7.528 7.29e-14 ***
## raceethMore than one race                     -16.922522   8.496268  -1.992  0.04651 *  
## raceethNative Hawaiian/Other Pacific Islander  -5.101601  17.005696  -0.300  0.76421    
## preschool                                      -4.463670   3.486055  -1.280  0.20052    
## expectBachelors                                55.267080   4.293893  12.871  < 2e-16 ***
## motherHS                                        6.058774   6.091423   0.995  0.32001    
## motherBachelors                                12.638068   3.861457   3.273  0.00108 ** 
## motherWork                                     -2.809101   3.521827  -0.798  0.42517    
## fatherHS                                        4.018214   5.579269   0.720  0.47147    
## fatherBachelors                                16.929755   3.995253   4.237 2.35e-05 ***
## fatherWork                                      5.842798   4.395978   1.329  0.18393    
## selfBornUS                                     -3.806278   7.323718  -0.520  0.60331    
## motherBornUS                                   -8.798153   6.587621  -1.336  0.18182    
## fatherBornUS                                    4.306994   6.263875   0.688  0.49178    
## englishAtHome                                   8.035685   6.859492   1.171  0.24153    
## computerForSchoolwork                          22.500232   5.702562   3.946 8.19e-05 ***
## read30MinsADay                                 34.871924   3.408447  10.231  < 2e-16 ***
## minutesPerWeekEnglish                           0.012788   0.010712   1.194  0.23264    
## studentsInEnglish                              -0.286631   0.227819  -1.258  0.20846    
## schoolHasLibrary                               12.215085   9.264884   1.318  0.18749    
## publicSchool                                  -16.857475   6.725614  -2.506  0.01226 *  
## urban                                          -0.110132   3.962724  -0.028  0.97783    
## schoolSize                                      0.006540   0.002197   2.977  0.00294 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 73.81 on 2385 degrees of freedom
## Multiple R-squared:  0.3251, Adjusted R-squared:  0.3172 
## F-statistic: 41.04 on 28 and 2385 DF,  p-value: < 2.2e-16

R² = 0.3251

What is the training-set root-mean squared error (RMSE) of lmScore?

# Compute RMSE
SSE = sum(lmScore$residuals^2)
RMSE = sqrt(SSE / nrow(pisaTrain))
RMSE
## [1] 73.36555

RMSE = 73.36555

Comparing predictions for similar students

Consider two students A and B. They have all variable values the same, except that student A is in grade 11 and student B is in grade 9.

What is the predicted reading score of student A minus the predicted reading score of student B?

The coefficient 29.54 on grade is the difference in reading score between two students who are identical other than having a difference in grade of 1. Because A and B have a difference in grade of 2, the model predicts that student A has a reading score that is 2*29.54 larger.

What is the meaning of the coefficient associated with variable raceethAsian?

The only difference between an Asian student and white student with otherwise identical variables is that the former has raceethAsian=1 and the latter has raceethAsian=0. The predicted reading score for these two students will differ by the coefficient on the variable raceethAsian.

Based on the significance codes, which variables are candidates for removal from the model?

# Output the summary
summary(lmScore)
## 
## Call:
## lm(formula = readingScore ~ ., data = pisaTrain)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -247.44  -48.86    1.86   49.77  217.18 
## 
## Coefficients:
##                                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                                   143.766333  33.841226   4.248 2.24e-05 ***
## grade                                          29.542707   2.937399  10.057  < 2e-16 ***
## male                                          -14.521653   3.155926  -4.601 4.42e-06 ***
## raceethAmerican Indian/Alaska Native          -67.277327  16.786935  -4.008 6.32e-05 ***
## raceethAsian                                   -4.110325   9.220071  -0.446  0.65578    
## raceethBlack                                  -67.012347   5.460883 -12.271  < 2e-16 ***
## raceethHispanic                               -38.975486   5.177743  -7.528 7.29e-14 ***
## raceethMore than one race                     -16.922522   8.496268  -1.992  0.04651 *  
## raceethNative Hawaiian/Other Pacific Islander  -5.101601  17.005696  -0.300  0.76421    
## preschool                                      -4.463670   3.486055  -1.280  0.20052    
## expectBachelors                                55.267080   4.293893  12.871  < 2e-16 ***
## motherHS                                        6.058774   6.091423   0.995  0.32001    
## motherBachelors                                12.638068   3.861457   3.273  0.00108 ** 
## motherWork                                     -2.809101   3.521827  -0.798  0.42517    
## fatherHS                                        4.018214   5.579269   0.720  0.47147    
## fatherBachelors                                16.929755   3.995253   4.237 2.35e-05 ***
## fatherWork                                      5.842798   4.395978   1.329  0.18393    
## selfBornUS                                     -3.806278   7.323718  -0.520  0.60331    
## motherBornUS                                   -8.798153   6.587621  -1.336  0.18182    
## fatherBornUS                                    4.306994   6.263875   0.688  0.49178    
## englishAtHome                                   8.035685   6.859492   1.171  0.24153    
## computerForSchoolwork                          22.500232   5.702562   3.946 8.19e-05 ***
## read30MinsADay                                 34.871924   3.408447  10.231  < 2e-16 ***
## minutesPerWeekEnglish                           0.012788   0.010712   1.194  0.23264    
## studentsInEnglish                              -0.286631   0.227819  -1.258  0.20846    
## schoolHasLibrary                               12.215085   9.264884   1.318  0.18749    
## publicSchool                                  -16.857475   6.725614  -2.506  0.01226 *  
## urban                                          -0.110132   3.962724  -0.028  0.97783    
## schoolSize                                      0.006540   0.002197   2.977  0.00294 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 73.81 on 2385 degrees of freedom
## Multiple R-squared:  0.3251, Adjusted R-squared:  0.3172 
## F-statistic: 41.04 on 28 and 2385 DF,  p-value: < 2.2e-16

Predicting unseen data

Using the “predict” function and supplying the “newdata” argument, use the lmScore model to predict the reading scores of students in pisaTest. Call this vector of predictions “predTest”. Do not change the variables in the model (for example, do not remove variables that we found were not significant in the previous part of this problem). Use the summary function to describe the test set predictions.

# Make predictions using the linear regression model on the test set
predTest = predict(lmScore, newdata=pisaTest)

What is the range between the maximum and minimum predicted reading score on the test set?

# Calculates the range
range = max(predTest) - min(predTest)
range
## [1] 284.4683

Range = 284.5

What is the sum of squared errors (SSE) of lmScore on the testing set?

# Calculate SSE
sum((predTest-pisaTest$readingScore)^2)
## [1] 5762082

SSE = 5762082

What is the root-mean squared error (RMSE) of lmScore on the testing set?

RMSE = sqrt(mean((predTest-pisaTest$readingScore)^2))

RMSE = 76.29079

What is the predicted test score used in the baseline model? Remember to compute this value using the training set and not the test set.

# Compute the baseline
baseline = mean(pisaTrain$readingScore) 
baseline
## [1] 517.9629

Baseline = 517.9629

What is the sum of squared errors of the baseline model on the testing set?

# Calculate SSE
sum((baseline-pisaTest$readingScore)^2) 
## [1] 7802354

SSE = 7802354

What is the test-set R-squared value of lmScore?

# Calculate test set R^2
Rsquared = 1 - sum((predTest-pisaTest$readingScore)^2)/sum((baseline-pisaTest$readingScore)^2)
Rsquared
## [1] 0.2614944

R² = 0.2614944

Analytics Edge: Unit 2 - Reading Test Scores

Sulman Khan

October 26, 2018

Reading Test Scores

Background Information on the Dataset

Dataset Analysis

How many students are there in the training set?

Using tapply() on pisaTrain, what is the average reading test score of males?

Using tapply() on pisaTrain, what is the average reading test score of females?

Which variables are missing data in at least one observation in the training set?

Remove the missing values

How many observations are now in the training set?

How many observations are now in the testing set?

Factor Variables

Which of the following variables is an unordered factor with at least 3 levels? Which of the following variables is an ordered factor with at least 3 levels?

Unordered factors in regression models

Which binary variables will be included in the regression model?

Example unordered factors

For a student who is Asian, which binary variables would be set to 0? All remaining variables will be set to 1. For a student who is white, which binary variables would be set to 0? All remaining variables will be set to 1.

Building a model

What is the Multiple R² value?

What is the training-set root-mean squared error (RMSE) of lmScore?

Comparing predictions for similar students

What is the predicted reading score of student A minus the predicted reading score of student B?

What is the meaning of the coefficient associated with variable raceethAsian?

Based on the significance codes, which variables are candidates for removal from the model?

Predicting unseen data

What is the range between the maximum and minimum predicted reading score on the test set?

What is the sum of squared errors (SSE) of lmScore on the testing set?

What is the root-mean squared error (RMSE) of lmScore on the testing set?

What is the predicted test score used in the baseline model? Remember to compute this value using the training set and not the test set.

What is the sum of squared errors of the baseline model on the testing set?

What is the test-set R-squared value of lmScore?

Analytics Edge: Unit 2 - Reading Test Scores

Sulman Khan

October 26, 2018

Reading Test Scores

Background Information on the Dataset

Dataset Analysis

How many students are there in the training set?

Using tapply() on pisaTrain, what is the average reading test score of males?

Using tapply() on pisaTrain, what is the average reading test score of females?

Which variables are missing data in at least one observation in the training set?

Remove the missing values

How many observations are now in the training set?

How many observations are now in the testing set?

Factor Variables

Which of the following variables is an unordered factor with at least 3 levels? Which of the following variables is an ordered factor with at least 3 levels?

Unordered factors in regression models

Which binary variables will be included in the regression model?

Example unordered factors

For a student who is Asian, which binary variables would be set to 0? All remaining variables will be set to 1. For a student who is white, which binary variables would be set to 0? All remaining variables will be set to 1.

Building a model

What is the Multiple R2 value?

What is the training-set root-mean squared error (RMSE) of lmScore?

Comparing predictions for similar students

What is the predicted reading score of student A minus the predicted reading score of student B?

What is the meaning of the coefficient associated with variable raceethAsian?

Based on the significance codes, which variables are candidates for removal from the model?

Predicting unseen data

What is the range between the maximum and minimum predicted reading score on the test set?

What is the sum of squared errors (SSE) of lmScore on the testing set?

What is the root-mean squared error (RMSE) of lmScore on the testing set?

What is the predicted test score used in the baseline model? Remember to compute this value using the training set and not the test set.

What is the sum of squared errors of the baseline model on the testing set?

What is the test-set R-squared value of lmScore?

What is the Multiple R² value?