The dataset used in this week’s work is the CASchools dataset from the AER package. It is a cross-sectional dataset that records test performance, school characteristics and student demographic backgrounds for school districts in California in 1998 and 1999. There are 420 observations and 14 variables.
Variables:
district - school district code
school - school name
county - factor indicating county
grades - factor indicating grade span of district
students - total enrollment in the district
teachers - number of teachers in the district
calworks - percent qualifying for CalWorks (income assistance)
lunch - percent qualifying for reduced-price lunch
computer - number of computers in the district
expenditure - expenditure per student
income - average income in the district (in USD 1,000)
english - percent of English learners
read - average reading score
math - average math score
The dependent variable (y) for the model is math, the average math score in a district.
The independent variables (x) are students, teachers, calworks, lunch, computer, expenditure, income, english.
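\[ \text{Average Math Score} = \beta_0 + \beta_1\,\text{Students} + \beta_2\,\text{Teachers} + \beta_3\,\text{CalWorks} + \beta_4\,\text{Lunch} + \beta_5\,\text{Computer} + \beta_6\,\text{Expenditure} + \beta_7\,\text{Income} + \beta_8\,\text{English} \]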
# loading the AER package, which provides the CASchools dataset
library(AER)
# loading dataset into R
data("CASchools")
df1 <- CASchools
#checking variables
str(df1)
## 'data.frame': 420 obs. of 14 variables:
## $ district : chr "75119" "61499" "61549" "61457" ...
## $ school : chr "Sunol Glen Unified" "Manzanita Elementary" "Thermalito Union Elementary" "Golden Feather Union Elementary" ...
## $ county : Factor w/ 45 levels "Alameda","Butte",..: 1 2 2 2 2 6 29 11 6 25 ...
## $ grades : Factor w/ 2 levels "KK-06","KK-08": 2 2 2 2 2 2 2 2 2 1 ...
## $ students : num 195 240 1550 243 1335 ...
## $ teachers : num 10.9 11.1 82.9 14 71.5 ...
## $ calworks : num 0.51 15.42 55.03 36.48 33.11 ...
## $ lunch : num 2.04 47.92 76.32 77.05 78.43 ...
## $ computer : num 67 101 169 85 171 25 28 66 35 0 ...
## $ expenditure: num 6385 5099 5502 7102 5236 ...
## $ income : num 22.69 9.82 8.98 8.98 9.08 ...
## $ english : num 0 4.58 30 0 13.86 ...
## $ read : num 692 660 636 652 642 ...
## $ math : num 690 662 651 644 640 ...
summary(df1)
## district school county grades
## Length:420 Length:420 Sonoma : 29 KK-06: 61
## Class :character Class :character Kern : 27 KK-08:359
## Mode :character Mode :character Los Angeles: 27
## Tulare : 24
## San Diego : 21
## Santa Clara: 20
## (Other) :272
## students teachers calworks lunch
## Min. : 81.0 Min. : 4.85 Min. : 0.000 Min. : 0.00
## 1st Qu.: 379.0 1st Qu.: 19.66 1st Qu.: 4.395 1st Qu.: 23.28
## Median : 950.5 Median : 48.56 Median :10.520 Median : 41.75
## Mean : 2628.8 Mean : 129.07 Mean :13.246 Mean : 44.71
## 3rd Qu.: 3008.0 3rd Qu.: 146.35 3rd Qu.:18.981 3rd Qu.: 66.86
## Max. :27176.0 Max. :1429.00 Max. :78.994 Max. :100.00
##
## computer expenditure income english
## Min. : 0.0 Min. :3926 Min. : 5.335 Min. : 0.000
## 1st Qu.: 46.0 1st Qu.:4906 1st Qu.:10.639 1st Qu.: 1.941
## Median : 117.5 Median :5215 Median :13.728 Median : 8.778
## Mean : 303.4 Mean :5312 Mean :15.317 Mean :15.768
## 3rd Qu.: 375.2 3rd Qu.:5601 3rd Qu.:17.629 3rd Qu.:22.970
## Max. :3324.0 Max. :7712 Max. :55.328 Max. :85.540
##
## read math
## Min. :604.5 Min. :605.4
## 1st Qu.:640.4 1st Qu.:639.4
## Median :655.8 Median :652.5
## Mean :655.0 Mean :653.3
## 3rd Qu.:668.7 3rd Qu.:665.9
## Max. :704.0 Max. :709.5
##
# creating a dataframe with required variables only (identifiers, grades and read are dropped)
df2 <- subset(df1, select = -c(district, school, county, grades, read))
\[ y = X\beta + \epsilon \]
y - vector of the dependent variable (math)
X - matrix of feature variables
\(\beta\) - vector of parameters to be estimated
\(\epsilon\) - vector of error terms
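For reference, the closed-form estimator applied in the code below can be obtained by minimising the sum of squared residuals. This is the standard textbook derivation rather than anything specific to this dataset; it assumes that \(X\) has full column rank, so that \(X^\top X\) is invertible, and the symbols \(S(\beta)\) and \(\hat{\beta}\) are introduced here for the objective and the estimate.
\[ S(\beta) = (y - X\beta)^\top (y - X\beta) \]
\[ \frac{\partial S}{\partial \beta} = -2X^\top (y - X\beta) = 0 \quad\Rightarrow\quad X^\top X\hat{\beta} = X^\top y \]
\[ \hat{\beta} = (X^\top X)^{-1} X^\top y \]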
# dependent variable
y <- as.vector(df2$math)
# creating a matrix of feature variables (every column except math, the last column of df2)
X <- as.matrix(df2[-ncol(df2)])
# creating a column of ones, one per observation, for the intercept
int <- rep(x = 1, times = length(y))
# adding intercept column to X
X <- cbind(int, X)
remove(int)
# applying the closed-form solution to X and y
betas <- solve(t(X) %*% X) %*% t(X) %*% y
betas <- round(x = betas, digits = 2)
betas
## [,1]
## int 656.15
## students 0.00
## teachers 0.01
## calworks -0.13
## lunch -0.33
## computer 0.00
## expenditure 0.00
## income 0.70
## english -0.15
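Forming the explicit inverse of \(X^\top X\) works here, but solving the normal equations as a linear system, or using a QR decomposition, is usually preferred numerically. The sketch below illustrates both on simulated data. The names X_sim and y_sim and the coefficient values are hypothetical and are not taken from CASchools. The same calls should apply unchanged to the X and y built above.

```r
# Simulated design matrix with an intercept column (hypothetical data, illustration only)
set.seed(1)
n <- 200
X_sim <- cbind(int = 1, x1 = rnorm(n), x2 = rnorm(n))
y_sim <- as.vector(X_sim %*% c(2, 0.5, -1) + rnorm(n))

# Explicit inverse, as in the closed-form code above
b_inverse <- solve(t(X_sim) %*% X_sim) %*% t(X_sim) %*% y_sim

# Same estimate, solving the normal equations X'X b = X'y as a linear system
b_solve <- solve(crossprod(X_sim), crossprod(X_sim, y_sim))

# QR-based least squares, which avoids forming X'X at all
b_qr <- qr.solve(X_sim, y_sim)

# Check that the three estimates agree up to floating-point tolerance
all.equal(as.vector(b_inverse), as.vector(b_solve))
all.equal(as.vector(b_solve), as.vector(b_qr))
```

lm() itself fits the model through a QR decomposition, so the QR route is the closest of the three to what the regression call does internally.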
# building linear regression model
lm_model <- lm(math ~ ., data=df2)
lm_betas <- round(x = lm_model$coefficients, digits = 2)
lm_betas
## (Intercept) students teachers calworks lunch computer
## 656.15 0.00 0.01 -0.13 -0.33 0.00
## expenditure income english
## 0.00 0.70 -0.15
# using a dataframe to show matrix and lm results
results <- data.frame(matrix_results=betas, lm_results=lm_betas)
print(results)
## matrix_results lm_results
## int 656.15 656.15
## students 0.00 0.00
## teachers 0.01 0.01
## calworks -0.13 -0.13
## lunch -0.33 -0.33
## computer 0.00 0.00
## expenditure 0.00 0.00
## income 0.70 0.70
## english -0.15 -0.15
The coefficients obtained with matrix algebra are the same, to two decimal places, as those produced by the lm() regression model.
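Because both sets of coefficients were rounded to two decimals before being compared, small slopes such as students, computer and expenditure all print as 0.00, and the table cannot show agreement beyond that precision. A minimal sketch of a comparison on unrounded values follows. It uses simulated data, and the data frame sim and its columns spend, share and score are hypothetical names chosen for illustration. The same pattern could be applied to the unrounded matrix estimate and coef(lm_model).

```r
# Simulated data (hypothetical): the slope on 'spend' is deliberately tiny
set.seed(42)
n <- 300
sim <- data.frame(spend = rnorm(n, mean = 50, sd = 10),
                  share = runif(n, min = 0, max = 100))
sim$score <- 650 + 0.002 * sim$spend - 0.3 * sim$share + rnorm(n, sd = 0.05)

# Closed-form estimate from the design matrix (intercept column added by hand)
X_sim <- cbind(int = 1, as.matrix(sim[c("spend", "share")]))
b_matrix <- solve(crossprod(X_sim), crossprod(X_sim, sim$score))

# Estimate from lm()
b_lm <- coef(lm(score ~ spend + share, data = sim))

# Compare unrounded values; all.equal() allows for floating-point error
all.equal(as.vector(b_matrix), unname(b_lm))

# round() collapses the small slope to zero, signif() keeps its leading digits
round(b_lm, 2)
signif(b_lm, 3)
```

Using signif() rather than round() when printing keeps coefficients on very different scales readable in the same table.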