Data

The dataset used in this week’s work is the CASchools dataset from the AER package. It is a cross-sectional dataset that records test performance, school characteristics, and student demographic backgrounds for school districts in California for 1998-1999. There are 420 observations and 14 variables.

Variables:

district - school district code
school - school name
county - factor indicating county
grades - factor indicating grade span of district
students - total enrollment in the district
teachers - number of teachers in the district
calworks - percent qualifying for CalWorks (income assistance)
lunch - percent qualifying for reduced-price lunch
computer - number of computers in the district
expenditure - expenditure per student
income - average income in the district (in thousands of USD)
english - percent of English learners
read - average reading score
math - average math score

The dependent variable (y) for the model is math, the average math score in a district.

The independent variables (x) are students, teachers, calworks, lunch, computer, expenditure, income, and english.

Estimating Equation

\[ \text{Average Math Score} = \beta_0 + \beta_1 \text{Students} + \beta_2 \text{Teachers} + \beta_3 \text{CalWorks} + \beta_4 \text{Lunch} + \dots + \beta_7 \text{Income} + \beta_8 \text{English} \]

Checking Dataset

# loading the AER package (which provides CASchools) and the dataset into R

library(AER)
data("CASchools")
df1 <- CASchools

#checking variables 

str(df1)

## 'data.frame':    420 obs. of  14 variables:
##  $ district   : chr  "75119" "61499" "61549" "61457" ...
##  $ school     : chr  "Sunol Glen Unified" "Manzanita Elementary" "Thermalito Union Elementary" "Golden Feather Union Elementary" ...
##  $ county     : Factor w/ 45 levels "Alameda","Butte",..: 1 2 2 2 2 6 29 11 6 25 ...
##  $ grades     : Factor w/ 2 levels "KK-06","KK-08": 2 2 2 2 2 2 2 2 2 1 ...
##  $ students   : num  195 240 1550 243 1335 ...
##  $ teachers   : num  10.9 11.1 82.9 14 71.5 ...
##  $ calworks   : num  0.51 15.42 55.03 36.48 33.11 ...
##  $ lunch      : num  2.04 47.92 76.32 77.05 78.43 ...
##  $ computer   : num  67 101 169 85 171 25 28 66 35 0 ...
##  $ expenditure: num  6385 5099 5502 7102 5236 ...
##  $ income     : num  22.69 9.82 8.98 8.98 9.08 ...
##  $ english    : num  0 4.58 30 0 13.86 ...
##  $ read       : num  692 660 636 652 642 ...
##  $ math       : num  690 662 651 644 640 ...

summary(df1)

##    district            school                  county      grades   
##  Length:420         Length:420         Sonoma     : 29   KK-06: 61  
##  Class :character   Class :character   Kern       : 27   KK-08:359  
##  Mode  :character   Mode  :character   Los Angeles: 27              
##                                        Tulare     : 24              
##                                        San Diego  : 21              
##                                        Santa Clara: 20              
##                                        (Other)    :272              
##     students          teachers          calworks          lunch       
##  Min.   :   81.0   Min.   :   4.85   Min.   : 0.000   Min.   :  0.00  
##  1st Qu.:  379.0   1st Qu.:  19.66   1st Qu.: 4.395   1st Qu.: 23.28  
##  Median :  950.5   Median :  48.56   Median :10.520   Median : 41.75  
##  Mean   : 2628.8   Mean   : 129.07   Mean   :13.246   Mean   : 44.71  
##  3rd Qu.: 3008.0   3rd Qu.: 146.35   3rd Qu.:18.981   3rd Qu.: 66.86  
##  Max.   :27176.0   Max.   :1429.00   Max.   :78.994   Max.   :100.00  
##                                                                       
##     computer       expenditure       income          english      
##  Min.   :   0.0   Min.   :3926   Min.   : 5.335   Min.   : 0.000  
##  1st Qu.:  46.0   1st Qu.:4906   1st Qu.:10.639   1st Qu.: 1.941  
##  Median : 117.5   Median :5215   Median :13.728   Median : 8.778  
##  Mean   : 303.4   Mean   :5312   Mean   :15.317   Mean   :15.768  
##  3rd Qu.: 375.2   3rd Qu.:5601   3rd Qu.:17.629   3rd Qu.:22.970  
##  Max.   :3324.0   Max.   :7712   Max.   :55.328   Max.   :85.540  
##                                                                   
##       read            math      
##  Min.   :604.5   Min.   :605.4  
##  1st Qu.:640.4   1st Qu.:639.4  
##  Median :655.8   Median :652.5  
##  Mean   :655.0   Mean   :653.3  
##  3rd Qu.:668.7   3rd Qu.:665.9  
##  Max.   :704.0   Max.   :709.5  
##

# creating a dataframe with required variables only 

df2 <- subset(df1, select = -c(district, school, county, grades, read))

Model

\[ y = X\beta + \epsilon \]

y - vector of the dependent variable (math)

X - matrix of feature variables, including a column of ones for the intercept

\(\beta\) - vector of parameters to be estimated

\(\epsilon\) - vector of error terms
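
The closed-form estimator applied in the Matrix Algebra section follows from minimizing the sum of squared residuals. A short derivation is sketched here, under the assumption that \(X\) has full column rank so that \(X^\top X\) is invertible; the symbols \(S\) and \(\hat\beta\) are introduced only for this sketch.

\[ S(\beta) = (y - X\beta)^\top (y - X\beta) = y^\top y - 2\beta^\top X^\top y + \beta^\top X^\top X \beta \]

\[ \frac{\partial S}{\partial \beta} = -2X^\top y + 2X^\top X\beta = 0 \quad\Rightarrow\quad X^\top X \hat\beta = X^\top y \]

\[ \hat\beta = (X^\top X)^{-1} X^\top y \]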

# dependent variable 

y <- as.vector(df2$math)

# creating a matrix of feature variables: every column of df2 except the last one (math)

X <- as.matrix(df2[-ncol(df2)])

# creating a column of ones, one per observation, for the intercept

int <- rep(x = 1, times = length(y))

# adding intercept column to X

X <- cbind(int, X)
remove(int)
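
As a cross-check on the manual construction above, base R's model.matrix() can build the same design matrix from a formula. The sketch below uses a small simulated data frame (toy, x1, x2 and y are hypothetical names, not part of CASchools) so that it runs without the AER package.

```r
# Simulated stand-in for df2: two predictors, with the response in the last column.
set.seed(1)
toy <- data.frame(x1 = rnorm(5), x2 = runif(5))
toy$y <- 1 + 2 * toy$x1 - toy$x2 + rnorm(5)

# Manual construction, mirroring the steps above.
X_manual <- cbind(int = 1, as.matrix(toy[-ncol(toy)]))

# model.matrix() builds the design matrix from a formula;
# its intercept column is named "(Intercept)" rather than "int".
X_formula <- model.matrix(y ~ ., data = toy)

# Compare the numeric content only, ignoring names and extra attributes.
same <- all.equal(unname(X_manual), unname(X_formula), check.attributes = FALSE)
same
```

The two matrices differ only in column names and attributes, so either one could be passed to the matrix formula.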

Matrix Algebra

# applying the closed-form OLS solution: betas = (X'X)^(-1) X'y

betas <- solve(t(X) %*% X) %*% t(X) %*% y

betas <- round(x = betas,digits = 2)

betas

##               [,1]
## int         656.15
## students      0.00
## teachers      0.01
## calworks     -0.13
## lunch        -0.33
## computer      0.00
## expenditure   0.00
## income        0.70
## english      -0.15
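
Forming the explicit inverse of X'X matches the textbook formula, but it is not the only way to obtain the same estimates. The sketch below, on simulated data with hypothetical names, compares it with two alternatives from base R: solving the normal equations directly with solve() and crossprod(), and a QR-based least-squares fit with qr.solve(). lm() also relies on a QR decomposition internally.

```r
# Simulated data: an intercept column plus two predictors (hypothetical names).
set.seed(42)
n <- 200
X <- cbind(int = 1, x1 = rnorm(n), x2 = rnorm(n))
y <- as.vector(X %*% c(2, 0.5, -1) + rnorm(n))

# Textbook form: explicit inverse of X'X.
b_inverse <- solve(t(X) %*% X) %*% t(X) %*% y

# Solve the normal equations (X'X) b = X'y without forming the inverse.
b_normal <- solve(crossprod(X), crossprod(X, y))

# Least-squares fit through a QR decomposition of X.
b_qr <- qr.solve(X, y)

# Both comparisons should report TRUE up to numerical tolerance.
all.equal(as.vector(b_inverse), as.vector(b_normal))
all.equal(as.vector(b_inverse), as.vector(b_qr))
```

On a well-conditioned problem like this one the three approaches agree; the alternatives mainly matter when the predictors are strongly correlated or on very different scales.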

lm() command

# building linear regression model

lm_model <- lm(math ~ ., data=df2)

lm_betas <- round(x = lm_model$coefficients, digits = 2)

lm_betas

## (Intercept)    students    teachers    calworks       lunch    computer 
##      656.15        0.00        0.01       -0.13       -0.33        0.00 
## expenditure      income     english 
##        0.00        0.70       -0.15
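
Several coefficients print as 0.00 because round() keeps two decimal places, which hides small per-unit effects of variables measured on large scales. The sketch below illustrates the display issue on simulated data (spend and score are hypothetical names and values, not taken from CASchools) and shows signif() as one possible alternative.

```r
# Simulated regressor on a dollar-like scale with a small per-unit effect.
set.seed(7)
n <- 300
spend <- runif(n, 4000, 7500)
score <- 640 + 0.003 * spend + rnorm(n, sd = 5)
fit <- lm(score ~ spend)

# Rounding to two decimals displays the slope as zero.
round(coef(fit), 2)

# Keeping three significant digits preserves the small slope.
signif(coef(fit), 3)
```

Rescaling the regressor, for example measuring spending in thousands of dollars, would be another way to make such a coefficient readable.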

# using a dataframe to show matrix and lm results 
results <- data.frame(matrix_results=betas, lm_results=lm_betas)

print(results)

##             matrix_results lm_results
## int                 656.15     656.15
## students              0.00       0.00
## teachers              0.01       0.01
## calworks             -0.13      -0.13
## lunch                -0.33      -0.33
## computer              0.00       0.00
## expenditure           0.00       0.00
## income                0.70       0.70
## english              -0.15      -0.15

The coefficients obtained using matrix algebra are the same, to two decimal places, as those produced by lm().
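
Comparing rounded values can hide small discrepancies, so a stricter check compares the unrounded estimates. The following is a minimal sketch on simulated data (d, x1, x2 and y are hypothetical names) that uses all.equal(), which tests equality up to a numerical tolerance.

```r
# Simulated data with known coefficients (hypothetical names).
set.seed(123)
n <- 100
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- 3 + 1.5 * d$x1 - 0.7 * d$x2 + rnorm(n)

# Matrix-algebra estimates, using the same formula as in the text.
X <- cbind(int = 1, as.matrix(d[c("x1", "x2")]))
b_matrix <- solve(t(X) %*% X) %*% t(X) %*% d$y

# Estimates from lm().
b_lm <- coef(lm(y ~ x1 + x2, data = d))

# Compare the unrounded estimates and report the largest absolute gap.
all.equal(as.vector(b_matrix), unname(b_lm))
max_diff <- max(abs(as.vector(b_matrix) - unname(b_lm)))
max_diff
```

Any remaining gap is floating-point noise, since the two approaches compute the same estimator by different numerical routes.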

Discussion: OLS