Dataset:

States.csv

Description:

Education and Related Statistics for the U.S. States

Dimension:

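The States data frame has 51 rows and 8 columns. The observations are the U.S. states and Washington, D.C. As loaded here, the first column, X, holds the state abbreviation; the remaining seven columns are defined below.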

Data Definition:

region U. S. Census regions. A factor with levels: ENC, East North Central; ESC, East South Central; MA, Mid-Atlantic; MTN, Mountain; NE, New England; PAC, Pacific; SA, South Atlantic; WNC, West North Central; WSC, West South Central.

pop Population: in 1,000s.

SATV Average score of graduating high-school students in the state on the verbal component of the Scholastic Aptitude Test (a standard university admission exam).

SATM Average score of graduating high-school students in the state on the math component of the Scholastic Aptitude Test.

percent Percentage of graduating high-school students in the state who took the SAT exam.

dollars State spending on public education, in $1000s per student.

pay Average teacher’s salary in the state, in $1000s.

I’ll fit a linear regression of the average SAT math score (SATM) as a function of state spending on public education (dollars).

# Read the data
States <- read.csv("https://raw.githubusercontent.com/L-Velasco/DATA605_SP19/master/HW/States.csv", stringsAsFactors = FALSE)

str(States)
## 'data.frame':    51 obs. of  8 variables:
##  $ X      : chr  "AL" "AK" "AZ" "AR" ...
##  $ region : chr  "ESC" "PAC" "MTN" "WSC" ...
##  $ pop    : int  4041 550 3665 2351 29760 3294 3287 666 607 12938 ...
##  $ SATV   : int  470 438 445 470 419 456 430 433 409 418 ...
##  $ SATM   : int  514 476 497 511 484 513 471 470 441 466 ...
##  $ percent: int  8 42 25 6 45 28 74 58 68 44 ...
##  $ dollars: num  3.65 7.89 4.23 3.33 4.83 ...
##  $ pay    : int  27 43 30 23 39 31 43 35 39 30 ...
dim(States)
## [1] 51  8
head(States)
##    X region   pop SATV SATM percent dollars pay
## 1 AL    ESC  4041  470  514       8   3.648  27
## 2 AK    PAC   550  438  476      42   7.887  43
## 3 AZ    MTN  3665  445  497      25   4.231  30
## 4 AR    WSC  2351  470  511       6   3.334  23
## 5 CA    PAC 29760  419  484      45   4.826  39
## 6 CO    MTN  3294  456  513      28   4.809  31

Visualize and fit a regression line

# correlation
cor(States$dollars, States$SATM)
## [1] -0.4844477
fit <- lm(SATM ~ dollars, data = States)
plot(States$dollars, States$SATM)
abline(fit)

fit
## 
## Call:
## lm(formula = SATM ~ dollars, data = States)
## 
## Coefficients:
## (Intercept)      dollars  
##      560.37       -12.17

Linear Regression Model

There is an inverse relationship between the SATM and dollars variables, and the correlation is moderate at best (r = -0.48). The fitted line slopes downward, which suggests that states that spend less per student on public education tend to have higher average SATM scores. This is an association in the data, not evidence that spending lowers scores.

Based on the model, the y-intercept is 560.37 and the slope is -12.17:

SATM = 560.37 - 12.17 * dollars
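
To make the slope concrete, here is a minimal sketch that evaluates the fitted equation at a few spending levels. The helper name predict_satm and the chosen spending values are illustrative assumptions; the coefficients are the rounded values reported above.

```r
# Coefficients copied from the fitted model output (rounded).
intercept <- 560.37
slope <- -12.17

# Hypothetical helper: predicted average SATM for a spending level
# given in $1000s per student.
predict_satm <- function(dollars) {
  intercept + slope * dollars
}

# Illustrative spending levels of $4,000, $5,000 and $6,000 per student.
predict_satm(c(4, 5, 6))
```

For example, spending of $5,000 per student gives 560.37 - 12.17 * 5 = 499.52, and each additional $1,000 per student lowers the predicted score by about 12 points.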

Evaluating the Quality of the Model

summary(fit)
## 
## Call:
## lm(formula = SATM ~ dollars, data = States)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -70.72 -17.85  -1.98  16.48  75.51 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  560.374     16.801  33.353  < 2e-16 ***
## dollars      -12.169      3.139  -3.876 0.000315 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 30.55 on 49 degrees of freedom
## Multiple R-squared:  0.2347, Adjusted R-squared:  0.2191 
## F-statistic: 15.03 on 1 and 49 DF,  p-value: 0.0003155

The p-value for the dollars coefficient is low (0.000315), so the slope differs significantly from zero and dollars is a statistically significant predictor in this model. The reported R-squared of 0.2347 means that the model explains 23.47 percent of the variation in SATM.
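
The numbers in the summary are tied together, and a short sketch can check them by hand. It assumes only the rounded values printed by cor() and summary(fit), so the results agree with the output up to rounding. The confidence interval at the end is an addition for illustration, not part of the original output.

```r
# Values copied from cor() and summary(fit).
r        <- -0.4844477  # correlation between dollars and SATM
estimate <- -12.169     # slope estimate
std_err  <- 3.139       # standard error of the slope
df_resid <- 49          # residual degrees of freedom

r_squared <- r^2                   # with one predictor, R-squared equals r squared
t_value   <- estimate / std_err    # t statistic for the slope
f_value   <- t_value^2             # with one predictor, F equals t squared
p_value   <- 2 * pt(-abs(t_value), df = df_resid)  # two-sided p-value

# 95% confidence interval for the slope (not shown in the summary output).
ci <- estimate + c(-1, 1) * qt(0.975, df = df_resid) * std_err

round(c(r_squared = r_squared, t = t_value, F = f_value), 4)
p_value
ci
```

The interval lies entirely below zero, which is consistent with the small p-value for the slope.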

Residual Analysis

par(mfrow=c(2,2)) # Change the panel layout to 2 x 2
plot(fit)

par(mfrow=c(1,1)) # Change back to 1 x 1

The Residuals vs Fitted plot shows residuals scattered roughly evenly above and below the horizontal line. The Normal Q-Q plot is not too concerning, with most points falling along the line. The Scale-Location plot may show a pattern, with more points on the right side of the plot. The Residuals vs Leverage plot suggests that the 41st observation may be a problematic point.
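
Reading influential points off a plot is subjective, so a numeric check can help. The sketch below is one possible approach, not part of the original analysis: the helper name flag_influential is hypothetical, and the 4/n cutoff for Cook's distance is a common rule of thumb rather than a formal test.

```r
# Hypothetical helper: list observations whose Cook's distance exceeds 4/n.
# The 4/n cutoff is an assumption (a rule of thumb), not a formal test.
flag_influential <- function(model) {
  cooks   <- cooks.distance(model)
  cutoff  <- 4 / length(cooks)
  flagged <- which(cooks > cutoff)
  data.frame(
    obs            = unname(flagged),                    # row position in the data
    cooks_distance = unname(cooks[flagged]),
    leverage       = unname(hatvalues(model)[flagged])   # diagonal of the hat matrix
  )
}

# Apply to the model from this analysis when it is in the workspace.
if (exists("fit")) flag_influential(fit)
```

Any rows returned can then be compared with the points labelled on the Residuals vs Leverage plot.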

Overall, adding more variables and/or applying a transformation may give a better fit that explains more of the variation in SATM.
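
As a starting point for that follow-up, the sketch below refits the model with one extra predictor and compares it with the single-predictor model. The choice of percent as the added variable and the helper name compare_models are assumptions for illustration. No results are shown because this comparison was not run in the analysis above.

```r
# Hypothetical helper: compare the single-predictor model with one that
# also includes percent. Expects columns SATM, dollars and percent.
compare_models <- function(data) {
  base     <- lm(SATM ~ dollars, data = data)
  extended <- lm(SATM ~ dollars + percent, data = data)
  list(
    adj_r_squared = c(
      base     = summary(base)$adj.r.squared,
      extended = summary(extended)$adj.r.squared
    ),
    anova = anova(base, extended)  # F test for the added term
  )
}

# Apply to the data from this analysis when it is in the workspace.
if (exists("States")) compare_models(States)
```

A higher adjusted R-squared together with a small p-value in the anova table would indicate that the extra variable improves the fit.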