States.csv
Education and Related Statistics for the U.S. States
The States data frame has 51 rows and 8 columns. The observations are the U.S. states and Washington, D.C.
region U.S. Census regions. A factor with levels: ENC, East North Central; ESC, East South Central; MA, Mid-Atlantic; MTN, Mountain; NE, New England; PAC, Pacific; SA, South Atlantic; WNC, West North Central; WSC, West South Central.
pop Population: in 1,000s.
SATV Average score of graduating high-school students in the state on the verbal component of the Scholastic Aptitude Test (a standard university admission exam).
SATM Average score of graduating high-school students in the state on the math component of the Scholastic Aptitude Test.
percent Percentage of graduating high-school students in the state who took the SAT exam.
dollars State spending on public education, in $1000s per student.
pay Average teacher’s salary in the state, in $1000s.
I’ll fit a linear regression of the average SAT math score (SATM) as a function of state spending on public education (dollars).
# Read the data
States <- read.csv("https://raw.githubusercontent.com/L-Velasco/DATA605_SP19/master/HW/States.csv", stringsAsFactors = FALSE)
str(States)
## 'data.frame': 51 obs. of 8 variables:
## $ X : chr "AL" "AK" "AZ" "AR" ...
## $ region : chr "ESC" "PAC" "MTN" "WSC" ...
## $ pop : int 4041 550 3665 2351 29760 3294 3287 666 607 12938 ...
## $ SATV : int 470 438 445 470 419 456 430 433 409 418 ...
## $ SATM : int 514 476 497 511 484 513 471 470 441 466 ...
## $ percent: int 8 42 25 6 45 28 74 58 68 44 ...
## $ dollars: num 3.65 7.89 4.23 3.33 4.83 ...
## $ pay : int 27 43 30 23 39 31 43 35 39 30 ...
dim(States)
## [1] 51 8
head(States)
## X region pop SATV SATM percent dollars pay
## 1 AL ESC 4041 470 514 8 3.648 27
## 2 AK PAC 550 438 476 42 7.887 43
## 3 AZ MTN 3665 445 497 25 4.231 30
## 4 AR WSC 2351 470 511 6 3.334 23
## 5 CA PAC 29760 419 484 45 4.826 39
## 6 CO MTN 3294 456 513 28 4.809 31
# correlation
cor(States$dollars, States$SATM)
## [1] -0.4844477
fit <- lm(SATM ~ dollars, data = States)
plot(States$dollars, States$SATM)
abline(fit)
fit
##
## Call:
## lm(formula = SATM ~ dollars, data = States)
##
## Coefficients:
## (Intercept) dollars
## 560.37 -12.17
There appears to be an inverse relationship between SATM and dollars, with a weak-to-moderate correlation of about -0.48. The fitted line slopes downward, which suggests that average SATM scores tend to be higher in states that spend less on public education.
Based on the model, the y-intercept is 560.37 and the slope is -12.17:
SATM = 560.37 - 12.17 * dollars
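As a quick illustration of how to read the fitted equation, the sketch below plugs a few spending levels into it. The coefficients are the rounded values reported above. The helper name predict_satm and the spending levels are hypothetical, chosen only for illustration.

```r
# Rounded coefficients from the fitted model above
intercept <- 560.37
slope <- -12.17

# Hypothetical helper: predicted average SATM for spending in $1000s per student
predict_satm <- function(dollars) intercept + slope * dollars

# Illustrative spending levels (not taken from any particular state)
spending <- c(4, 5, 6)
predicted <- predict_satm(spending)
print(data.frame(dollars = spending, SATM_hat = predicted))
```

With the model object itself, predict(fit, newdata = data.frame(dollars = c(4, 5, 6))) should give equivalent predictions from the unrounded coefficients.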
summary(fit)
##
## Call:
## lm(formula = SATM ~ dollars, data = States)
##
## Residuals:
## Min 1Q Median 3Q Max
## -70.72 -17.85 -1.98 16.48 75.51
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 560.374 16.801 33.353 < 2e-16 ***
## dollars -12.169 3.139 -3.876 0.000315 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 30.55 on 49 degrees of freedom
## Multiple R-squared: 0.2347, Adjusted R-squared: 0.2191
## F-statistic: 15.03 on 1 and 49 DF, p-value: 0.0003155
With a p-value of 0.000315, the dollars coefficient is statistically significant: a slope this far from zero would be very unlikely if there were no linear relationship between dollars and SATM. The reported R-squared of 0.2347 means that the model explains 23.47 percent of the variation in SATM.
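In a regression with a single predictor, the figures in the summary are tied together, and a short check makes that concrete. The sketch below uses only numbers printed in the output above. The variable names are illustrative.

```r
# Values copied from the output above
r <- -0.4844477        # cor(States$dollars, States$SATM)
estimate <- -12.169    # slope estimate for dollars
std_error <- 3.139     # standard error of the slope
df_resid <- 49         # residual degrees of freedom

# With one predictor, R-squared equals the squared correlation
r_squared <- r^2

# The t value is the estimate divided by its standard error
t_value <- estimate / std_error

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_value), df = df_resid)

# With one predictor, the F statistic equals the squared t value
f_stat <- t_value^2

print(round(c(r_squared = r_squared, t_value = t_value, f_stat = f_stat), 4))
print(p_value)
```

Up to rounding of the inputs, these should reproduce the R-squared, t value, F-statistic, and p-value reported by summary(fit).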
par(mfrow=c(2,2)) # Change the panel layout to 2 x 2
plot(fit)
par(mfrow=c(1,1)) # Change back to 1 x 1
The Residuals vs Fitted plot shows residuals scattered roughly evenly above and below the horizontal line. The Normal Q-Q plot is not too concerning, with most points falling along the line. The Scale-Location plot suggests a possible pattern, with more points on the right side of the plot. The 41st observation may be a problematic point, as seen in the Residuals vs Leverage plot.
Overall, adding more variables and/or transforming variables may give a better fit that explains more of the variation in SATM.
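One way to follow up is to add a second predictor such as percent. The sketch below is only an illustration of the syntax. To stay self-contained it uses the six rows shown by head(States) above rather than the full data, so its estimates should not be interpreted. The names States_head and fit2 are hypothetical.

```r
# First six rows, copied from the head(States) output above (illustration only)
States_head <- data.frame(
  SATM    = c(514, 476, 497, 511, 484, 513),
  percent = c(8, 42, 25, 6, 45, 28),
  dollars = c(3.648, 7.887, 4.231, 3.334, 4.826, 4.809)
)

# Same model as before, plus percent as a second predictor
fit2 <- lm(SATM ~ dollars + percent, data = States_head)

# Coefficient names confirm which terms are in the model
print(names(coef(fit2)))
```

On the full data, the same formula with data = States, followed by summary(fit2) and anova(fit, fit2), would indicate whether the extra variable improves the fit.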