Eugenia Theodosopoulos
Statistics
10/29/19
The data I analyzed was Total Group Mean SAT Scores of College-Bound Seniors from 1972-2014 in California. The data was obtained from the SAT score archive on the College Board website. It represents the mean scores for college-bound seniors in California between 1972 and 2014, separated into three subjects (Critical Reading, Math, and Writing), each of which is separated into three categories (Male, Female, and Total). According to the College Board, “Because the accuracy of self-reported information has been documented and the college-bound population is relatively stable from year to year, SAT Questionnaire responses from these students can be considered highly accurate” (www.research.collegeboard.org).
The data was presented as a table in the exported PDF, and I converted it into Excel. The rows are the years from 1972 to 2014, and the columns are the subjects (Critical Reading, Math, and Writing) separated by gender (or total). The sample was medium sized, with 258 data points. There are no units, as the values are simply scores. All the data is from college-bound high school seniors, and each student is counted only once, no matter how many times they took the test, because only each student's most recent score is included.
One possible limitation of the data set is that it does not capture how much preparation each student had before testing, which can be linked to socioeconomic status as well as location. Since this data is only mean scores per year per gender, there is no breakdown of level of preparation, access to practice materials, or other societal factors that may affect performance. The variables I chose to examine were male SAT math scores and female SAT math scores. I chose these in order to see whether the two are correlated and how an increase in one relates to an increase in the other.
This data is important to analyze as it provides insight into a process that almost every teenager endures. While the SAT is only one of the many standardized tests one might take, and the sample in this data covers a fixed span of years, the data is still valuable. It is often debated whether there is a marked difference between female and male performance in school, specifically in math, and data such as this allows us to begin to explore that question. With more data to compare, one might be able to draw a larger picture of the potential causes of the correlations discovered in this particular data set.
Scatterplots, Association, and Correlation
This is the distribution of each variable on its own:
Below are numerical summaries of Male SAT Math Scores followed by Female SAT Math Scores.
summary(SAT_Stat_Data$`Math Male`)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 515.0 520.5 525.0 525.9 532.0 538.0
summary(SAT_Stat_Data$`Math Female`)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 473.0 479.5 489.0 488.4 499.0 504.0
When examining this numerical data, I noticed the difference between the ranges of scores. The min and max of female scores are 473 and 504, respectively, and the min and max of male scores are 515 and 538, respectively. Since each value is a yearly mean, this means the highest yearly mean for female test takers was below the lowest yearly mean for male test takers, so the two ranges do not overlap. Within each data set, the mean and median are almost identical (male median: 525, male mean: 525.9) (female median: 489, female mean: 488.4). This suggests that both distributions are roughly symmetric, with little or no skew.
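The two comparisons in the paragraph above can be reproduced directly from the printed summaries. This is only a sketch: the numbers are copied from the `summary()` output, and the vector names `male` and `female` are my own labels, not objects from the original analysis.

```r
# Values copied from the summary() output above (yearly mean SAT math scores)
male   <- c(min = 515.0, q1 = 520.5, median = 525.0, mean = 525.9, q3 = 532.0, max = 538.0)
female <- c(min = 473.0, q1 = 479.5, median = 489.0, mean = 488.4, q3 = 499.0, max = 504.0)

# Gap between the lowest male yearly mean and the highest female yearly mean;
# a positive value means the two ranges do not overlap
range_gap <- unname(male["min"] - female["max"])

# Mean minus median for each group; values close to zero are consistent
# with a roughly symmetric distribution
mean_minus_median <- c(
  male   = unname(male["mean"] - male["median"]),
  female = unname(female["mean"] - female["median"])
)

print(range_gap)
print(round(mean_minus_median, 1))
```

The gap is 11 points, and the mean differs from the median by less than one point in both groups.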
Below is the plot of the Male SAT Math Scores and Female SAT Math Scores, plotted by R without removing the outliers. Although this plot is technically “including outliers,” the data itself has no outliers. An outlier is a data point that is much larger or smaller than the rest of the data and falls a visible distance away from the majority of the points on a plot. In this case, an outlier would be an exceptionally high or low mean math score for either gender. I determined that there were no outliers using R functions, and it is also visible in the plots. I plotted two graphs, one “with outliers” and one “without,” and the two were identical. This is why I chose to include only one of the two.
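One common numerical check for outliers is the 1.5 × IQR (boxplot) rule. The report does not say which R functions were used, so the rule and the helper name `check_tukey_outliers` below are assumptions for illustration. The quartiles are the ones printed by `summary()`, which can differ slightly from the hinges that `boxplot()` uses.

```r
# Hypothetical helper: applies the 1.5 * IQR rule to summary statistics.
# If the minimum and maximum both lie inside the fences, no value is flagged.
check_tukey_outliers <- function(min_value, q1, q3, max_value) {
  iqr <- q3 - q1
  lower_fence <- q1 - 1.5 * iqr
  upper_fence <- q3 + 1.5 * iqr
  list(
    lower_fence = lower_fence,
    upper_fence = upper_fence,
    any_outlier = min_value < lower_fence || max_value > upper_fence
  )
}

# Quartiles, minima, and maxima copied from the summary() output
male_check   <- check_tukey_outliers(515.0, 520.5, 532.0, 538.0)
female_check <- check_tukey_outliers(473.0, 479.5, 499.0, 504.0)

print(male_check$any_outlier)
print(female_check$any_outlier)
```

Under this rule the male fences are 503.25 and 549.25 and the female fences are 450.25 and 528.25, so every yearly mean falls inside them.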
The direction of the plot is positive, and the shape is linear as all of the data points fall roughly along one very clear line. I would assume the strength of the correlation to be very strong.
I also used a different graphing package, ggplot, to make a scatterplot. In the rest of this analysis both methods (base R and ggplot) will be provided for visualizing bivariate data. Here is the same scatterplot as above using ggplot.
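For reference, the two plotting approaches might look roughly like the sketch below. The column names `Math Female` and `Math Male` come from the code in this report. The function names, the axis labels, and the choice to put female scores on the x-axis (to match the later regression) are my assumptions.

```r
# Base R scatterplot of yearly mean male scores against yearly mean female scores
plot_scores_base <- function(data) {
  plot(
    data[["Math Female"]], data[["Math Male"]],
    xlab = "Mean female SAT math score",
    ylab = "Mean male SAT math score",
    main = "Male vs. female SAT math scores"
  )
  invisible(NULL)
}

# The same scatterplot with ggplot2 (the package must be installed)
plot_scores_ggplot <- function(data) {
  if (!requireNamespace("ggplot2", quietly = TRUE)) {
    stop("The ggplot2 package is required for this plot.")
  }
  ggplot2::ggplot(data, ggplot2::aes(x = `Math Female`, y = `Math Male`)) +
    ggplot2::geom_point() +
    ggplot2::labs(
      x = "Mean female SAT math score",
      y = "Mean male SAT math score"
    )
}

# Usage, assuming SAT_Stat_Data has been loaded as in the rest of this report:
# plot_scores_base(SAT_Stat_Data)
# print(plot_scores_ggplot(SAT_Stat_Data))
```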
Before moving on I made sure to check the conditions of whether my methods of analysis are appropriate. The correlation coefficient is appropriate because the underlying relationship is linear. The Straight Enough Condition is met because the scatterplot shows that the data follows a generally straight line. Additionally, there are no outliers (found through a numerical analysis but can also be seen visually in the graph).
I believe the conditions for linear regression are met by this data set, as it shows a linear association. I believe it to be a strong positive linear correlation.
The correlation coefficient:
## [1] 0.972051
The correlation coefficient indicates that the relationship between the two variables is positive and strong.
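The call that produced this value is not shown above. A minimal sketch, assuming it was computed with base R's `cor()` on the two math columns, could look like this (the function name `score_correlation` is hypothetical):

```r
# Pearson correlation between yearly mean female and male SAT math scores.
# use = "complete.obs" drops any year where either value is missing.
score_correlation <- function(data) {
  cor(data[["Math Female"]], data[["Math Male"]], use = "complete.obs")
}

# Usage, assuming SAT_Stat_Data has been loaded as in the rest of this report:
# score_correlation(SAT_Stat_Data)
```

Pearson's r is symmetric, so swapping the two columns gives the same value.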
Fitting a linear regression model
I created a linear model in order to summarize the relationship between my quantitative variables, since it was visibly straight.
Below is a model using the Female SAT Math Scores (independent variable) to explain the Male SAT Math Scores (response variable).
# fit a linear regression of male SAT math scores on female SAT math scores
linear_reg<-lm(SAT_Stat_Data$`Math Male`~SAT_Stat_Data$`Math Female`)
summary(linear_reg)
##
## Call:
## lm(formula = SAT_Stat_Data$`Math Male` ~ SAT_Stat_Data$`Math Female`)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -2.7154 -1.3793 -0.0433  0.9259  3.3154
##
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)
## (Intercept)                 212.74046   11.81583   18.00   <2e-16 ***
## SAT_Stat_Data$`Math Female`   0.64123    0.02419   26.51   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.624 on 41 degrees of freedom
## Multiple R-squared: 0.9449, Adjusted R-squared: 0.9435
## F-statistic: 702.9 on 1 and 41 DF, p-value: < 2.2e-16
Here is a visualization of the regression line using ggplot.
The regression line slopes upward: as x increases, the predicted y increases. The positive slope corroborates this. The slope is 0.64123, which means that for every one-unit increase in x, the predicted y value goes up by 0.64123. In the context of these variables, a one-point increase in the mean female SAT math score is associated with an increase of about 0.64 points in the mean male SAT math score. One might infer from this that as female scores increase, male scores increase at a lower rate. The r value is 0.972051, which means that the linear association is very strong, so the line summarizes the data well. The data points all fall very close to the line, which corroborates this. The R^2 value is 0.9449. R^2 is the proportion of the variation in y that is explained by the least squares regression line of y on x; here it measures how much of the variation in male SAT math scores is accounted for by their linear relationship with female SAT math scores. The R^2 value is very high, at about 94%. This means that about 94% of the variation in male SAT math scores can be explained by the linear model. This also corroborates the previous conclusions drawn from the r value and slope.
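To make the slope and R^2 concrete, here is a small sketch that uses only numbers reported in this document: the fitted coefficients, the r value, and the 2014 means (499 female, 530 male) cited in the conclusion. The helper name `predict_male_mean` is my own.

```r
# Coefficients copied from the regression output above
intercept <- 212.74046
slope     <- 0.64123

# Predicted mean male math score for a given mean female math score
predict_male_mean <- function(female_mean) {
  intercept + slope * female_mean
}

# A one-point difference in the female mean changes the prediction by the slope
one_point_change <- predict_male_mean(500) - predict_male_mean(499)

# Prediction and residual for 2014, using the means cited in the conclusion
predicted_2014 <- predict_male_mean(499)
residual_2014  <- 530 - predicted_2014

# In a regression with one predictor, R^2 equals r squared
r <- 0.972051
r_squared <- r^2

print(round(one_point_change, 5))
print(round(predicted_2014, 2))
print(round(residual_2014, 2))
print(round(r_squared, 4))
```

The 2014 prediction is about 532.71, so the observed male mean of 530 sits about 2.71 points below the line. Squaring r gives 0.9449, which matches the Multiple R-squared in the output.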
Assessing fit: Model diagnostics
In order to assess the fit of my model, I produced residual plots, since residuals reveal how well the model works. Since the assumption when using a linear model is that the relationship between my variables is a straight line, I wanted to examine the parts of the data that are not captured by my linear model (the residuals).
The first plot (below) is a scatterplot of the fitted/predicted values (aka the male SAT math scores that the model predicts for each observation of female SAT math scores) against the residuals. This graph is to ensure that the residuals are not producing a non-linear pattern.
In this graph, if there is no non-linear pattern, the data points should be scattered randomly and the red line should be roughly flat. However, on the graph produced for my variables, the line curves down heavily in the center of the data. Comparing this to my regression plot, the data points in that region fall below the regression line. This means that for female scores of roughly 485-495, the observed male scores are lower than the model predicts. At both ends of the graph, the red line curves upwards. This can also be observed in the plot of the regression line, as there are lone data points above the line at both of its ends.
Next, I produced a normal QQ plot (below). This graph shows whether the residuals are normally distributed, based on whether or not they fall along the dotted line. In the plot of this data, the residuals fall very well along the dotted line, except for the small tail at the bottom left, which curves above the line. This suggests that the distribution of the residuals is very slightly right skewed.
The last graph I produced was the plot of standardized residuals versus leverage (below). This plot identifies influential points, meaning observations that have a large effect on the fitted regression line; outliers with high leverage are the usual example. On this kind of plot, dashed contour lines mark Cook’s distance. In my plot no such line is visible, because no data point comes close to the contours. Points that fall outside the Cook’s distance contours are influential to the regression, and removing them would noticeably change the fitted line. Since none of my data points fall outside Cook’s distance and the data has no outliers, this graph does not reveal any new information.
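The three diagnostic plots described in this section correspond to panels 1, 2, and 5 of base R's `plot()` method for `lm` objects. The sketch below assumes that is how they were drawn. The function names and the Cook's distance cutoff of 0.5 (the inner contour R draws by default) are my assumptions, not something stated in the report.

```r
# Draws the residuals vs. fitted, normal Q-Q, and residuals vs. leverage plots
# for a fitted lm object, and returns Cook's distances invisibly.
plot_model_diagnostics <- function(model) {
  old_par <- par(mfrow = c(1, 3))
  on.exit(par(old_par))
  plot(model, which = 1)  # residuals vs. fitted values
  plot(model, which = 2)  # normal Q-Q plot of standardized residuals
  plot(model, which = 5)  # standardized residuals vs. leverage
  invisible(cooks.distance(model))
}

# Indices of observations whose Cook's distance exceeds the chosen cutoff
influential_points <- function(model, threshold = 0.5) {
  which(cooks.distance(model) > threshold)
}

# Usage, assuming linear_reg is the model fitted earlier in this report:
# plot_model_diagnostics(linear_reg)
# influential_points(linear_reg)
```

An empty result from `influential_points()` would agree with the reading above that no single year drives the fitted line.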
Through an examination of the data, I concluded that there is a strong positive correlation between mean female SAT math scores and mean male SAT math scores: in years when the female mean was higher, the male mean was higher as well. The slope of 0.64123 means that as the female SAT math score goes up by one point, the male score goes up by only about 0.64. This suggests that the female scores are increasing at a higher rate than the male scores are. However, the minimum of the male data is 515 and the maximum of the female data is 504, so at no point does the data overlap. This means that while the female scores have increased at a more rapid rate, they have not caught up to the male scores. In fact, in 2014 (the latest year in the data set available), the mean male SAT math score is 530 and the mean female SAT math score is 499. Even if the slope suggests that female scores are increasing more rapidly than male scores, the difference between the most recent scores (in this data set) is still a considerable 31 points.
The r value is 0.972051, which means that the linear correlation between the variables is very strong and the regression line describes the data well. The R^2 value is 0.9449. R^2 is the proportion of the variation in y that is explained by the least squares regression line of y on x, which here means how much of the variation in male SAT math scores is accounted for by female SAT math scores. The R^2 value is very high, at about 94%. This means that about 94% of the variation in male SAT math scores can be explained by the linear model.
I would be curious to compare this data to math grades received by college-bound seniors in California from 1972-2014. While the curriculum in schools varies, the SAT is standardized and remains the same. Thus, I would be curious to see whether the standardization of the test and scores is the cause of such a strong linear correlation, or whether female and male math knowledge would show the same relationship without being standardized. Broader knowledge could also assist in drawing conclusions about whether males are actually more knowledgeable about math than females or are merely better at standardized testing. It could also help in understanding whether female math knowledge has actually increased at a higher rate than male knowledge over the years, or whether female students have just become better at standardized testing.