Module 1 Exam
Section 1: Introduction
The purpose of the document is to illustrate how Ohio colleges and universities compare to colleges and universities in other states. The data used in this document is only a subset of all college scorecard data available for the 2018-2019 reporting period and is from the US Department of Education, which is one of the many federal organizations responsible for collecting and reporting data on institutions of higher education. Part of the data collected and reported by the department of education includes all institutions of post-secondary education receiving some form of financial aid, and various statistics about each institution.
The analysis in this document is, in hope, to help an individual to better understand how institutions of post-secondary education in Ohio compare to other schools regionally and nationally. For example, how the cost of attendance varies by family income for all institutions, how Ohio compares to neighboring states with respect to the percent of institutions that are HBCU’s, men only or women only, how the average family income of Xavier University students compares nationally and within Ohio, and how Xavier University of Ohio compares with the other ‘Xavier’ schools in the country.
Section 2: Required Packages
The packages required for this markdown are:
| Package | Summary |
|---|---|
| tidyverse | The tidyverse collection of packages |
| rmdformats | RMarkdown themes |
| knitr | RMarkdown documents |
| DT | Javascript enabled data tables |
| stargazer | Fancy regression tables |
| corrplot | Simple correlation plots |
| PerformanceAnalytics | Detailed plots and tables for analytics |
| pander | Pretty summary tables output |
| lubridate | Working with date formats |
| scales | Providing methods for breaks and labels for axes |
Section 3: Data Preparation (Cleaning & Wrangling)
Source Data Explained
Data was imported from http://asayanalytics.com/scorecard_csv. After cleaning & wrangling, there are a total of 7,115 entries and 34 columns in the scorecard data frame, where 5 columns are added columns for the ease of further analysis. There are 3 parts to data cleaning & wrangling: Data Conformity, Blatant Data Errors, and Dummy Variables.
Data conformity: - Missing values are recorded as “0” since data type of the fields is numeric. - Character values recorded as “NULL” and “Not Available” are also replaced with “0” and then converted to numeric values.
Blatant Data Errors: - Anomalies and errors are addressed. For example, percenteages should never exceed 100% or be negative, and any number except longitude should not be negative.d Any rows with such values would be removed. However, there are not such values in this subset of data, so no rows were removed. - Control of institution where there are multiple variations on 3 possible values (1 - Public, 2 - Private nonprofit, and 3 - Private for-profit). Entries with character values are converted into appropriate number of control, and the column is converted into an appropriate data type, numeric. - Other fields that were character type are also converted into numeric as they should be, including FAMINC, UGDS, CONTROL, and HBCU. Data type of ZIP entity is also converted from numeric to character data type.
Dummy Variables: - A dummy variable for average family income of an institution whose income is greater than the median of household income for Ohio in 2019, which is $54,021. Values are recorded as TRUE if average family income is higher than the median of Ohio household income in 2019, otherwise recorded as FALSE. - Dummy variables indicating whether the institution is a ‘University,’ ‘College,’ or ‘Other’ based on the words used in the institution name. Values are recorded as TRUE if name of institution matches any of the above condition. - A dummy variable for institutions in states bordering Ohio. Values are recorded as TRUE if institutions are in states bordering Ohio.
Below is a summary table explaining each variable in the data set.
| Variable Name | Explanation |
|---|---|
| ID | Unique ID for institution |
| INSTNM | Institution Name |
| CITY | City of institution |
| STABBR | State postcode |
| ZIP | ZIP code |
| CONTROL | Control of institution (1 - Public, 2 - Private nonprofit, 3 - Private for-profit) |
| LOCALE | Locale of institution |
| LATITUDE | Latitude |
| LONGITUDE | Longitude |
| HBCU | Flag for Historically Black College and University |
| MENONLY | Flag for men-only institution |
| WOMENONLY | Flag for women-only institution |
| ADM_RATE | Admission rate |
| ACTCM25 | 25th percentile of the ACT cumulative score |
| ACTCM75 | 75th percentile of the ACT cumulative score |
| ACTCMMID | Midpoint of the ACT cumulative score |
| SAT_AVG | Average SAT equivalent score of students admitted |
| UGDS | Enrollment of undergraduate certificate/degree-seeking students |
| COSTT4_A | Average cost of attendance (academic year institutions) |
| AVGFACSAL | Average faculty salary |
| PCTPELL | Percentage of undergraduates who receive a Pell Grant |
| PCTFLOAN | Percentage of all undergraduates students receiving a federal student loan |
| AGE_ENTRY | Average age of entry |
| FEMALE | Share of female students |
| MARRIED | Share of marrie students |
| DEPENDENT | Share of dependent students |
| VETERAN | Share of veteran students |
| FIRST_GEN | Share of first-generation students |
| FAMINC | Average family income in real 2015 dollars |
| Above MED | TRUE if FAMINC is above the median of Ohio household income of $54,021 |
| University | TRUE if an institution name contains the word ‘University’ |
| College | TRUE if an institution name contains the word ‘College’ |
| Other | TRUE if an institution name contains neither the word ‘University’ nor ‘College’ |
| BORDEROH | TRUE is an institution is in stated bordering Ohio. |
Data Table
Below is a sample data table of the Institutions Scorecard.
Summary Information
Summary 1
Below are summary tables summarizing the overall average cost of attendance of universities and colleges. It shows that University has higher average cost of attendance than College
| “College” | Average Cost of Attendance |
|---|---|
| College | 23065 |
| “University” | Average Cost of Attendance |
|---|---|
| University | 30373 |
Summary 2
The tables below shows institutions with highest midpoint of the ACT cumulative score (ACTCMMID) and another with highest average SAT equivalent score of students admitted (SAT_AVG). I found the result interesting because I anticipated that the scores would be higher in top institution like MIT and Harvard. However, the result shows that the institution with the both highest of ACTCMMID and SAT_AVG is California Institute of Technology.
| Institution Name | Midpoint of ACT cum score |
|---|---|
| California Institute of Technology | 35 |
| Carnegie Mellon University | 34 |
| Franklin W Olin College of Engineering | 34 |
| Harvard University | 34 |
| Harvey Mudd College | 34 |
| Institution Name | SAT Score |
|---|---|
| California Institute of Technology | 1558 |
| Rice University | 1535 |
| Massachusetts Institute of Technology | 1532 |
| Harvey Mudd College | 1527 |
| Johns Hopkins University | 1524 |
Summary 3
This last table shows institutions in Ohio whose average family income (as recorded in 2015 dollars) is higher than the median of Ohio in 2019. The result shows that Xavier University is ranked number 4 on the list, with average family income in 2015 of $114,330, which is more than twice as high as the median of Ohio household income in 2019.
| Institution Name | Average Family Income |
|---|---|
| Kenyon College | 137788 |
| University of Dayton | 134621 |
| Oberlin College | 130188 |
| Xavier University | 114330 |
| John Carroll University | 111360 |
| Case Western Reserve University | 109858 |
| The College of Wooster | 107065 |
| Cleveland Institute of Music | 104656 |
| Denison University | 103917 |
| Franciscan University of Steubenville | 103566 |
Section 4: Simple Analysis & Trends
Visualization 4.1
The bar chart below shows the number of institutions in Ohio and compare it to the number of institutions in the states bordering Ohio. It shows that Pennsylvania has the most number of institutions, with 378 institutions total.
Visualization 4.2
Point graph below illustrates how the cost of attendance varies by family income for all institutions. The illustration shows that the higher the average family income, the higher the cost of attendance an institution gets since family has enough income to pay for more expensive institution.
What is interesting about this is the two outliers at the top and one on the right. One outlier is where cost of attendance is very high but the average family income is less than $50,000 in 2015. This may indicate that the institution give out many scholarships and grants to students. Another outlier is where the average family income is at the highest out of all institutions but cost of attendance for the school is much cheaper than the first outlier.
Visualization 4.3
The bar chart below compares the total number of undergraduates at all institutions attending each of the 3 institutional control types. It shows that public institutions have the most number of undergraduate students attending.
Visualization 4.4
Faceted point graphs below show the relationship between SAT test scores and average family income for Ohio and each of the states that border Ohio. The visualization shows that the higher the average family income, the higher the average of SAT score seems to be. From the chart, West Virginia and Pennsylvania are the two states that are on the opposite extreme - there seem to be a lot more institutions in PA than in WV, plus, more rich family in PA than in WV.
Visualization 4.5
How does Ohio compare to neighboring states with respect to the percent of institutions that are HBCU’s, men-only or women-only?
The bar chart below shows a number of institutions in Ohio and bordering states that are HBCU’s, men-only, or women-only. The result shows that Ohio has the second most number of such institutions, the first is PA.
Note that some institutions are flagged as both HBCU and Men-only. This may create a biased result in the chart below.
Section 5: Directed Analysis
5.1
Question: Do you find support for the old adage: “Private schools cost more than public schools”?
From the bar chart below, I do find support for the old adage: “Private schools cost more than public schools.” The chart shows the average cost of attendance of all institutions in each category. It shows that the average cost of attendance of private nonprofit schools is $26,264, which is more than twice higher than the average cost of attendance of public institutions.
5.2
Question: How does the average family income of Xavier University students compare nationally and within Ohio?
From the tables below, comparing with other institutions within Ohio and nationally, the average family income of Xavier University students, $114,330, is much higher, whereas the average family income within Ohio and nationally are only 39,848 and 35,778 dollars, respectively.
| Institution Name | Average Family Income |
|---|---|
| Xavier University | $114,330 |
| State | Average Family Income |
|---|---|
| OH | $39,848 |
| “Nationally” | Average Family Income |
|---|---|
| Nationally | $35,778 |
5.3
Question: How much variance exists in the cost of attending an Ohio college or university? How does this variance compare to colleges and universities in the states that border Ohio?
Variance in the cost of attending an Ohio college or university is $191,608,900, which is similar to the variance in the cost of attending colleges and universities in Indiana and Michigan, whereas the variances in Kentucky and West Virginia are lower (with WV being the lowest), and the variance is highest in Philadelphia, 264,049,015 dollars.
5.4
Question: What types of schools have the highest and lowest percentage of students receiving Pell grants?
As shown in the bar chart below, the type of school that has the highest percentage of students receiving Pell grants is private for-profit institutions with 60.99%, whereas the type with lowest percentage is private nonprofit institutions, with 41.26 percents of students receiving Pell grants.
5.5
Question: Do schools with wealthier student families and high dependency rates have a higher or lower average age of entry?
From the points graph below, the higher the average student family income and higher dependency rates, the lower the average of entry.
5.6
Question: Does there appear to be a relationship between the tuition charged by the institution and the average faculty salary?
Yes, there appears to be a relationship between the tuition charged by the institution and the average faculty salary. From the jitter plot below, the higher the cost of attendance, the higher the average faculty salary.
Section 6: Self-Directed Analysis
6.1
Instruction: Compare the relationship between the average cost of attendance, the percent of students receiving a Pell grant, the percent of students receiving a federal student loan and the average family income in whatever way you choose. If one of these variables was thought to cause the others, which would you say it is?
Visualization 4.2 already showed that the higher the average cost of attendance, the average family income of students is also higher. From the graphs below, there are a few relationships shown. First, there is a correlation between percentage of students receiving Pell grants and receiving Federal loans and family income - there are more students receiving Pell grants and Federal loans when the average family income is lower.
Note that an outlier is taken out from the following graphs for clearer visualization
Comparing Pell Grant & Federal Loan with Family Income and Cost of Attendance
6.2
Instructions: How does Xavier University of Ohio compare with the other ‘Xavier’ schools in the country? Identify at least 2 similarities and 2 differences between our Xavier University and the other Xavier schools.
Similarities:
All ‘Xavier’ institutions are private schools.
The admission rates of all ‘Xavier’ institutions are +/-10 from the average admission rate within its control type (private nonprofit).
All ‘Xavier’ institutions having Private nonprofit control type are all Universities.
Differences:
Xavier University of Ohio has the highest average family income out of all ‘Xavier’ institutions.
Xavier University of Ohio also has the highest average cost of attendance.
Xavier University of Ohio has the lowest percentage of female and first generation students.
Xavier University of Ohio has the lowest percetange of students receiving Pell Grants and Federal Student Loans.
Note that Xavier College School of Nursing of California is not included as its cost of attendance is missing in the data
Family Income and Control Type
Cost of Attendance and Family Income
| Institution Name | Family Income | Cost of Attendance |
|---|---|---|
| Xavier University | $114,330 | $50,880 |
| Xavier University of Louisiana | $49,214 | $33,593 |
| Saint Xavier University | $49,968 | $38,775 |
Cost of Attendance and Rate of Admission
| Institution Name | Admission Rate | Cost of Attendance |
|---|---|---|
| Xavier University of Louisiana | 64% | $33,593 |
| Saint Xavier University | 75% | $38,775 |
| Xavier University | 74% | $50,880 |
| Control Type | Admission Rate |
|---|---|
| Private nonprofit | 66.06% |
Female and First Generation Rates
| Institution Name | Female | First Generation |
|---|---|---|
| Saint Xavier University | 66% | 49% |
| Xavier College School of Nursing | 78% | 44% |
| Xavier University | 57% | 16% |
| Xavier University of Louisiana | 75% | 27% |
Pell Grant and Federal Student Loan Rates
| Institution Name | Pell Grant Rate | Federal Loan Rate |
|---|---|---|
| Saint Xavier University | 54% | 77% |
| Xavier College School of Nursing | 82% | 100% |
| Xavier University | 16% | 55% |
| Xavier University of Louisiana | 54% | 67% |
Control Type and Institution Type
| Institution Name | University | Control Type |
|---|---|---|
| Saint Xavier University | TRUE | 2 |
| Xavier University | TRUE | 2 |
| Xavier University of Louisiana | TRUE | 2 |
6.3 Additional Self-Directed Analysis Using a Framework
6.3.1 Self-Directed Analysis #1
Question: What is the relationship between first generation rates, dependency rates, average age of entry and the average family income? Does the relationship make sense?
I find this question interesting because we already know that the higher the average family income, the higher the dependency rates. This question intends to see whether this correlation relates to the percentage of first generations.
Proposed Solution I intend to answer this question by doing two separate analysis. First, I will filter out rows that have zero percents of first generation rates and of dependency rates as well as rows that have $0 cost of attendance for more accurate results. Next, I use group_by() function to include average family income (for x-axis) and dependency rates (for factoring). Then, I will summarize the data by using first generation rates (y-axis) to see how it depends on the average family income. On the second analysis, the only difference is with the age of entry instead of dependency rates. Here, I will filter out AGE_ENTRY that are zero instead of the dependency rates. And the rest of wrangling is the same, only substitute DEPENDENT with AGE_ENTRY. Finally, I will choose to use point graph since all FAMINC, DEPENDENT, and AGE_ENTRY are all continuous variables.
Analysis From the result below, it shows that the higher the average family income, the higher the dependency rates, the lower the first generation rates, and the lower the average age of entry. This makes sense as first generations are usually older and are more independent.
6.3.2 Self-Directed Analysis #2
Question: How are SAT scores related to the percentage of student receiving Pell grants?
As we already know from Visualization 4.4 that the higher the average family income, the higher the SAT score seems to be, and from Analysis 6.1 that the higher the average family income, the lower the percentage of students receiving Pell grants. This question is intended to find out whether there is any correlation between the average SAT scores and the percentage of students receiving Pell grant.
Proposed Solution How I intend to answer this question is to first filter out rows with missing average SAT scores and percentage of Pell grants for accurate analysis, as well as to filter out an outlier of an SAT score that scored less than 750 points. Then, I will group data by the average SAT score and summarize the percentage of Pell grants received to see how Pell grant rates depend on the SAT score. Finally, I choose to use point graph as both SAT_AV and PCTPELL are continuous variables.
Analysis As shown in the graph below, it shows that most students who received Pell grants are those who scored lower in on the SAT exam. As the smooth graph shows, there is a downward slope indicating that the higher the average SAT scores, it is less likely for students to receive Pell grants.
Fun Stuff
This section exists just to explore fun data.
Latitude and Longitude Graph
The graph below plots out latitude and longitude of each institution in the data set. Its aesthetics shows the map of the United States. From the graph, it looks like most of the institutions from the data set are mostly from the Eastern side of the U.S.
Note: longitudes and latitudes of some institutions seem to be recorded incorrectly making the graphs look odd, so I had to filter out those institutions to get the map below.