This document was created for BAIS 496 Exam 1 to explore data from the “collegescorecard” data set and create a deliverable based on our findings. In the segments to follow I will go over the process I took to clean the data given and who I choose to answer and create each of my visuals for each question.
| Packages | Description |
|---|---|
| RMySQL | Allows for interface with SQL |
| readxl | The goal of readr is to provide a fast and friendly way to read rectangular data (like csv, tsv, and fwf) |
| tidyverse | The tidyverse is a coherent system of packages for data manipulation, exploration and visualization that share a common design philosophy |
| dplyr | provides a consistent set of verbs that help you solve the most common data manipulation challenges |
| DT | R data objects (matrices or data frames) can be displayed as tables on HTML pages, and DataTables provides filtering, pagination, sorting, and many other features in the tables |
| naniar | provides data structures and functions that facilitate the plotting of missing values and examination of imputations |
Everyone does data cleaning differenly and I am no exception. Where I started was with data uniformity. Some of the data read in with SQL syntax i.e “Null”. I converted these syntaxual errors to match R syntax and the majority of the data by replacing “NUll” values with “NA”. From there I unified the Control variable changing the numeric vaule into their character value, replaceing what would return 7 possible outcomes to what should have been 3. I accomplished this by using gsub. Next I did the same for the Locale variable. I found this to be more pleasing to the eye in my visualizations below, though not without concequence that I will explain with said visual. The final piece of cleaning I did early on was converting the variable UGDS(undergraduate degree seeking) from a character value to a numeric vaule
| Packages | Description |
|---|---|
| CITY | City |
| STABBR | State postcode |
| ZIP | ZIP code |
| CONTROL | Control of institution |
| 1 | Public |
| 2 | Private nonprofit |
| 3 | Private for-profit |
| LOCALE | Location/locale of Insitution |
| 11 | City: Large (population of 250,000 or more) Shown/used on consumer website. |
| 12 | City: Midsize (population of at least 100,000 but less than 250,000) |
| 13 | City: Small (population less than 100,000 |
| 21 | Suburb: Large (outside principal city, in urbanized area with population of 250,000 or more) |
| 22 | Suburb: Midsize (outside principal city, in urbanized area with population of at least 100,000 but less than 250,000) |
| 23 | Suburb: Small (outside principal city, in urbanized area with population less than 100,000) |
| 31 | Town: Fringe (in urban cluster up to 10 miles from an urbanized area) |
| 32 | Town: Distant (in urban cluster more than 10 miles and up to 35 miles from an urbanized area) |
| 31 | Town: Remote (in urban cluster more than 35 miles from an urbanized area) |
| 41 | Rural: Fringe (rural territory up to 5 miles from an urbanized area or up to 2.5 miles from an urban cluster) |
| 42 | Rural: Distant (rural territory more than 5 miles but up to 25 miles from an urbanized area or more than 2.5 and up to 10 miles from an urban cluster) |
| 43 | Rural: Remote (rural territory more than 25 miles from an urbanized area and more than 10 miles from an urban cluster) |
| LATITUDE | Latitude |
| LONGITUDE | Longitude |
| HBCU | Historically Black College and University |
| MENONLY | Flag for men-only college |
| WOMENONLY | Flag for women-only college |
| ADM_RATE | Admission rate |
| ACTCM25 | 25th percentile of the ACT cumulative score |
| ACTCM75 | 75th percentile of the ACT cumulative score |
| ACTCMMID | Midpoint of the ACT cumulative score |
| SAT_AVG | Average SAT equivalent score of students admitted |
| UGDS | Enrollment of undergraduate certificate/degree-seeking students |
| COSTT4_A | Average cost of attendance (academic year institutions) |
| AVGFACSAL | Average faculty salary |
| PCTPELL | Percentage of undergraduates who receive a Pell Grant |
| PCTFLOAN | Percent of all undergraduate students receiving a federal student loan |
| AGE_ENTRY | Average age of entry |
| FEMALE | Share of female students |
| MARRIED | Share of married students |
| DEPENDENT | Share of dependent students |
| VETERAN | Share of veteran students |
| FIRST_GEN | Share of first-generation students |
| FAMINC | Average family income in real 2015 dollars |
(datatable(df))
The following portion of this page will be the visualizations I used to answer each question given in our exam with my rational behind using the plots I choose if need be.
Show the number of institutions in Ohio and in each of the states that borders Ohio
Illustrate how the cost for attendance varies by family income for all institutions
Compare the number of undergraduates across each of the 3 institutional control types for all institutions
Show a relationship between ACT or SAT scores and family income across each of the states that border the state of Ohio.
5.1 Do you find support for the old adage: “Private schools cost more than public schools” Explain.
Based on the above boxplots, outliers aside, it would appear that the old adage does hold some weight as the mean for both control types of private schools far exceed the mean for public schools
5.2 How does the average family income of students at Xavier University compare nationally? Within Ohio?
## [1] 193
The avgerage family income at Xavier far exceeds the national average.
5.3 How does the cost of attending an Ohio ‘university’ compare to universities in states that border Ohio? What about universities nationally, not considering state?
The average cost of attendance throughout Ohio and the States that border it are fairly similar with Iniana having the highest average cost of attendance and West Virgina the lowest.
5.4 What schools have the highest and lowest percentage of undergraduate students receiving a Pell grant?
Above are datatables for school that have the highest possible precentage (100%) of students receiving Pell Grants and the lowest possible (0%) respectivley6.1 Compare the average cost of attendance across the number of undergraduates, the percent of students receiving a Pell grant, the average faculty salary and the average family income in whatever way you choose. If one of these variables was classified as a ‘dependent’ variable, which would you say it is and how would you evaluate the effect of the other variables on your dependent variable?
Compare the student populations of schools in heavily urbanized areas with those in very rural areas. Keep in mind, the type of school varies considerably by urban and rural areas. Do your best to control for this bias with the variables you have available to focus on differences within the populations of urban and rural schools and NOT the differences between the type of school.
For my additional reseach I wanted to see how the average ACT score and SAT score compared to the Average Cost of Attendance. In other words did those who scored higher on their standardized tests choose to go to schools that had a high cost of attendance.
While this reseach doesn’t provide a reason for why it can lead to some speculation and further research.
I hope my research was able to provide answers to the above questions and I thank anyone who made it this far for looking at my work!!!