R Markdown

Step 1.1: Provide an Introduction that explains the purpose of the document

The purpose of this document is to present and analyze data on how institutions of post-secondary education in Ohio compare with other schools regionally and nationally.

#Step 1.2: Provide a short explanation of the data used.

The data I will be using come from the U.S. Department of Education. Part of this data is collected and reported by the Department of Education and covers institutions of post-secondary education.

#Step 1.3: Explain how your analysis will help the individual better understand how institutions of post-secondary education in Ohio compare to other schools regionally and nationally

My analysis will show how institutions of post-secondary education in Ohio compare with other schools regionally and nationally by examining several variables, such as household income, the type of university, and the cost of attendance. Comparing these variables and others will allow me to identify which factors drive the differences between post-secondary schools in Ohio and schools elsewhere in the region and the nation.

#2.1: All Packages are identified and loaded upfront so the reader knows which are required to replicate the analysis.

## -- Attaching packages ----------------------------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.2.1     v purrr   0.3.3
## v tibble  2.1.3     v dplyr   0.8.3
## v tidyr   1.0.0     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0
## -- Conflicts -------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Step 2.3: Explanation of the purpose of each package

I loaded the tidyverse package because it makes it easy to install and load the core tidyverse packages in a single step.

I loaded the dplyr package because it provides a flexible grammar of data manipulation. Its core verbs cover the most important data manipulation tasks and make them easy to use in R.

I loaded the skimr package because its functions take data frames, return data frames, and work well in a pipeline. It also provides a strong set of summary statistics for the kind of data set I will be using.

I loaded the stringr package because its functions handle NA values and zero-length vectors in the same way, and the output from one function is easy to feed into the input of another.

I loaded the ggplot2 package in order to map variables to aesthetics, choose which graphical primitives to use, and create graphs for the data I will be using. It also takes care of the details of getting variables into the graphs.

I loaded the knitr package because it is an alternative to Sweave with a more flexible design and features like caching and finer control of graphics. It is used for dynamic report generation in R, specifically here with the College data for this exam.
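The loading step described above could be sketched as follows. This is a minimal sketch, not the report's actual chunk: the package list is taken from the explanations in this section, and `DT` is an assumption based on the data table requested in step 3.4.

```r
# Packages named in this section; "DT" is an assumption based on step 3.4.
required <- c("tidyverse", "skimr", "knitr", "DT")

# requireNamespace() returns FALSE (instead of stopping) when a package is absent.
installed <- vapply(required, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- required[!installed]

if (length(missing_pkgs) > 0) {
  message("Install before knitting: ", paste(missing_pkgs, collapse = ", "))
}

# Attach only the packages that are available, so the sketch runs on any machine.
for (pkg in required[installed]) {
  suppressPackageStartupMessages(library(pkg, character.only = TRUE))
}
```

Checking for missing packages first means a reader who tries to replicate the analysis gets one clear message rather than a failure partway through knitting.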

3.1:Import the Data

3.2: Data Conformity

#Step 3.2: Dummy Variables

##3.3: Source data is thoroughly explained: what this output is showing

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
##    321.4  22668.0  31447.5  38482.7  48098.7 174263.2      500
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     564    1045    1117    1132    1195    1558    5795
##    Length     Class      Mode 
##      7115 character character
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   11.00   12.00   21.00   19.81   22.00   43.00     451
##  num [1:7115] 100654 100663 100690 100706 100724 ...

##3.3: Source Data is thoroughly explained:

The analysis I am conducting will give students and parents a better understanding of how institutions of post-secondary education in Ohio compare with other schools nationally. This will help students and parents decide how interested they are in attending a school in Ohio or outside Ohio.

The metrics I am looking at are school ID, institution, city, state, Costcode, locale of the institution, latitude and longitude of the school, whether the school is a historically black college or university, whether it is a female-only or male-only college or university, whether the school's admission rate is above or below 50%, whether the student population of the school is below or at least 7,000 students, average SAT score of students admitted to the school, percentage of undergraduates who received a Pell Grant, percentage of undergraduates receiving a federal student loan, average family income in real 2015 U.S. dollars, and number of first-generation students at the university or college.
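The two threshold indicators in the list above (admission rate and student population) could be built roughly like this. This is a sketch under assumptions: the column names `ADM_RATE` and `UGDS`, the new indicator names, the toy values, and the choice to count a value exactly on the threshold as "above" are all illustrative and not taken from the report.

```r
# Toy stand-in for the scorecard data; ADM_RATE and UGDS are assumed column names.
schools <- data.frame(
  INSTNM   = c("School A", "School B", "School C", "School D"),
  ADM_RATE = c(0.35, 0.72, NA, 0.50),
  UGDS     = c(12000, 3500, 7000, NA)
)

# 1 when the admission rate is at or above 50%, 0 when below; NA stays NA.
schools$ADM_ABOVE_50 <- as.integer(schools$ADM_RATE >= 0.5)

# 1 when the school enrolls at least 7,000 students, 0 otherwise; NA stays NA.
schools$POP_7000_PLUS <- as.integer(schools$UGDS >= 7000)

schools
```

Keeping missing values as `NA` rather than coding them as 0 avoids counting schools with unreported data as "below" the threshold.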

3.4: Summarize the results into a clean data table (DT) that can be easily manipulated by a web user to search, sort, and filter the data.

#3.5: Provide summary information about variables of concern in your clean data set.

For this exam, the variables I summarized were the average family income at a university, the average SAT score at a university, the ZIP code of the university, and the locale classification of the university. The average family income of students attending a college or university in this dataset was about $38,482.70. The average SAT score of students attending a college or university was 1132, while the mean locale code (19.81) placed the average institution between a small city and a large suburb.

Visualization 4.1: Show the number of institutions in Ohio and in each state that borders Ohio

#Visualization 4.2: Illustrate how cost for attendance varies by family income for all institutions

#Visualization 4.3: Compare the number of undergraduates across each of the 3 institutional control types for all institutions

Visualization #4.4: Show a relationship between ACT or SAT scores and family income across each of the states that border the state of Ohio

#5.1 Visualization: Do you find support for the old adage “Private schools cost more than public schools”? Explain.

The approach I used for this problem was to create a boxplot to determine whether private schools cost more than public schools. I first filtered by the Control variable so the boxplot would compare private and public schools specifically: 1 is public schools, 2 is private nonprofit schools, and 3 is private for-profit schools. The visualization showed that private nonprofit schools were in the upper quartile group in terms of cost of attendance. I concluded that private schools cost more than public schools, and that there is greater variability in cost of attendance for private schools than for public schools.
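The comparison described above can be sketched in base R. The cost figures here are made-up toy values, the column name `CONTROL` is assumed from the Control variable mentioned in the text, and `boxplot()` stands in for whatever plotting call the report actually used.

```r
# Toy data: three schools per control type (values are illustrative only).
costs <- data.frame(
  CONTROL  = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  COSTT4_A = c(18000, 21000, 24000, 30000, 45000, 62000, 22000, 27000, 33000)
)

# Replace the numeric codes with readable labels, as described in the text.
costs$CONTROL <- factor(
  costs$CONTROL,
  levels = 1:3,
  labels = c("Public", "Private nonprofit", "Private for-profit")
)

# plot = FALSE returns the five-number summaries without drawing;
# drop that argument to draw the boxplot itself.
bp <- boxplot(COSTT4_A ~ CONTROL, data = costs, plot = FALSE)

# Row 3 of bp$stats holds the median cost for each control type.
medians <- setNames(bp$stats[3, ], bp$names)
medians
```

Comparing the medians (and the box heights) separates the two claims in the paragraph: a higher typical cost is one finding, greater spread is another.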

#Table 5.2: How does the average family income of students at Xavier University compare nationally? Within Schools of Ohio?

The approach I took to determine how the average family income of students at Xavier University compares nationally was to first narrow the data to the institution and family income variables. I also filtered by Xavier University in order to compare its average family income with that of schools nationally. Next, I computed the average family income across schools nationally to see how Xavier University compares with other schools. I used the function kable in order to generate a simple table that others can easily read and interpret. The results show that the average family income of students at Xavier University (114,329.60) was greater than the average family income at universities nationwide (38,482.72). Additionally, when compared with schools within Ohio, Xavier students still had a higher average family income (114,329.60) than the average for schools within Ohio (42,379.96).

Xavier University vs National Average

| INSTNM | FAMINC |
|---|---|
| Xavier University | 114329.6 |

Xavier University vs. National Average Family Income

| | avg_Income |
|---|---|
| National Average | 38482.72 |
| Xavier University | 114329.60 |

National Avg. FAMINC

| Average Family Income |
|---|
| 38482.72 |

| Average Family Income of Schools in Ohio |
|---|
| 42379.96 |
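The filter-then-average steps behind these tables might look like the following sketch. Only the Xavier University figure comes from the report; the other schools and incomes are hypothetical toy rows, so the averages printed here will not match the tables above.

```r
# Toy data: only the Xavier University income is taken from the report.
scorecard <- data.frame(
  INSTNM = c("Xavier University", "Ohio School B",
             "Out-of-State School C", "Out-of-State School D"),
  STABBR = c("OH", "OH", "KY", "CA"),
  FAMINC = c(114329.60, 40000, 35000, NA)
)

# Family income for the one institution of interest.
xavier_income <- scorecard$FAMINC[scorecard$INSTNM == "Xavier University"]

# National and Ohio averages, skipping schools with no reported income.
national_income <- mean(scorecard$FAMINC, na.rm = TRUE)
ohio_income     <- mean(scorecard$FAMINC[scorecard$STABBR == "OH"], na.rm = TRUE)

comparison <- data.frame(
  Group      = c("Xavier University", "Ohio average", "National average"),
  avg_Income = c(xavier_income, ohio_income, national_income)
)

# knitr::kable(comparison) would render this as a table in a knitted report.
comparison
```

Putting all three figures in one data frame gives a single table for the reader instead of several one-row tables.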

#Visualization 5.3: How does the cost of attending an Ohio University compare to universities in states that border Ohio? What about universities nationally, not considering state?

My approach for this question was to create a bar chart to display how the cost of attending an Ohio university compares with universities in states that border Ohio. I first filtered for Ohio and each state that borders it using the variable STABBR. Then I took the mean of the cost of attendance variable, so that the y-axis of the bar chart would show the average cost of attendance for universities in each state. From the bar chart, I determined that the cost of attending a university in Ohio is similar to that in most of the states that border Ohio. However, two data points stood out: universities in Pennsylvania had the highest average cost of attendance among Ohio and its bordering states, and universities in Indiana had the second highest.

Compared with universities nationally, the cost of attending a university in Ohio was just slightly higher than the cost of attending any other university.
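A sketch of the filter-and-average step described above, in base R. The cost values are toy numbers chosen for illustration, so the ranking they produce is not evidence for the findings in the text.

```r
# Ohio plus the five states that border it.
border_states <- c("OH", "IN", "KY", "MI", "PA", "WV")

# Toy data (illustrative costs only), including two non-bordering states.
scorecard <- data.frame(
  STABBR   = c("OH", "OH", "PA", "PA", "IN", "KY", "MI", "WV", "CA", "TX"),
  COSTT4_A = c(26000, 30000, 36000, 40000, 33000, 25000, 27000, 24000, 31000, 22000)
)

# Keep only Ohio and its neighbors, then average cost of attendance by state.
regional <- scorecard[scorecard$STABBR %in% border_states, ]
avg_cost <- aggregate(COSTT4_A ~ STABBR, data = regional, FUN = mean)
avg_cost <- avg_cost[order(-avg_cost$COSTT4_A), ]

# National benchmark: the mean over every institution, ignoring state.
national_avg <- mean(scorecard$COSTT4_A, na.rm = TRUE)

# barplot(avg_cost$COSTT4_A, names.arg = avg_cost$STABBR) would draw the chart.
avg_cost
```

Sorting the state averages before plotting makes the highest and second-highest states easy to read off the chart.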

#Visualization 5.4: What schools have the highest and lowest percentage of undergraduate students receiving a Pell Grant

The approach I took for this problem was to create two tables: one for the schools with the highest percentage of undergraduate students receiving a Pell Grant and one for the schools with the lowest percentage. In these tables, a PCTPELL value of 1 is the highest possible share (every undergraduate receives a Pell Grant) and a value of 0 is the lowest (no undergraduate receives one). I took the 10 schools with the highest percentage and the 10 schools with the lowest percentage. For example, MTI Business College, Mr. Bela’s School of Cosmetology, and Southern School of Beauty were among the schools with the highest percentage of undergraduates receiving a Pell Grant, while Bais Binyomin Academy and United States Coast Guard Academy were among the schools with the lowest percentage.

| INSTNM | PCTPELL |
|---|---|
| MTI Business College Inc | 1 |
| Mr Bela’s School of Cosmetology Inc | 1 |
| Southern School of Beauty Inc | 1 |
| Victoria Beauty College Inc | 1 |
| Central School of Practical Nursing | 1 |
| Virginia School of Hair Design | 1 |
| Instituto de Educacion Tecnica Ocupacional La Reine-Manati | 1 |
| Colegio Mayor de Tecnologia Inc | 1 |
| Liceo de Arte-Dise-O y Comercio | 1 |
| Nouvelle Institute | 1 |

| INSTNM | PCTPELL |
|---|---|
| Bais Binyomin Academy | 0 |
| United States Coast Guard Academy | 0 |
| American Islamic College | 0 |
| Principia College | 0 |
| The Southern Baptist Theological Seminary | 0 |
| New Orleans Baptist Theological Seminary | 0 |
| United States Naval Academy | 0 |
| MGH Institute of Health Professions | 0 |
| Saint John’s Seminary | 0 |
| Hillsdale College | 0 |
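Tables like the two above could be produced by sorting on `PCTPELL` and taking the first rows. The sketch below uses hypothetical school names and values, and takes the top and bottom 2 rather than 10 to keep the toy data small.

```r
# Toy data with hypothetical schools; PCTPELL is a proportion between 0 and 1.
pell <- data.frame(
  INSTNM  = c("School A", "School B", "School C", "School D", "School E"),
  PCTPELL = c(1, 0.45, 0, 0.80, NA)
)

# Drop schools that did not report a Pell share before ranking.
pell <- pell[!is.na(pell$PCTPELL), ]

n <- 2  # the report used 10

# Highest shares first, then lowest shares first.
highest <- head(pell[order(-pell$PCTPELL), ], n)
lowest  <- head(pell[order(pell$PCTPELL), ], n)

highest
lowest
```

When many schools tie at exactly 1 or exactly 0, as in the tables above, which ten appear depends on the row order of the data, so the lists are examples of schools at each extreme rather than a strict ranking.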

6.1

#6.2

Average Student Population in Urban vs. Rural Schools

| | Count | Family Income | First Gen | Dependent | Married | Female Prop. | Age Entry | % FLOAN | % Pell |
|---|---|---|---|---|---|---|---|---|---|
| Urban | 1570 | 37103.87 | 0.4625891 | 0.4538432 | 0.1652142 | 0.6533333 | 26.56301 | 0.5064058 | 0.4964127 |
| Rural | 62 | 32992.44 | 0.4334615 | 0.5553846 | 0.1518182 | 0.5854717 | 24.75911 | 0.2707018 | 0.4519298 |

Bonus

There is statistical evidence that the share of students receiving federal loans is greater at schools in an urban city (mean 0.506) than at schools in a remote rural setting (mean 0.271). A one-sided Welch two-sample t-test gives t = 5.77 and a p-value of 1.48e-07.

## 
##  Welch Two Sample t-test
## 
## data:  urban$PCTFLOAN and rural$PCTFLOAN
## t = 5.7733, df = 59.765, p-value = 1.48e-07
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  0.1674934       Inf
## sample estimates:
## mean of x mean of y 
## 0.5064058 0.2707018
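Output of this form comes from a one-sided Welch test, which can be sketched as below. The `urban` and `rural` data frames here are simulated stand-ins (the group sizes, means, and spread are assumptions), so the numbers will differ from the output above.

```r
set.seed(42)

# Simulated loan shares, clamped to the valid range [0, 1].
clamp01 <- function(x) pmin(pmax(x, 0), 1)
urban <- data.frame(PCTFLOAN = clamp01(rnorm(200, mean = 0.50, sd = 0.25)))
rural <- data.frame(PCTFLOAN = clamp01(rnorm(40,  mean = 0.27, sd = 0.25)))

# t.test() defaults to var.equal = FALSE, which is the Welch version.
# alternative = "greater" tests whether the urban mean exceeds the rural mean.
res <- t.test(urban$PCTFLOAN, rural$PCTFLOAN, alternative = "greater")

res$statistic  # t value
res$p.value    # one-sided p-value
res$conf.int   # lower bound only; the upper bound is Inf for a one-sided test
```

The Welch form is a reasonable default here because the two groups differ greatly in size (1570 urban versus 62 rural schools in the table above), so equal variances should not be assumed.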

6.3 - Question 1: Do students who perform well on the SAT tend to attend schools with larger costs of attendance?

I find this question interesting because SAT performance could be a deciding factor in school choice: students who performed well on the SAT may select schools that cost more than the schools chosen by students who scored lower. I intend to answer this question using a scatter plot, and I will be using the ggplot2 package to display my findings. The two variables I will be using are the average SAT score and the average cost of attendance at a university.

## 
## Call:
## lm(formula = COSTT4_A ~ SAT_AVG, data = CollegeScoreCard)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -35723 -11413   1585  10667  34862 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -35056.963   3086.407  -11.36   <2e-16 ***
## SAT_AVG         62.418      2.711   23.03   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12740 on 1306 degrees of freedom
##   (5807 observations deleted due to missingness)
## Multiple R-squared:  0.2887, Adjusted R-squared:  0.2882 
## F-statistic: 530.2 on 1 and 1306 DF,  p-value: < 2.2e-16

The analysis I performed for this question was to find out whether students who perform well on the SAT tend to attend schools with larger costs of attendance. I made a scatter plot and added a trend line to visualize my findings, using the average SAT score and the average cost of attendance at a university. The results show a positive relationship: in the regression above, each additional point of average SAT score is associated with about $62 more in cost of attendance, and SAT score explains about 29% of the variation in cost (R-squared = 0.2887). Although there is a large cluster of schools between average SAT scores of 1000 and 1250, the graph does a good job of indicating that schools with higher average SAT scores tend to have a higher cost of attendance. To take this question beyond the scatter plot, I would use a t-test, for example by splitting schools into a higher-SAT group and a lower-SAT group. A t-test would show whether there is a statistically significant difference in mean cost of attendance between the two groups.
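The regression and correlation behind this kind of scatter plot can be sketched with simulated data. The values below are generated for illustration (the slope, intercept, and noise level are assumptions loosely echoing the output above), and the data frame name `toy` is hypothetical.

```r
set.seed(1)

# Simulated schools: cost rises with SAT score, plus noise.
sat  <- round(runif(100, min = 800, max = 1500))
cost <- -35000 + 60 * sat + rnorm(100, sd = 12000)
toy  <- data.frame(SAT_AVG = sat, COSTT4_A = cost)

# Same model form as in the report: cost regressed on average SAT score.
fit <- lm(COSTT4_A ~ SAT_AVG, data = toy)

slope     <- coef(fit)[["SAT_AVG"]]   # dollars per additional SAT point
r_squared <- summary(fit)$r.squared   # share of cost variation explained

# For two numeric variables, a correlation test is a natural companion check.
cor_test <- cor.test(toy$SAT_AVG, toy$COSTT4_A)

slope
r_squared
cor_test$estimate
```

With one predictor, R-squared is simply the squared correlation, so the regression summary and the correlation test describe the same relationship in two ways.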

#6.3 - Question 2: How do Alabama and Tennessee compare in terms of the percent of students receiving a federal loan by age of entry?

I find this question interesting because I would like to see whether two southern states (Alabama and Tennessee) show a correlation between age of entry and the percent of students who receive a federal loan. In particular, I want to see whether the percent of students receiving a federal loan is higher at schools where students enter at an older age and lower at schools where students enter younger. I intend to answer this question using a scatter plot, and I will be using the ggplot2 package to display my findings. The two variables I will be using are the percent of undergraduate students receiving a federal loan and the average age of entry at a university. Additionally, I will use a group by to distinguish the two states in the graph and in the values I pull from the College Score Card data table.

| STABBR | count | mean_pctfloan | sd_pctfloan | mean_age_entry | sd_age_entry |
|---|---|---|---|---|---|
| AL | 94 | 0.4793023 | 0.307291 | 25.42228 | 4.063865 |
| TN | 176 | 0.4557407 | 0.313186 | 25.87043 | 3.377343 |
## 
##  Welch Two Sample t-test
## 
## data:  alabama$PCTFLOAN and tennessee$PCTFLOAN
## t = 0.57087, df = 176.29, p-value = 0.5688
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.05789123  0.10501440
## sample estimates:
## mean of x mean of y 
## 0.4793023 0.4557407
## 
##  Welch Two Sample t-test
## 
## data:  alabama$AGE_ENTRY and tennessee$AGE_ENTRY
## t = -0.89566, df = 162.6, p-value = 0.3718
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.4361901  0.5398857
## sample estimates:
## mean of x mean of y 
##  25.42228  25.87043

From my findings, there is no clear correlation between age of entry and the percent of undergraduate students receiving a federal loan at universities in either Alabama or Tennessee. In the scatter plot, schools in Tennessee appear to have a slightly higher percent of undergraduates receiving a federal loan than schools in Alabama, apart from one major Alabama outlier, although the group means (0.479 for Alabama and 0.456 for Tennessee) do not bear this out. The statistical analysis I performed for this question was a t-test, using Alabama and Tennessee as the two vectors of data to compare. I also built a table of the mean and standard deviation for each state in order to compare the percent of federal loans received and the age of entry. The Welch t-tests above show no statistically significant difference between Alabama and Tennessee in either the percent of undergraduates receiving a federal loan (p = 0.5688) or the age of entry (p = 0.3718), since both p-values are well above 0.05.
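The per-state summary table, the two-state t-test, and a direct check of the age-versus-loan relationship could be sketched as follows. The data are simulated (group sizes and distributions are assumptions), and the correlation test is a suggested addition rather than something the report ran.

```r
set.seed(7)

# Simulated schools in two states (values are illustrative only).
states <- data.frame(
  STABBR    = rep(c("AL", "TN"), times = c(30, 50)),
  PCTFLOAN  = runif(80),
  AGE_ENTRY = rnorm(80, mean = 25.5, sd = 3.5)
)

# One summary row per state: count, then mean and sd of each variable.
summary_tbl <- do.call(rbind, lapply(split(states, states$STABBR), function(d) {
  data.frame(
    STABBR         = d$STABBR[1],
    count          = nrow(d),
    mean_pctfloan  = mean(d$PCTFLOAN, na.rm = TRUE),
    sd_pctfloan    = sd(d$PCTFLOAN, na.rm = TRUE),
    mean_age_entry = mean(d$AGE_ENTRY, na.rm = TRUE),
    sd_age_entry   = sd(d$AGE_ENTRY, na.rm = TRUE)
  )
}))

# Two-sided Welch t-test: do the states differ in mean loan share?
loan_test <- t.test(PCTFLOAN ~ STABBR, data = states)

# Suggested extra check: is age of entry related to loan share across schools?
age_cor <- cor.test(states$AGE_ENTRY, states$PCTFLOAN)

summary_tbl
loan_test$p.value
age_cor$estimate
```

The t-tests compare the two states with each other, while the correlation test addresses the "by age of entry" part of the question, so the two checks answer different things.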