library(readr)
High_Schools <- read_csv("~/Documents/High_Schools.csv")
My hypothesis is that average ACT score by zip code is correlated with population: the larger a zip code’s population, the higher its average ACT score.
To test my hypothesis, I created a map in Tableau that shows the average ACT scores by zip code in Georgia for 2016 alongside population by zip code. I am using a data set about high schools from the Atlanta Open Data portal. Each row of the data set is a public school in Georgia. Most public schools in Georgia are in the data set, but not all. I will use the average high school ACT composite score and the zip code variables to conduct statistical analysis in R. Additionally, I will use the school city variable to determine which cities have the most schools in the data set and to compare zip codes within cities, to see whether more urban zip codes score differently from more rural zip codes.
The average composite ACT score for a student in the U.S. in 2016 was 20.8. https://www.collegeraptor.com/getting-in/articles/act-sat/national-average-scores-for-the-act-and-sat-2016-data/ As a result, we can classify any zip code with an average below 20.8 as underperforming and any zip code with an average above 20.8 as overperforming. I will generate bar charts to see which zip codes have the lowest and highest averages, and use them to judge whether there is a correlation between average ACT composite score and population. I will also analyze how zip codes within a major city such as Atlanta perform in comparison to rural cities such as Warrenton and Washington, to see if there is a difference between zip codes in major cities and zip codes in small cities. Tableau only gives me population ranges for the zip codes, so I will often use sources such as https://www.georgia-demographics.com for reasonably accurate population estimates. Additionally, it is very hard to find 2016 population estimates for zip codes online or in Tableau, so I will use the most recent data available.
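The classification rule described above can be sketched in a few lines of dplyr. This is only an illustration under stated assumptions: `toy_schools` is a made-up stand-in for `High_Schools`, the names `national_avg`, `zip_avg`, and `status` are hypothetical, and a zip code exactly at 20.8 is treated as overperforming, which the rule above does not specify.

```r
library(dplyr)

# Made-up rows standing in for High_Schools; not real Georgia data.
toy_schools <- data.frame(
  SCHOOL_ZIP = c("30001", "30001", "30002", "30003"),
  AvgHi_ACT_Composite_Score_2016 = c(22.0, 24.0, 17.5, NA),
  stringsAsFactors = FALSE
)

national_avg <- 20.8  # 2016 national average composite score cited above

zip_status <- toy_schools %>%
  group_by(SCHOOL_ZIP) %>%
  # na.rm = TRUE ignores schools that did not report a score
  summarize(zip_avg = mean(AvgHi_ACT_Composite_Score_2016, na.rm = TRUE)) %>%
  # A zip code with no reported scores at all yields NaN; drop it
  filter(!is.nan(zip_avg)) %>%
  mutate(status = if_else(zip_avg < national_avg,
                          "underperforming", "overperforming"))

print(zip_status)
```

Using `na.rm = TRUE` keeps a zip code in the comparison when only some of its schools are missing a score, which is a different choice from dropping the whole zip code.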
Here is a link to my map in Tableau:
Based on the map, the larger circles are generally in zip codes with larger populations. However, there are exceptions such as 30213, which has an average ACT composite score of about 17 and a population of over 18,000 people. There are also cases such as 30622, which has an average ACT composite score of 23 but a population of less than 18,000 people. The map also shows that in zip codes with populations of less than 18,000 people, the average ACT score is generally not higher than 20. An exception is 31516, which has an average ACT composite score of about 21. While larger populations are associated with higher average ACT scores in a large number of cases, it does appear that income may be a better predictor.
I wrote code below to take the mean of the average 2016 ACT composite scores by zip code. I used the group_by function to group the schools by zip code, and the summarize function to take the mean of the average 2016 ACT composite scores within each zip code. The filter function was used to tell R to disregard zip codes with missing or unreported 2016 average ACT composite scores. Because the results are sorted in descending order, the tail function gave me the 12 lowest averages.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.2.1 ✓ dplyr 0.8.3
## ✓ tibble 2.1.3 ✓ stringr 1.4.0
## ✓ tidyr 1.0.0 ✓ forcats 0.4.0
## ✓ purrr 0.3.3
## ── Conflicts ────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
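High_Schools %>%
  group_by(SCHOOL_ZIP) %>%
  # Mean of the school-level averages within each zip code
  summarize(total = mean(AvgHi_ACT_Composite_Score_2016)) %>%
  arrange(desc(total)) %>%
  # Drop zip codes without a usable 2016 score (NA rows fail this test)
  filter(total > 1) %>%
  # After the descending sort, the last 12 rows are the lowest averages
  tail(12) %>%
  ggplot(aes(reorder(SCHOOL_ZIP, total), total)) + geom_col() + coord_flip() +
  labs(title = "Lowest 12 GA Zip Codes in Terms of 2016 Average ACT Composite Score",
       x = "Zip Code",
       y = "2016 Average ACT Composite Score")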
According to the bar chart, 31087 had the lowest mean average ACT composite score for 2016 in Georgia. The next three zip codes with the lowest mean average ACT composite scores for 2016 were 31907, 39840, and 31044. 31087, which had an average score of about 15, only has a population of between 5,010 and 18,000 people according to the map in Tableau. 31907 fits into the range of 20,100-124,000 people, with a population of about 56,000. https://www.georgia-demographics.com/31907-demographics The population of 39840 was about 5,000, which is closer to what we would expect if smaller populations go along with lower average ACT scores. https://www.georgia-demographics.com/39840-demographics 31044 has a population in the range of 1,640-5,010 according to Tableau. Since one of the four lowest zip codes in terms of average ACT score has a population above 50,000, it appears that income may be a better predictor than population. However, a large number of the zip codes that are underperforming in terms of ACT scores do have small populations.
I wrote code below to generate a bar chart of the 12 zip codes with the highest average ACT composite scores in 2016. I did this by grouping by zip code and using the summarize function to calculate the mean for each zip code. After sorting in descending order, the head function was used to tell R to show the 12 zip codes with the highest averages.
High_Schools %>%
  group_by(SCHOOL_ZIP) %>%
  summarize(total = mean(AvgHi_ACT_Composite_Score_2016)) %>%
  # Descending sort puts the highest averages first (NA rows sort last)
  arrange(desc(total)) %>%
  head(12) %>%
  ggplot(aes(reorder(SCHOOL_ZIP, total), total)) + geom_col() + coord_flip() +
  labs(title = "Top 12 GA Zip Codes in Terms of 2016 Average ACT Composite Score",
       x = "Zip Code",
       y = "2016 Average ACT Composite Score")
According to the bar chart, the zip codes with the 5 highest average ACT composite scores in 2016 were 30044, 30022, 30062, 30269, and 30005. According to the map in Tableau, all 5 of these zip codes have at least 18,000 people. In particular, 30044 has a very high population of 88,040 people. https://www.point2homes.com/US/Neighborhood/GA/Gwinnett-County/30044-Demographics.html The population of 30269, the fifth-highest zip code, is 35,904, which is considerably lower than that of 30044. However, it is still higher than most of the populations of the zip codes that are performing the worst in terms of average ACT composite scores. https://www.point2homes.com/US/Neighborhood/GA/Fayette-County/30269-Demographics.html It appears that the highest-performing zip codes generally have populations greater than 18,000 people.
I wrote code below to determine which cities have the most high schools represented in the data set. I used the group_by function to group the schools by city. The count function was used to determine the number of high schools each city has in the data set, sorted from most to fewest. The head function was used to tell R to only print the top 12 cities in terms of number of high schools in the data set.
High_Schools %>%
group_by(SCHOOL_CITY) %>%
count(SCHOOL_CITY, sort=TRUE) %>%
head(12)
## # A tibble: 12 x 2
## # Groups: SCHOOL_CITY [12]
## SCHOOL_CITY n
## <chr> <int>
## 1 Atlanta 28
## 2 Augusta 13
## 3 Savannah 12
## 4 Columbus 10
## 5 Lawrenceville 10
## 6 Marietta 9
## 7 Macon 8
## 8 Douglasville 6
## 9 Gainesville 6
## 10 Stone Mountain 6
## 11 Conyers 5
## 12 Cumming 5
Based on the table, it appears that the biggest cities in Georgia are represented the most among the schools in the data set. Atlanta is by far the biggest city, with a population of about 479,000. https://www.georgia-demographics.com/atlanta-demographics All of the cities in the top 12 are either in the Atlanta metropolitan area or among Georgia’s larger cities, such as Atlanta, Augusta, and Savannah.
I wrote code below to determine which cities have the fewest high schools represented in the data set. The group_by function was used to group my results by school city, along with the count function, which calculated the number of high schools in each city. The tail function was used to tell R to only show the last 12 rows of the sorted counts. However, a good number of cities only have one high school in the data set, and R keeps these tied cities in alphabetical order. As a result, the 12 cities shown are simply the single-school cities at the end of the alphabet, not the only cities with one school.
High_Schools %>%
group_by(SCHOOL_CITY) %>%
count(SCHOOL_CITY, sort=TRUE) %>%
tail(12)
## # A tibble: 12 x 2
## # Groups: SCHOOL_CITY [12]
## SCHOOL_CITY n
## <chr> <int>
## 1 Twin City 1
## 2 Tyrone 1
## 3 Vidalia 1
## 4 Villa Rica 1
## 5 Warm Springs 1
## 6 Warrenton 1
## 7 Washington 1
## 8 Watkinsville 1
## 9 Waycross 1
## 10 Waynesboro 1
## 11 White 1
## 12 Wrightsville 1
Based on the table and my knowledge of Georgia, none of the 10 most populated cities in the state have only 1 school represented. In particular, Twin City only has a population of about 1,800 and Tyrone has a population of about 7,000. https://www.georgia-demographics.com/twin-city-demographics https://www.georgia-demographics.com/tyrone-demographics These population estimates make sense, since there are no big cities that only have 1 high school in the data set. While the data set does not have every public school in Georgia, the number of high schools represented for each city is fairly consistent with the city’s population. For the most part, the more high schools a city has in the data set, the larger the city’s population.
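Because of the tie among single-school cities noted above, tail(12) shows only an alphabetical slice of them. A minimal sketch of one way to list every such city is to filter on the count directly. The `toy_schools` data frame and the name `single_school_cities` are made up for illustration; on the real data the same pipeline would start from `High_Schools`.

```r
library(dplyr)

# Made-up rows standing in for High_Schools; one row per school.
toy_schools <- data.frame(
  SCHOOL_CITY = c("Atlanta", "Atlanta", "Tyrone", "White",
                  "Augusta", "Augusta", "Vidalia"),
  stringsAsFactors = FALSE
)

single_school_cities <- toy_schools %>%
  count(SCHOOL_CITY) %>%   # number of schools per city
  filter(n == 1) %>%       # keep every city that is tied at one school
  arrange(SCHOOL_CITY)

print(single_school_cities)
```

Calling nrow(single_school_cities) then gives the number of tied cities, which shows how much of the tie a 12-row table leaves out.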
I wrote code below to show a bar chart of all the zip codes in Atlanta that are represented in the data set, with their corresponding average ACT scores. The group_by function was used to compute average ACT scores by zip code, along with the filter function, which was used to only keep zip codes in Atlanta. The filter function was used again to tell R to exclude any Atlanta zip codes containing high schools that did not report 2016 average ACT scores.
High_Schools %>%
group_by(SCHOOL_ZIP) %>%
filter(SCHOOL_CITY=="Atlanta") %>%
summarize(total = mean(AvgHi_ACT_Composite_Score_2016)) %>%
filter(total>1) %>%
arrange(desc(total)) %>%
ggplot(aes(reorder(SCHOOL_ZIP,total),total)) + geom_col() + coord_flip() +
labs(title="2016 Average ACT Composite Scores For Zip Codes in Atlanta",
x="Zip Code",
y="2016 Average ACT Composite Score")
According to the bar chart, 30345 has the highest 2016 average ACT composite score of all the zip codes in Atlanta. According to the map in Tableau, 30345 has a population of 18,000-123,000 people. However, 30316, which has an average ACT score of about 17.5, has a population of about 33,000. https://www.georgia-demographics.com/30316-demographics Another example is 30314, which is also underperforming in terms of average ACT scores and has a population of about 24,000. https://www.georgia-demographics.com/30314-demographics Therefore, the population of a zip code does not appear to be a consistent factor in determining its average ACT score. Zip codes with populations greater than 18,000 can be underperforming.
I wrote code below to create a bar chart that shows the average ACT composite score for five select cities in Georgia that each have only 1 school represented in the data set. Each of the five cities appears to have only one zip code in the data set, so the zip codes shown below correspond to these five cities. To select only these five cities for the bar chart, I used the filter function with the names of the five cities. I grouped by zip code so R would calculate the average ACT composite score for each of the five cities’ zip codes.
High_Schools %>%
group_by(SCHOOL_ZIP) %>%
filter(SCHOOL_CITY %in% c("Warrenton", "Washington", "White", "Vidalia", "Twin City" )) %>%
summarize(total = mean(AvgHi_ACT_Composite_Score_2016)) %>%
arrange(desc(total)) %>%
ggplot(aes(reorder(SCHOOL_ZIP,total),total)) + geom_col() + coord_flip() +
labs(title="2016 Average ACT Composite Scores For 5 Rural Zip Codes in Georgia",
x="Zip Code",
y="2016 Average ACT Composite Score")
According to the bar chart, all of the zip codes have an average ACT score of less than 20. Additionally, according to Tableau, all of the zip codes have populations of less than 18,000. Therefore, it appears that the zip codes with the lowest average ACT composite scores are generally the ones with populations lower than 18,000. However, this is not always the case, since we saw earlier that a zip code like 30344 had a population above 18,000 yet a low average score.
My hypothesis that average ACT score is correlated with population appears to be true some of the time, but not always. Throughout my analysis, I came across a variety of exceptions such as 31907, which had a large population but an average ACT composite score of about 17. This exception, as well as several others that I discovered, indicates that population cannot necessarily predict the average ACT score for a zip code. As mentioned throughout my analysis, income may be a better predictor of average ACT score than population. I think the reason this correlation does not always hold is that median household income does not always track population. There are zip codes in Georgia that have large populations but relatively low median household incomes. There did not appear to be much of a difference in ACT scores between zip codes in more urban areas and zip codes in more rural areas. This was quite surprising, as urban areas are generally thought to have more resources than rural areas. As found earlier, population does not always predict the average ACT score for a zip code in Georgia. Some zip codes may have higher median household incomes, which often indicates a larger population, but not always.
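The comparison above relies on reading charts and looking up individual zip codes. If exact populations were available in a table, the relationship could be summarized with a rank correlation. The sketch below is only an illustration: both data frames hold made-up numbers, and `zip_scores`, `zip_population`, `zip_avg`, and `population` are hypothetical names that do not exist in the data set.

```r
library(dplyr)

# Made-up per-zip averages and populations; not real Georgia data.
zip_scores <- data.frame(
  SCHOOL_ZIP = c("30001", "30002", "30003", "30004", "30005"),
  zip_avg    = c(17.0, 19.5, 21.0, 18.0, 23.5),
  stringsAsFactors = FALSE
)
zip_population <- data.frame(
  SCHOOL_ZIP = c("30001", "30002", "30003", "30004", "30005"),
  population = c(4000, 12000, 30000, 56000, 88000),
  stringsAsFactors = FALSE
)

# Keep only zip codes that appear in both tables
zip_joined <- inner_join(zip_scores, zip_population, by = "SCHOOL_ZIP")

# Spearman's rank correlation: ranges from -1 to 1 and is less sensitive
# than Pearson's to a few very large zip codes
rho <- cor(zip_joined$population, zip_joined$zip_avg, method = "spearman")
print(rho)
```

A value near 1 would support the hypothesis, a value near 0 would not, and exceptions like the large but low-scoring zip codes discussed above pull the value down.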