I read in the data from the Atlanta open data portal and my data set has 476 observations.
library(readr)
High_Schools <- read_csv("High_Schools.csv")
Hypothesis: there is a positive correlation between average ACT score by zip code and population. The higher a zip code's average ACT score, the higher its population.
To test my hypothesis, I created a map in Tableau showing the average ACT scores by zip code in Georgia for 2016 alongside population by zip code. I am using a data set about high schools from the Atlanta open data portal (Atlanta Open Data). Each row of the data set is a public school in Georgia. Most public schools in Georgia are in the data set, but not all. I will use the average high school ACT composite score and zip code variables to conduct statistical analysis in R. Additionally, I will use the school city variable to determine which cities have the most schools in the data set and to compare zip codes within cities, to see whether more urban zip codes score differently from more rural ones.
The average composite ACT score for a student in the U.S. was 20.8 (College Raptor). I therefore classify any zip code with an average below 20.8 as underperforming and any zip code with an average above 20.8 as overperforming. I will generate bar charts to see which zip codes perform lowest and highest, which will help me determine whether there is a correlation between average ACT composite score and population. I will also analyze how zip codes within a major city such as Atlanta perform in comparison to those in rural cities, to see whether zip codes in major cities differ from zip codes in small cities. Tableau only gives population ranges for the zip codes, so I will often use sources such as Georgia Demographics for more precise population estimates. Additionally, population estimates of zip codes for 2016 are very hard to find online and in Tableau, so I will use the most recent data available.
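The classification rule described above can be sketched in a few lines of R. This is only an illustration: the zip labels and scores in `zip_scores` are made up, and the column name `performance` is a hypothetical choice, not something from the data set.

```r
library(dplyr)
library(tibble)

# National average ACT composite score cited in the text
national_avg <- 20.8

# Hypothetical zip-level averages, made up for illustration only
zip_scores <- tibble(
  SCHOOL_ZIP = c("zip_a", "zip_b", "zip_c"),
  total = c(17.2, 20.8, 23.1)
)

# Label each zip code relative to the national average
classified <- zip_scores %>%
  mutate(performance = case_when(
    total < national_avg ~ "underperforming",
    total > national_avg ~ "overperforming",
    TRUE ~ "at the national average"
  ))

print(classified)
```

A zip code that lands exactly on 20.8 is given its own label here, since the text only defines the categories for values strictly below or above the average.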
Based on the map, the larger circles are generally in zip codes with larger populations. However, there are exceptions, such as 30213, which has an average ACT composite score of about 17 and a population of over 18,000 people. Conversely, 30622 has an average ACT composite score of 23 but a population of less than 18,000 people. The map also shows that zip codes with populations of less than 18,000 people generally do not have an average ACT score higher than 20. An exception is 31516, which has an average ACT composite score of about 21. While larger populations go with higher average ACT scores in a large number of cases, income may be a better predictor.
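The visual impression from the map could also be checked numerically if exact populations were attached to each zip code. The sketch below assumes a hypothetical table `zip_summary` with made-up values in both columns. The data set itself has no population column, so in practice those figures would have to be collected from an outside source first.

```r
library(tibble)

# Hypothetical zip-level table: both columns hold made-up values
zip_summary <- tibble(
  zip = c("zip_a", "zip_b", "zip_c", "zip_d", "zip_e", "zip_f"),
  avg_act = c(15.1, 17.0, 19.4, 21.0, 23.0, 22.5),
  population = c(3000, 56000, 12000, 4500, 15000, 88000)
)

# Spearman's rank correlation compares orderings rather than raw values,
# so a few very large zip codes do not dominate the result
rho <- cor(zip_summary$avg_act, zip_summary$population, method = "spearman")

print(rho)
```

A value near 1 would support the hypothesis, a value near 0 would suggest little relationship, and exceptions like the ones noted above pull the coefficient toward 0.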
I wrote code below to take the mean of the average 2016 ACT composite scores by zip code. I used the group_by function to group the schools by zip code and the summarize function to take the mean of the average 2016 ACT composite scores within each zip code. The arrange function sorts the means from highest to lowest. The filter function tells R to disregard zip codes whose mean is missing, which happens whenever a school in the zip code has no reported 2016 score. The tail function then gives me the lowest 12 averages.
library(tidyverse)
High_Schools %>%
group_by(SCHOOL_ZIP) %>%
summarize(total = mean(AvgHi_ACT_Composite_Score_2016)) %>%
arrange(desc(total)) %>%
filter(total > 1) %>%
tail(12) %>%
ggplot(aes(reorder(SCHOOL_ZIP,total),total)) + geom_col() +
labs(title="Bottom 12 GA Zip Codes in Terms of 2016 Average ACT Composite Score",
x="Zip Code",
y="2016 Average ACT Composite Score")
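The way missing values move through the pipeline above is easy to overlook, so here is a small sketch on made-up data. The zip labels and scores are hypothetical. The `na.rm = TRUE` variant is an alternative offered for comparison, not the approach used in this analysis.

```r
library(dplyr)
library(tibble)

# Hypothetical schools: zip_b has one school with a missing score
schools <- tibble(
  SCHOOL_ZIP = c("zip_a", "zip_a", "zip_b", "zip_b"),
  AvgHi_ACT_Composite_Score_2016 = c(18, 20, 22, NA)
)

# mean() returns NA if any school in the group is missing a score,
# and filter(total > 1) then drops that whole zip code
strict <- schools %>%
  group_by(SCHOOL_ZIP) %>%
  summarize(total = mean(AvgHi_ACT_Composite_Score_2016)) %>%
  filter(total > 1)

# Alternative: average only the reported scores in each zip code
lenient <- schools %>%
  group_by(SCHOOL_ZIP) %>%
  summarize(total = mean(AvgHi_ACT_Composite_Score_2016, na.rm = TRUE)) %>%
  filter(!is.nan(total))

print(strict)
print(lenient)
```

With the first version a zip code is excluded as soon as one of its schools has no score. The second keeps it and averages whatever was reported, which can change which zip codes appear in a top or bottom 12.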
According to 30043 Population, 30043 has the highest population at 90,675, while 30664 has the lowest at just 28 people. This wide range gives a good idea of how much populations can differ between urban and rural zip codes in Georgia.
According to the bar chart, 31087 had the lowest mean average ACT composite score for 2016 in Georgia. The next three lowest zip codes were 31907, 39840, and 31044. 31087, which had an average score of about 15, has a population of only 5,010-18,000 people according to the map in Tableau. 31907 falls into the range of 20,100-124,000 people, with a population of about 56,000 (31907 population). The population of 39840 was about 5,000 (39840 population), which is closer to what we would expect if smaller populations go with lower average ACT scores. 31044 has a population in the range of 1,640-5,010 according to Tableau. Since one of the four lowest-scoring zip codes has a population above 50,000, income may be a better predictor than population. However, a large number of the zip codes that are underperforming in terms of ACT scores do have small populations.
I wrote code below to generate a bar chart showing the 12 zip codes with the highest average ACT composite scores in 2016. I grouped the data by zip code and used the summarize function to calculate the mean for each zip code. After the means are sorted in descending order, the head function tells R to keep the 12 zip codes with the highest averages.
High_Schools %>%
group_by(SCHOOL_ZIP) %>%
summarize(total = mean(AvgHi_ACT_Composite_Score_2016)) %>%
arrange(desc(total)) %>%
head(12) %>%
ggplot(aes(reorder(SCHOOL_ZIP,total),total)) + geom_col() + coord_flip() +
labs(title="Top 12 GA Zip Codes in Terms of 2016 Average ACT Composite Score",
x="Zip Code",
y="2016 Average ACT Composite Score")
According to the bar chart, the zip codes with the 5 highest average ACT composite scores in 2016 were 30044, 30022, 30062, 30269, and 30005. According to the map in Tableau, all 5 are in zip codes with at least 18,000 people. In particular, 30044 has a very high population of 88,040 people (30044 population). 30269, the fifth-highest zip code, has a population of 35,904 (30269 population), which is considerably lower than that of 30044. However, it is still higher than the populations of most of the zip codes that perform worst in terms of average ACT composite scores. In general, the highest-performing zip codes appear to have populations greater than 18,000 people.
I wrote code below to determine which cities have the most high schools represented in the data set. I used the group_by function to group the schools by city. The count function determines how many high schools from the data set are in each city and sorts the cities from most to fewest. The head function tells R to print only the top 12 cities by number of high schools in the data set.
High_Schools %>%
group_by(SCHOOL_CITY) %>%
count(SCHOOL_CITY, sort=TRUE) %>%
head(12)
## # A tibble: 12 x 2
## # Groups: SCHOOL_CITY [12]
## SCHOOL_CITY n
## <chr> <int>
## 1 Atlanta 28
## 2 Augusta 13
## 3 Savannah 12
## 4 Columbus 10
## 5 Lawrenceville 10
## 6 Marietta 9
## 7 Macon 8
## 8 Douglasville 6
## 9 Gainesville 6
## 10 Stone Mountain 6
## 11 Conyers 5
## 12 Cumming 5
Based on the table, the biggest cities in Georgia are the most represented among the schools in the data set. Atlanta is by far the biggest, with a population of about 479,000 (Atlanta population). Because it is the state’s largest city, I will analyze Atlanta’s zip codes later.
I wrote code below to determine which cities have the fewest high schools represented in the data set. The group_by function groups the results by school city, and the count function calculates the number of high schools in each city. The filter function tells R to list only the cities that have 1 school in the data set.
High_Schools %>%
group_by(SCHOOL_CITY) %>%
count(SCHOOL_CITY, sort=TRUE) %>%
filter(n==1)
## # A tibble: 160 x 2
## # Groups: SCHOOL_CITY [160]
## SCHOOL_CITY n
## <chr> <int>
## 1 Acworth 1
## 2 Adairsville 1
## 3 Adel 1
## 4 Alamo 1
## 5 Alma 1
## 6 Austell 1
## 7 Avondale Estates 1
## 8 Baconton 1
## 9 Barnesville 1
## 10 Baxley 1
## # … with 150 more rows
Based on the table, there are 160 cities in the data set that have only 1 school. Most of these cities, such as Baxley and Bremen, appear to be small towns. Baxley’s population is about 4,697 and Bremen’s is 6,311 (Georgia Demographics). However, there are some larger cities in this list of 160, such as Mableton. Mableton has a population of 40,464, which makes it the 21st largest city in Georgia according to Mableton Population. While many of these cities appear to be small towns, some are not, which is probably due to missing data, since we would expect cities with higher populations to have more schools in the data set. Generally, the more high schools a city has in this data set, the higher its population, but as mentioned there are some cases where this is not true.
I wrote code below to show a bar chart of all the zip codes in Atlanta that are represented in the data set with their corresponding average ACT scores. The group_by function groups the schools by zip code, and the first filter function keeps only schools in Atlanta. The second filter function tells R to exclude any Atlanta zip code containing a high school with no reported 2016 average ACT score.
High_Schools %>%
group_by(SCHOOL_ZIP) %>%
filter(SCHOOL_CITY=="Atlanta") %>%
summarize(total = mean(AvgHi_ACT_Composite_Score_2016)) %>%
filter(total>1) %>%
arrange(desc(total)) %>%
ggplot(aes(reorder(SCHOOL_ZIP,total),total)) + geom_col() + coord_flip() +
labs(title="2016 Average ACT Composite Scores For Zip Codes in Atlanta",
x="Zip Code",
y="2016 Average ACT Composite Score")
According to the bar chart, 30345 has the highest 2016 average ACT composite score of all the zip codes in Atlanta. According to the map in Tableau, 30345 has a population of 18,000-123,000 people. However, 30316, which has an average ACT score of about 17.5, has a population of about 33,000 (30316 population). Another example is 30314, which is also underperforming in terms of average ACT scores and has a population of about 24,000 (30314 population). Therefore, the population of a zip code does not appear to be a consistent factor in determining its average ACT score. Zip codes with populations greater than 18,000 can be underperforming.
I wrote code below to create a table that shows the cities in Georgia with the lowest average ACT composite scores. I used the group_by function to group my data by school city, and the drop_na function eliminates schools with missing scores. The summarize function averages the ACT composite scores for each city. The count function adds a column n to the table; because summarize has already reduced the data to one row per city, n is 1 for every city and the filter on n keeps all rows. The arrange function shows the cities with the lowest average ACT composite scores first.
High_Schools %>%
group_by(SCHOOL_CITY) %>%
drop_na(AvgHi_ACT_Composite_Score_2016) %>%
summarize(total=mean(AvgHi_ACT_Composite_Score_2016)) %>%
count(SCHOOL_CITY, total, sort = TRUE) %>%
filter(n==1) %>%
arrange(total)
## # A tibble: 226 x 3
## SCHOOL_CITY total n
## <chr> <dbl> <int>
## 1 Sparta 15.1 1
## 2 Cuthbert 15.5 1
## 3 Jeffersonville 15.6 1
## 4 Talbotton 15.6 1
## 5 Greenville 15.8 1
## 6 Manchester 15.9 1
## 7 Dawson 16 1
## 8 Warrenton 16 1
## 9 Ashburn 16.2 1
## 10 Clarkston 16.2 1
## # … with 216 more rows
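Since every row in the table above has n equal to 1, the same ranking can be produced without the count and filter steps. The sketch below shows one way to do that on made-up data. The city labels and scores are hypothetical, and `n_schools` is an illustrative column name that reports how many schools went into each city average.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical schools: city_c has no reported score
schools <- tibble(
  SCHOOL_CITY = c("city_a", "city_a", "city_b", "city_c"),
  AvgHi_ACT_Composite_Score_2016 = c(16, 18, 15, NA)
)

city_means <- schools %>%
  drop_na(AvgHi_ACT_Composite_Score_2016) %>%   # remove schools without a score
  group_by(SCHOOL_CITY) %>%
  summarize(
    total = mean(AvgHi_ACT_Composite_Score_2016),  # city average
    n_schools = n()                                # schools behind the average
  ) %>%
  arrange(total)                                   # lowest averages first

print(city_means)
```

Keeping the number of schools next to each average makes it easier to see when a city's score rests on a single school.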
According to the table, the 10 cities with the lowest average ACT composite scores among their schools in the data set all have an average ACT composite score of less than 17. The 10 lowest scores all appear to be in small towns, suggesting a possible trend in which the zip codes that perform worst on the ACT have the smallest populations. I will pipe the 10 lowest cities into a bar chart by zip code and look at Tableau for the population range of each zip code.
I wrote code below to show the zip codes within the 10 cities with the lowest average ACT scores in order to compare their populations to my map in Tableau. I used the group_by function to group the data by zip code, and the filter function tells R to include only the 10 school cities specified. The summarize function takes the mean of the average composite scores for each zip code within the 10 cities. The filter function is used again to tell R to eliminate any missing values. Finally, the ggplot function creates a bar chart.
High_Schools %>%
group_by(SCHOOL_ZIP) %>%
filter(SCHOOL_CITY %in% c("Sparta", "Cuthbert", "Jeffersonville", "Talbotton", "Greenville", "Manchester", "Dawson", "Warrenton", "Ashburn", "Clarkston")) %>%
summarize(total=mean(AvgHi_ACT_Composite_Score_2016)) %>%
arrange(total) %>%
filter(total>1) %>%
ggplot(aes(reorder(SCHOOL_ZIP,total),total)) + geom_col() + coord_flip() +
labs(title="2016 Average Scores For Zip Codes in Cities with the Lowest Scores",
x="Zip Code",
y="2016 Average ACT Composite Score")
The bar chart shows the zip codes within the 10 cities that have the lowest average ACT composite scores. Not all zip codes within the 10 cities are included, since a couple of cities had missing values for the 2016 ACT composite score variable. Furthermore, according to the map in Tableau, none of the zip codes in the bar chart has a population greater than 18,000, and most have 5,010 people or fewer. Therefore, it appears that cities with the lowest ACT scores tend to have low populations. However, as the average ACT score increases, we start to see some cities with large populations, which is likely due to income. When comparing zip codes within these 10 cities to the zip codes within the city of Atlanta, there does not appear to be a substantial difference in performance, since some zip codes in Atlanta are also underperforming. Again, this most likely goes back to median household income.
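The comparison between Atlanta and the lowest-scoring cities is made by eye from the two bar charts. A rough sketch of how the same comparison could be summarized in a table is shown below. The zip labels and scores are made up, "Sparta" and "Dawson" stand in for the 10 cities, and the `group` column is a hypothetical helper.

```r
library(dplyr)
library(tibble)

# Hypothetical zip-level averages, made up for illustration only
zip_scores <- tibble(
  SCHOOL_ZIP = c("zip_a", "zip_b", "zip_c", "zip_d"),
  SCHOOL_CITY = c("Atlanta", "Atlanta", "Sparta", "Dawson"),
  total = c(17, 22, 15, 16)
)

comparison <- zip_scores %>%
  # Split the zip codes into Atlanta versus everything else in the table
  mutate(group = if_else(SCHOOL_CITY == "Atlanta", "Atlanta", "Lowest-scoring cities")) %>%
  group_by(group) %>%
  summarize(
    zip_codes = n(),
    mean_score = mean(total),
    min_score = min(total),
    max_score = max(total)
  )

print(comparison)
```

Reporting the minimum and maximum alongside the mean shows whether the two groups overlap, which is the pattern described above where some Atlanta zip codes score as low as those in the small cities.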
My hypothesis that average ACT score is correlated with population appears to hold some of the time, but not always. Throughout my analysis, I came across a variety of exceptions, such as 31907, which had a large population (31907 population) but an average ACT composite score of about 17. This exception, along with several others that I discovered, indicates that population cannot necessarily predict the average ACT score for a zip code. As mentioned throughout my analysis, income may be a better predictor of average ACT score than population. I think the correlation does not always hold because median household income does not always track population: there are zip codes in Georgia that have large populations but relatively low median household incomes. There did not appear to be much of a difference in ACT scores between zip codes in more urban areas and zip codes in more rural areas. This was quite surprising, since urban areas tend to have more resources than rural areas. Some zip codes may have higher median household incomes, which often, but not always, goes with a larger population.