library(tidyverse)
setwd("C:/Users/chank/OneDrive - University of Illinois Chicago/PA 343 - Fusi/Week 9/Assignment")
data1 = read_csv("data1.csv")
data = read_csv("data.csv")
Each row (unit of observation) represents one school, along with three characteristics whose values are specific to that school.
The three characteristics are:
data1 = pivot_wider(data,
id_cols = c("uid", "community_schooltype", "county", "api"),
names_from = "variable_name",
values_from ="percentage") %>%
separate(community_schooltype, into = c("community", "schooltype"))
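To make the reshaping step concrete, here is a minimal sketch on a made-up long-format table. The values, the county names, and the "Urban_E" layout of the combined column are assumptions for illustration only; the real data.csv may use a different separator.

```r
library(tidyr)

# Toy long-format data: one row per school and variable.
# All values are invented for illustration.
toy_long <- data.frame(
  uid = c(1, 1, 1, 2, 2, 2),
  community_schooltype = rep(c("Urban_E", "Rural_H"), each = 3),
  county = rep(c("A", "B"), each = 3),
  api = rep(c(700, 650), each = 3),
  variable_name = rep(c("meals", "colgrad", "fullqual"), times = 2),
  percentage = c(40, 20, 90, 60, 10, 85)
)

# Spread the three variables into their own columns: one row per school.
toy_wide <- pivot_wider(toy_long,
                        id_cols = c("uid", "community_schooltype", "county", "api"),
                        names_from = "variable_name",
                        values_from = "percentage")

# Split the combined column at the non-alphanumeric character (here "_").
toy_wide <- separate(toy_wide, community_schooltype,
                     into = c("community", "schooltype"))
toy_wide
```

After this step the toy table has one row per uid, with meals, colgrad, and fullqual as separate columns, which is the shape the plots below rely on.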
data1 %>%
ggplot() +
geom_density(mapping = aes(x = api)) +
labs(title = "Distribution of the Academic Performance Index",
x ="Academic Performance Index - API")
summary(data1$api)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   336.8   593.6   684.1   680.9   770.1   958.5
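The density plot is roughly bell-shaped, which is consistent with the values returned by the “summary” function: the mean (680.9) and the median (684.1) are nearly equal, as expected for a roughly symmetric distribution.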
data1 %>%
ggplot() +
geom_density(mapping = aes(x = meals)) +
geom_vline(xintercept = mean(data1$meals), color = "green") +
geom_vline(xintercept = median(data1$meals), color = "red") +
labs(title = "Distribution of % of Students Eligible for Subsidized Meals",
x = "% of Students Eligible for Subsidized Meals")
summary(data1$meals)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
##  0.00659 25.14071 49.40577 49.75635 74.44822 99.98998
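This plot is not bell-shaped like the previous one; however, it is still consistent with the values returned by “summary”: the quartiles fall near 25%, 50%, and 75%, which suggests that schools are spread fairly evenly across the whole 0 to 100% range.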
data1 %>%
ggplot() +
geom_density(mapping = aes(x = colgrad)) +
labs(title = "Distribution of % of Parents with a College Degree",
x = "% of Parents with College Degrees")
summary(data1$colgrad)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
##  0.06105  9.71679 18.12907 20.84864 29.44429 86.48472
The density peaks at around 12.5% of parents with college degrees and then falls off substantially as that percentage increases, giving a right-skewed distribution (the mean, 20.85, sits above the median, 18.13).
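The location of the peak can be read off the density estimate directly instead of by eye. The helper below is a hypothetical sketch (density_peak is not part of the assignment); it relies on base R's density(), which as far as I know uses the same default bandwidth rule as geom_density(). It is shown on simulated data, not on the school data.

```r
# Return the x value at which the kernel density estimate is highest.
density_peak <- function(x) {
  d <- density(x, na.rm = TRUE)
  d$x[which.max(d$y)]
}

# Illustration on simulated data (not the school data):
set.seed(1)
toy <- rnorm(1000, mean = 5)
peak <- density_peak(toy)
peak
```

On the real data, the equivalent call would be density_peak(data1$colgrad).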
data1 %>%
ggplot() +
geom_density(mapping = aes(x =fullqual)) +
geom_vline(xintercept = mean(data1$fullqual),
color = "green") +
geom_vline(xintercept = median(data1$fullqual), color = "red") +
labs(title = "Distribution of % of Fully Qualified Teachers",
x = "% of Fully Qualified Teachers")
summary(data1$fullqual)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   15.12   81.61   91.82   87.51   97.48  100.00
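The density rises steeply toward 100%, so the distribution is left-skewed: most schools have a high share of fully qualified teachers (median 91.82%), while a long tail of lower values pulls the mean down to 87.51%.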
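A rough numeric check on the shape of the four distributions is to compare each mean with its median. The sketch below copies the figures from the summary() outputs in this report and scales the mean-median gap by the interquartile range. This scaled gap is an informal heuristic chosen here for illustration, not a standard skewness statistic.

```r
# Quartiles, medians, and means copied from the summary() outputs above.
summaries <- data.frame(
  variable = c("api", "meals", "colgrad", "fullqual"),
  q1       = c(593.6, 25.14071, 9.71679, 81.61),
  median   = c(684.1, 49.40577, 18.12907, 91.82),
  mean     = c(680.9, 49.75635, 20.84864, 87.51),
  q3       = c(770.1, 74.44822, 29.44429, 97.48)
)

# Gap between mean and median, scaled by the interquartile range.
# Negative: mean pulled toward low values; positive: toward high values.
summaries$gap <- (summaries$mean - summaries$median) /
  (summaries$q3 - summaries$q1)

summaries[, c("variable", "gap")]
```

Gaps close to zero (api, meals) point to a roughly symmetric shape, a clearly positive gap (colgrad) to a right skew, and a clearly negative gap (fullqual) to a left skew.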
data1 %>%
ggplot() +
geom_bar(mapping = aes(x = community)) +
labs(title = "Number of Schools by Community Type",
x = "Community Type: Urban, Suburban or Rural",
y = "Number of Schools")
This bar plot shows that suburban neighborhoods have the largest number of schools. Urban neighborhoods have about half as many schools as suburban ones, and rural neighborhoods have about half as many as urban ones (roughly 25% of the suburban count).
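The "about half" comparisons can be checked against exact shares rather than bar heights. This is a minimal sketch with a hypothetical helper, category_share, applied to a made-up vector; the counts are not the real ones.

```r
# Share of observations in each category, as percentages.
category_share <- function(x) {
  round(100 * prop.table(table(x)), 1)
}

# Made-up example vector (not the real counts):
toy_community <- c(rep("Suburban", 4), rep("Urban", 2), "Rural")
shares <- category_share(toy_community)
shares
```

On the real data, category_share(data1$community) would give the percentages behind the bars; the same call works for data1$schooltype.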
data1 %>%
ggplot() +
geom_bar(mapping = aes(x = schooltype)) +
labs(title = "Number of Schools by Type",
x = "School Type: Elementary, Middle or High School",
y = "Number of Schools")
This plot shows that elementary schools make up the great majority of schools in the data set.
data1 %>%
ggplot() +
geom_point(mapping = aes(x = colgrad,
y = api)) +
labs(title = "Correlation between the API and Parents' Education",
x = "% of School Parents with College Degrees",
y = "Academic Performance Index - API")
cor(data1$api, data1$colgrad)
## [1] 0.6894666
Based on this plot, a school's academic performance index tends to increase as the percentage of parents with college degrees increases; the correlation of about 0.69 indicates a fairly strong positive linear association.
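One hedged way to read the coefficient is to square it: in a simple straight-line fit of API on colgrad, the squared correlation is the share of the variation in API that the line accounts for. A small worked calculation, using the value reported by cor():

```r
r <- 0.6894666      # value returned by cor(data1$api, data1$colgrad)
r_squared <- r^2    # share of API variance captured by a straight-line fit
round(r_squared, 3)
```

That is a little under half of the variation, and it describes association only; it does not show that parents' education causes higher scores.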
school_type = c(
E = "Elementary School",
M = "Middle School",
H = "High School")
data1 %>%
ggplot() +
geom_point(mapping = aes(x = colgrad,
y = api,
color = community)) +
labs(title = "Correlation Between the API and Parents' Education",
x = "% of School Parents with College Degrees",
y ="Academic Performance Index - API") +
facet_wrap(~ schooltype, labeller = labeller(schooltype = school_type))
Breaking the plot down by school type (E, M, H) shows that the relationship is similar across the three types; however, the panel for elementary schools (E) is much denser, as one would expect from the previous bar plot.
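Whether the relationship really is similar across school types can be checked by computing the correlation within each panel. The helper below, cor_by_group, is a hypothetical base R sketch, and the toy data frame is invented so that the block runs on its own.

```r
# Correlation between two columns, computed separately within each group.
cor_by_group <- function(df, group, x, y) {
  sapply(split(df, df[[group]]), function(d) cor(d[[x]], d[[y]]))
}

# Made-up data for illustration only:
toy <- data.frame(
  schooltype = rep(c("E", "H"), each = 4),
  colgrad    = c(10, 20, 30, 40, 10, 20, 30, 40),
  api        = c(600, 650, 700, 750, 700, 690, 660, 640)
)
toy_cor <- cor_by_group(toy, "schooltype", "colgrad", "api")
toy_cor
```

On the real data, the call would be cor_by_group(data1, "schooltype", "colgrad", "api"), giving one coefficient per facet.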
data1 %>%
arrange(desc(api)) %>%
mutate(ranking = row_number()) %>% # rank 1 = highest API; no hard-coded row count
filter(ranking <= 10 | ranking > n() - 10) %>% # keep the top 10 and bottom 10
mutate(position = if_else(ranking <= 10, "top", "bottom")) %>%
ggplot() +
geom_bar(mapping = aes(x = api,
y = as.character(uid),
fill = position),
stat = 'identity') +
labs(title = "The Gap Between Top and Lowest Performers",
subtitle = "Top and Bottom 10 Schools by API",
x = "Academic Performance Index - API",
y = "School Unique Identifier") +
geom_vline(xintercept = mean(data1$api), color = "red")
With the 10 highest-performing schools shown in blue and the 10 lowest-performing schools in red, you can see how both groups compare against the average (mean) API across all schools in the data set, marked by the vertical line.
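The size of the gap can also be stated as a single number: the mean API of the top 10 schools minus the mean API of the bottom 10. This is a minimal sketch; top_bottom_gap is a hypothetical helper, demonstrated on the integers 1 to 100 rather than on the school data.

```r
# Difference between the mean of the n largest and the n smallest values.
top_bottom_gap <- function(x, n = 10) {
  s <- sort(x, decreasing = TRUE)  # sort() drops missing values by default
  mean(head(s, n)) - mean(tail(s, n))
}

# Toy check: the top ten of 1:100 average 95.5, the bottom ten average 5.5.
toy_gap <- top_bottom_gap(1:100)
toy_gap
```

On the real data, the call would be top_bottom_gap(data1$api).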
data1 %>%
ggplot() +
geom_point(mapping = aes(x = fullqual,
y = api))
The scatter plot suggests that schools with a higher percentage of fully qualified teachers tend to have a higher academic performance index.