Assignment 1: First Data Viz Using ggplot

Introduction

This dataset includes sewer outflow condition from each county in New York State with different demographic data such as educational status and estimated obesity counts. In this assignment, I firstly created two pivot tables to understand the overall picture of the dataset by chooding the variables I will be using in the plots.

Counties with different variables

I created a table with counties as the main columns and with permitted outflow and demographic dataset to see the basic differences between counties in the NY State.

df %>%
  tab_cells('Number of Permitted Outfalls' = df$`Number.of.Permitted.Outfalls`,
            'High School Grad'= df$`TotalHighSchGrad`,
            'Total College/Associated'= df$`TotalSomeCollegeOrAssociate`,
            'Bachelor or Higher'= df$`Total_BachelorOrHigher`,
            'Est Count Obese'= df$`Est_Count_Obese`) %>%
  tab_cols(County,total()) %>%
  tab_stat_median() %>% tab_pivot() %>%
  set_caption("Statisctics of Sewer Outflows per County") %>%
  
  htmlTable(.,
            css.cell = c("width: 250",
                         rep("width:70", ncol(.) - 1)))

																																					#Total
Statisctics of Sewer Outflows per County
	Albany	Bronx	Broome	Cayuga	Chautauqua	Chemung	Clinton	Cols	Columbia	Dutchess	Erie	Essex	Franklin	Greene	Herkimer	Jefferson	Kings	Madison	Monroe	Montgomery	Niagara	Oneida	Onondaga	Orange	Orleans	Oswego	Q:X	Queens	Rensselaer	Richmond	Saratoga	Ulster	Warren	Washington	Westchester	Y:AC
Number of Permitted Outfalls
Median	12	33	9	16	1	11	11		8	6	52	2	2	6	3	16	79	1	6	4	9	47	62	13	13	6		44	49	36	4	13	1	11	4		33
High School Grad
Median	10996	42371	6977	2188	3733	3087	3295		1493	9600	24840	1116	1947	1870	1766	7210	62380	2886	19677	1403	5821	5951	11059	12725	1333	3724		48475	5244	10689	4521	4882	1480	2054	21973		11059
Total College/Associated
Median	22975	70140	14368	2993	7040	3110	5716		1635	16395	42971	1056	1834	1288	2752	5773	98458	5680	40056	1609	8850	11008	26851	18665	1343	8111		96136	7965	19827	9473	9050	2423	1787	37837		26851
Total Bachelor or Higher
Median	8686	12449	3789	425	923	401	1113		584	3578	13954	277	283	205	312	858	42466	609	10717	312	2200	2432	6724	4183	164	845		31892	2365	6882	2616	1799	434	329	16302		6882
Est Count Obese
Median	135384	659701	77727	39274	62623	40132	43489		29671	134689	436528	20317	25783	23323	30622	56361	1030497	35194	363611	24712	101676	113786	217439	174116	20485	57840		932363	76996	198353	107124	90721	30605	33260	383940		217439

Discharge Activation Type

The second table I created is based on three different type of seweage outflow facility with the counts of permitted and actual events and other demographic data.

df %>%
  tab_cells('Number of Permitted Outfalls' = df$`Number.of.Permitted.Outfalls`,
            'Event Count' = df$`Event.Count`,
            'High School Grad'= df$`TotalHighSchGrad`,
            'Total College/Associated'= df$`TotalSomeCollegeOrAssociate`,
            'Bachelor or Higher'= df$`Total_BachelorOrHigher`,
            'Est Count Obese'= df$`Est_Count_Obese`) %>%
  tab_cols('Discharge Activation Type' = df$`Discharge.Activation.Type` ,total()) %>%
  tab_stat_median() %>% tab_pivot() %>%
  set_caption("Statisctics of Sewer Outflows per Discharge Activation Type") %>%
  
  htmlTable(.,
            css.cell = c("width: 250",
                         rep("width:70", ncol(.) - 1)))

				#Total
Statisctics of Sewer Outflows per Discharge Activation Type
	Model	Monitoring	Observation
Number of Permitted Outfalls
Median	44	16	12	33
Event Count
Median	46706	24287	13283	20476
High School Grad
Median	24840	12725	5821	11059
Total College/Associated
Median	42971	19827	8850	26851
Total Bachelor or Higher
Median	12449	6882	2200	6882
Est Count Obese
Median	436528	198353	101676	217439

Event count based on different types of facility

As we can see in the graph below, with the largest base, no prise that model’s event count is way more than the other two types of the facilities.

library(ggplot2)

g<-ggplot(df, aes(Discharge.Activation.Type,Event.Count))+
  geom_bar(stat = "identity", color = "pink")+
  ggtitle("Overflow Event Count per Type")+ xlab("Discharge Activation Type") + ylab("Event Count") +
  theme_grey()
g

## Warning: Removed 4 rows containing missing values (position_stack).

## Relationship between number of permitted outfalls and outfall event counts The second chart shows us the number of permitted outfalls and the actual outfall events. Although there’re some outliers with low permitted number but high outfall numbers, from the lm line we can tell that they’re basically showing a positive relationship.

g2<-ggplot(df, aes(Number.of.Permitted.Outfalls,Event.Count))+
  geom_point(alpha = "0.5", color = "red")+
  geom_smooth(method ="lm")+
  coord_cartesian() + scale_color_gradient() +
  ggtitle("Relationship between number of permitted outfalls and outfall event counts")+ xlab("Number of Permitted Outfalls") + ylab("Event Count") +
  theme_grey()
g2

## Don't know how to automatically pick scale for object of type labelled/integer. Defaulting to continuous.

## Warning: Removed 4 rows containing non-finite values (stat_smooth).

## Warning: Removed 4 rows containing missing values (geom_point).

Relationship between outfall event counts and estimated obesity population

With demographic information included, I was curious to see if those variables are big influencers to the outfalls number.

#obesity
g3<-ggplot(df, aes(Est_Count_Obese,Event.Count))+
  geom_point(alpha = "0.5", color = "green")+
  geom_smooth(method ="lm")+
  coord_cartesian() + scale_color_gradient() +
  ggtitle("Relationship between outfall event counts and estimated obesity population")+ xlab("Estimiated Obesity Population") + ylab("Event Count") +
  theme_grey()
g3 + annotate("text", x = 250000, y = 100000, label = "positive relationship")

## Don't know how to automatically pick scale for object of type labelled/integer. Defaulting to continuous.

## Warning: Removed 4 rows containing non-finite values (stat_smooth).

## Warning: Removed 4 rows containing missing values (geom_point).

#education: highschool
df$percentage_HS <- df$TotalHighSchGrad/df$PopAged18to24
g4<-ggplot(df, aes(percentage_HS,Event.Count))+
  geom_point(alpha = "0.5", color = "green")+
  geom_smooth(method ="lm")+
  coord_cartesian() + scale_color_gradient() +
  ggtitle("Hign Shcool Graduate")+ xlab("Bachelor or Higher") + ylab("Event Count") +
  theme_grey()
g4

## Don't know how to automatically pick scale for object of type labelled/integer. Defaulting to continuous.

## Warning: Removed 4 rows containing non-finite values (stat_smooth).

## Warning: Removed 4 rows containing missing values (geom_point).

#education: bachelor or higher
df$percentage_BH <- df$Total_BachelorOrHigher/df$PopAged18to24
g5<-ggplot(df, aes(percentage_BH,Event.Count))+
  geom_point(alpha = "0.5", color = "green")+
  geom_smooth(method ="lm")+
  coord_cartesian() + scale_color_gradient() +
  ggtitle("Bachelor or Higher")+ xlab("Bachelor or Higher") + ylab("Event Count") +
  theme_grey()
g5

## Don't know how to automatically pick scale for object of type labelled/integer. Defaulting to continuous.

## Warning: Removed 4 rows containing non-finite values (stat_smooth).

## Warning: Removed 4 rows containing missing values (geom_point).