Mid-west States Demographics and Diversity Analysis

Table of Contents / Index

1.Introduction
     1.1 Sources
     1.2 Techniques and R libraries
2.Data Exploration
     2.1 Getting data from github
     2.2 Data profiling using summary and hmist libraries
     2.3 Data Observation/Conclusion
3.Data Wrangling
     3.1 Data Observation/Conclusion
4.Data Visualization
     4.1 Area coverage analysis on different States
     4.2 Diversity of population
     4.3 Poverty vs Education
     4.4 Total Population vs Poverty population
     4.5 Below Poverty level on States
5.Conclusion

1.Introduction

   Mid-west States in USA, the orginal dataset contains demographics and poverty data for 5 mid-west states in USA, In our analysis we are focusing on area, education, population,diversity and how it has impact on poverty level

1.1 Sources

   Lookup file contains USA state name and abbrevation from github to perform sql join https://raw.githubusercontent.com/thasleem1/midwest/master/states_lkp.csv

1.2 Techniques and R libraries

   Fetching data from github using RCurl library
   Performed and captured the summary and statistics of mid west states using summary and Hmisc libraries
   Data profiling and its observation
   Subsetting, aggregation of data using sqldf library
   Modifying/Added the new attribute from lookup file (StateName)
   ggplot2 and reshape2 libraries used for data visualization
   corrplot for making correlation plot

2.Data Exploration

   Performing data summary and data profiling to understand the data

2.1 Getting data from github

library(RCurl)
## Loading required package: bitops
#Getting data from github repository
#midwest -> dataset with all data of midwest demographics population statistics
midwest <-read.csv(file = 'C:/Users/aisha/Dropbox/CUNY/CUNY_BridgeClass/midwest/midwest.csv', header = TRUE)




#states_lkp -> dataset having states and its abbrevation
states_lkp <-read.csv(file = 'C:/Users/aisha/Dropbox/CUNY/CUNY_BridgeClass/midwest/states_lkp.csv')
library(sqldf)
## Loading required package: gsubfn
## Loading required package: proto
## Loading required package: RSQLite
#### SQL join with lookup table to make a master dataset
#Performing left outer join to get the StateName from state_lkp.csv file
midwest_states <- sqldf("SELECT midwest.*,states_lkp.StateName from midwest
                             LEFT JOIN states_lkp on midwest.State=states_lkp.StateCode")

2.2 Data profiling using summary and hmist libraries

summary(midwest_states)
##        X            PID            county    state         area        
##  Min.   :  1   Min.   : 561   CRAWFORD:  5   IL:102   Min.   :0.00500  
##  1st Qu.:110   1st Qu.: 670   JACKSON :  5   IN: 92   1st Qu.:0.02400  
##  Median :219   Median :1221   MONROE  :  5   MI: 83   Median :0.03000  
##  Mean   :219   Mean   :1437   ADAMS   :  4   OH: 88   Mean   :0.03317  
##  3rd Qu.:328   3rd Qu.:2059   BROWN   :  4   WI: 72   3rd Qu.:0.03800  
##  Max.   :437   Max.   :3052   CLARK   :  4            Max.   :0.11000  
##                               (Other) :410                             
##     poptotal         popdensity          popwhite          popblack      
##  Min.   :   1701   Min.   :   85.05   Min.   :    416   Min.   :      0  
##  1st Qu.:  18840   1st Qu.:  622.41   1st Qu.:  18630   1st Qu.:     29  
##  Median :  35324   Median : 1156.21   Median :  34471   Median :    201  
##  Mean   :  96130   Mean   : 3097.74   Mean   :  81840   Mean   :  11024  
##  3rd Qu.:  75651   3rd Qu.: 2330.00   3rd Qu.:  72968   3rd Qu.:   1291  
##  Max.   :5105067   Max.   :88018.40   Max.   :3204947   Max.   :1317147  
##                                                                          
##  popamerindian        popasian         popother        percwhite    
##  Min.   :    4.0   Min.   :     0   Min.   :     0   Min.   :10.69  
##  1st Qu.:   44.0   1st Qu.:    35   1st Qu.:    20   1st Qu.:94.89  
##  Median :   94.0   Median :   102   Median :    66   Median :98.03  
##  Mean   :  343.1   Mean   :  1310   Mean   :  1613   Mean   :95.56  
##  3rd Qu.:  288.0   3rd Qu.:   401   3rd Qu.:   345   3rd Qu.:99.07  
##  Max.   :10289.0   Max.   :188565   Max.   :384119   Max.   :99.82  
##                                                                     
##    percblack       percamerindan        percasian        percother      
##  Min.   : 0.0000   Min.   : 0.05623   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.: 0.1157   1st Qu.: 0.15793   1st Qu.:0.1737   1st Qu.:0.09102  
##  Median : 0.5390   Median : 0.21502   Median :0.2972   Median :0.17844  
##  Mean   : 2.6763   Mean   : 0.79894   Mean   :0.4872   Mean   :0.47906  
##  3rd Qu.: 2.6014   3rd Qu.: 0.38362   3rd Qu.:0.5212   3rd Qu.:0.48050  
##  Max.   :40.2100   Max.   :89.17738   Max.   :5.0705   Max.   :7.52427  
##                                                                         
##    popadults          perchsd        percollege        percprof       
##  Min.   :   1287   Min.   :46.91   Min.   : 7.336   Min.   :-184.130  
##  1st Qu.:  12271   1st Qu.:71.33   1st Qu.:14.114   1st Qu.:   2.998  
##  Median :  22188   Median :74.25   Median :16.798   Median :   3.814  
##  Mean   :  60973   Mean   :73.97   Mean   :18.273   Mean   :   4.020  
##  3rd Qu.:  47541   3rd Qu.:77.20   3rd Qu.:20.550   3rd Qu.:   4.949  
##  Max.   :3291995   Max.   :88.90   Max.   :48.079   Max.   :  20.791  
##                                                                       
##  poppovertyknown   percpovertyknown percbelowpoverty percchildbelowpovert
##  Min.   :   1696   Min.   :80.90    Min.   : 2.180   Min.   : 1.919      
##  1st Qu.:  18364   1st Qu.:96.89    1st Qu.: 9.199   1st Qu.:11.624      
##  Median :  33788   Median :98.17    Median :11.822   Median :15.270      
##  Mean   :  93642   Mean   :97.11    Mean   :12.511   Mean   :16.447      
##  3rd Qu.:  72840   3rd Qu.:98.60    3rd Qu.:15.133   3rd Qu.:20.352      
##  Max.   :5023523   Max.   :99.86    Max.   :48.691   Max.   :64.308      
##                                                                          
##  percadultpoverty percelderlypoverty    inmetro          category  
##  Min.   : 1.938   Min.   : 3.547     Min.   :0.0000   AAR    :193  
##  1st Qu.: 7.668   1st Qu.: 8.912     1st Qu.:0.0000   AAU    : 77  
##  Median :10.008   Median :10.869     Median :0.0000   LAR    : 30  
##  Mean   :10.919   Mean   :11.389     Mean   :0.3432   LHR    : 28  
##  3rd Qu.:13.182   3rd Qu.:13.412     3rd Qu.:1.0000   HLU    : 26  
##  Max.   :43.312   Max.   :31.162     Max.   :1.0000   ALU    : 20  
##                                                       (Other): 63  
##      StateName  
##  Illinois :102  
##  Indiana  : 92  
##  Ohio     : 88  
##  Michigan : 83  
##  Wisconsin: 72  
##  Alabama  :  0  
##  (Other)  :  0
podensity_mean <- mean(midwest_states$popdensity) #Mean of population density
print(paste0("Mean of Population density ",podensity_mean))
## [1] "Mean of Population density 3097.74298489245"
podensity_median <- median(midwest_states$popdensity)   #Median of population density
print(paste0("Median of Population density ",podensity_median))
## [1] "Median of Population density 1156.20833"
quartile_popdensity = midwest_states$inmetro
print("Quartile of Population Density")
## [1] "Quartile of Population Density"
quantile(quartile_popdensity) 
##   0%  25%  50%  75% 100% 
##    0    0    0    1    1
#####Describe the key attributes used in the analysis
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
## 
## Attaching package: 'ggplot2'
## The following object is masked _by_ '.GlobalEnv':
## 
##     midwest
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
## 
##     format.pval, units
describe(midwest_states$area, size = "normalsize")
## midwest_states$area 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      437        0       70    0.998  0.03317  0.01522   0.0160   0.0200 
##      .25      .50      .75      .90      .95 
##   0.0240   0.0300   0.0380   0.0530   0.0622 
## 
## lowest : 0.005 0.009 0.010 0.011 0.012, highest: 0.079 0.082 0.089 0.094 0.110
describe(midwest_states$poptotal, size = "normalsize")
## midwest_states$poptotal 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      437        0      435        1    96130   130768     8068    12179 
##      .25      .50      .75      .90      .95 
##    18840    35324    75651   162850   301612 
## 
## 0 (154, 0.352), 50000 (172, 0.394), 1e+05 (45, 0.103), 150000 (24, 0.055),
## 2e+05 (9, 0.021), 250000 (7, 0.016), 3e+05 (6, 0.014), 350000 (3, 0.007),
## 450000 (2, 0.005), 5e+05 (4, 0.009), 550000 (1, 0.002), 7e+05 (1, 0.002),
## 8e+05 (2, 0.005), 850000 (1, 0.002), 950000 (2, 0.005), 1100000 (1,
## 0.002), 1400000 (1, 0.002), 2100000 (1, 0.002), 5100000 (1, 0.002)
describe(midwest_states$popdensity, size = "normalsize")
## midwest_states$popdensity 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      437        0      436        1     3098     4169    273.5    387.3 
##      .25      .50      .75      .90      .95 
##    622.4   1156.2   2330.0   5506.0  10092.1 
## 
## lowest :    85.0500   104.7818   110.6933   113.5128   130.9149
## highest: 39083.3000 54313.0769 60333.9143 63951.6667 88018.3966
describe(midwest_states$popblack, size = "normalsize")
## midwest_states$popblack 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      437        0      306        1    11024    20844        4        9 
##      .25      .50      .75      .90      .95 
##       29      201     1291    12115    25609 
## 
## 0 (375, 0.858), 10000 (26, 0.059), 20000 (13, 0.030), 30000 (5, 0.011),
## 40000 (4, 0.009), 60000 (1, 0.002), 70000 (2, 0.005), 80000 (2, 0.005),
## 1e+05 (1, 0.002), 120000 (1, 0.002), 150000 (1, 0.002), 170000 (1, 0.002),
## 180000 (1, 0.002), 2e+05 (1, 0.002), 350000 (1, 0.002), 850000 (1, 0.002),
## 1320000 (1, 0.002)
describe(midwest_states$popamerindian, size = "normalsize")
## midwest_states$popamerindian 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      437        0      271        1    343.1    496.7     15.0     22.0 
##      .25      .50      .75      .90      .95 
##     44.0     94.0    288.0    730.8   1211.2 
## 
## lowest :     4     6     7     8     9, highest:  3869  3948  6994  8048 10289
describe(midwest_states$popasian, size = "normalsize")
## midwest_states$popasian 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      437        0      274        1     1310     2354       10       16 
##      .25      .50      .75      .90      .95 
##       35      102      401     1666     3429 
##                                                                          
## Value           0   2000   4000   6000   8000  10000  12000  16000  18000
## Frequency     375     39      6      2      4      2      2      1      1
## Proportion  0.858  0.089  0.014  0.005  0.009  0.005  0.005  0.002  0.002
##                                              
## Value       20000  22000  26000  40000 188000
## Frequency       1      1      1      1      1
## Proportion  0.002  0.002  0.002  0.002  0.002
describe(midwest_states$popother, size = "normalsize")
## midwest_states$popother 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      437        0      256        1     1613     3017        5        8 
##      .25      .50      .75      .90      .95 
##       20       66      345     1263     2984 
##                                                            
## Value           0   5000  10000  15000  20000  25000 385000
## Frequency     410     16      4      2      2      2      1
## Proportion  0.938  0.037  0.009  0.005  0.005  0.005  0.002
describe(midwest_states$percollege, size = "normalsize")
## midwest_states$percollege 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      437        0      437        1    18.27    6.478    11.24    12.25 
##      .25      .50      .75      .90      .95 
##    14.11    16.80    20.55    26.63    31.10 
## 
## lowest :  7.336108  7.913256  8.543751  8.741730  8.845888
## highest: 41.295808 42.131150 42.768867 43.626451 48.078510
describe(midwest_states$poppovertyknown, size = "normalsize")
## midwest_states$poppovertyknown 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      437        0      436        1    93642   127656     7968    11811 
##      .25      .50      .75      .90      .95 
##    18364    33788    72840   156285   296908 
## 
## 0 (157, 0.359), 50000 (174, 0.398), 1e+05 (44, 0.101), 150000 (21, 0.048),
## 2e+05 (8, 0.018), 250000 (9, 0.021), 3e+05 (4, 0.009), 350000 (3, 0.007),
## 450000 (3, 0.007), 5e+05 (3, 0.007), 550000 (1, 0.002), 7e+05 (1, 0.002),
## 750000 (1, 0.002), 8e+05 (1, 0.002), 850000 (1, 0.002), 950000 (2, 0.005),
## 1050000 (1, 0.002), 1400000 (1, 0.002), 2100000 (1, 0.002), 5e+06 (1,
## 0.002)
describe(midwest_states$percbelowpoverty, size = "normalsize")
## midwest_states$percbelowpoverty 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      437        0      437        1    12.51    5.431    5.941    6.664 
##      .25      .50      .75      .90      .95 
##    9.199   11.822   15.133   17.981   21.544 
## 
## lowest :  2.180168  2.714734  3.121060  3.237628  3.385290
## highest: 28.528049 28.673385 30.184115 32.244278 48.691099
describe(midwest_states$StateName, size = "normalsize")
## midwest_states$StateName 
##        n  missing distinct 
##      437        0        5 
##                                                             
## Value       Illinois   Indiana  Michigan      Ohio Wisconsin
## Frequency        102        92        83        88        72
## Proportion     0.233     0.211     0.190     0.201     0.165

2.3 Data Observation/Conclusion

Looking at the summary and profiling results, it has good quality of data which has no missing records

Correlation plot says that direct correlation of poverty on diversed people

library(corrplot)
## corrplot 0.84 loaded
midwest_states_corr <- subset(midwest_states, select=c(poptotal,    popdensity, popwhite,   popblack,   popamerindian,  popasian,   popother, poppovertyknown)) 
corr_data <- cor(midwest_states_corr)
par(mfrow=c(1,1))
corrplot(corr_data, method = "pie")

3.Data Wrangling

Created a subset of data using few attributes with aggragated value based on States

Changed Column names for aggragated value

Reshaping data using reshape2 library to plot histogram

midwest_states_prof_std <- sqldf("SELECT StateName,avg(percollege) as College_Percentage,avg(percprof) as Professor_Percentage from midwest_states group by StateName")
#midwest_states_prof_std
library(reshape2)
reshape_prof_college <- melt(midwest_states_prof_std)
## Using StateName as id variables
#reshape_prof_college
library(ggplot2)
ggplot(reshape_prof_college, aes(x=StateName, y= value , fill=variable)) + geom_bar(stat="identity",position="dodge", colour="black", hjust = 0.5, vjust = 0.3) + scale_fill_discrete(name = "Legend") + labs( y = "Percentage", x = "Mid-west States") + ggtitle("Ratio of College and Professor in Mid west States") +
  theme(plot.title = element_text(hjust = 0.5))
## Warning: Ignoring unknown parameters: hjust, vjust

3.1 Data Wrangling Conclusion

Wisconsion State College and Professor Ratio is less compared to other states

Indiana State College and Professor Ratio is more compared to other states

4.Data Visulazation

Plotting graphs to demonstrate the various factors involved in area,diversity and poverty in mid-west states

4.1 Area coverage on different States

#hist(midwest_states$area,freq=FALSE,xlab = "Midwest Area",col = "lightgreen",border = "blue", main="Distribution of Area in Mid-west States") +
#  curve(dnorm(x,mean=mean(midwest_states$area),sd=sd(midwest_states$area),add=TRUE,col="darkblue"))
State.Name <- midwest_states$StateName   
ggplot(midwest_states, aes(midwest_states$area, fill = State.Name)) +
  geom_histogram() + labs(x="Area", y = "count") + ggtitle("Demographic Area Coverage on States") +
  theme(plot.title = element_text(hjust = 0.5))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Ohio and Indiana has more area compared to other states

4.2 Diversity of population

midwest_diversity <- sqldf("SELECT StateName,sum(popwhite) pop_white,sum(popblack) pop_black,sum(popamerindian) pop_amerindian,
                           sum(popasian) pop_asian,sum(popother) pop_other from midwest_states group by StateName")
#midwest_diversity
reshape_diversity <- melt(midwest_diversity)
## Using StateName as id variables
#reshape_diversity
ggplot(reshape_diversity, aes(x = factor(StateName), y = value, colour = variable)) + 
    geom_point() + labs(y = "Population Number", x = "Mid-west States") + ggtitle("Diversity of population on different States") +
  theme(plot.title = element_text(hjust = 0.5))

Michigan,Ohio and Wisconsion has almost same pattern of diversity for asian,amerindian and other population

4.3 Poverty vs Education

ggplot(data = midwest_states, aes(y = percbelowpoverty, x = percollege)) + geom_point((aes(color = StateName))) + xlab("College Education") + ylab("Total Poverty") + ggtitle("Poverty Vs Education % in Midwest States") + theme(plot.title = element_text(hjust = 0.5))

Poverty is more when education is low accross all the states

4.4 Total Population vs Poverty population

midwest_poverty <- subset(midwest_states,select=c(poptotal,poppovertyknown))
#midwest_poverty
ggplot(midwest_poverty, aes(poptotal, poppovertyknown, color = poppovertyknown)) +
  geom_point(shape = 16, size = 5, show.legend = FALSE) +
  theme_minimal() +
  scale_color_gradient(low = "#0091ff", high = "#f0650e") +
  labs( y = "Poverty Total", x = "Population Total") + ggtitle("Total Population vs Poverty Population") +
  theme(plot.title = element_text(hjust = 0.5))

Total population is directly proportional with poverty

4.5 Below Poverty level on States

midwest_poverty <- subset(midwest_states,select=c(StateName,percbelowpoverty))
#Graph
qplot( x=StateName , y=percbelowpoverty , data=midwest_poverty , geom=c("boxplot","jitter") , fill=StateName) + ggtitle("Below Poverty Level on States") +
  theme(plot.title = element_text(hjust = 0.5))

Illinois is the state where the below povety is more scattered

5.Conclusion

  • Number of Colleges and professor ratio is not equally distributed among states, the less ratio might hit education quality
  • Diversity among states are not even, there could be certain benefits on few states or the more diversified State has better living opputunity
  • Poverty is more when education is low accross all the states, shows that states has to improve the education
  • Poverty level increases when total population increase shows that living style has not significantly changed in all the states