library(dplyr)
## Warning: package 'dplyr' was built under R version 3.4.1
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
## Warning: package 'tidyr' was built under R version 3.4.1
library(stringr)
## Warning: package 'stringr' was built under R version 3.4.1
First we pull the relevant data from the csv file, which I retrieved from https://catalog.data.gov/ (posted by Durley Torres-Marin). In this case there are several tables in one file, separated by text, so I'll use read.table, which can start and stop on specified lines. Some of the countries have commas in their names, so we need to take care of that before we load.
Again, using Windows PowerShell: >get-content .\daca_population_data.csv|%{$_ -replace ", ", ";"}|set-content .\daca_population_data2.csv
In this case looking at the raw data was important, because the table had wrapped the large numbers (which contain commas) in double quotes. I had to set the quote argument to " to make sure the table read correctly.
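For readers without PowerShell, a minimal R equivalent of that pre-processing step would be the sketch below (assuming the raw file is named daca_population_data.csv and sits in the working directory):
#Sketch: protect the commas inside country names before read.table sees them.
#The quoted numbers use commas with no trailing space, so they are unaffected.
raw_lines <- readLines("daca_population_data.csv")
raw_lines <- gsub(", ", ";", raw_lines, fixed = TRUE)
writeLines(raw_lines, "daca_population_data2.csv")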
We will begin by examining the countries of origin of DACA recipients.
daca_countries <- "daca_population_data2.csv" %>% read.table(sep = ",", skip=5, nrows=153, quote = "\"") %>% tbl_df()
#Now that we've read the data we can put the commas back in.
daca_countries$V1<-daca_countries$V1 %>% str_replace_all(";", ",")
#We can also remove the commas in the numbers and set them as integers.
daca_countries$V2<-daca_countries$V2 %>% str_replace_all( ",", "") %>% as.integer()
#Finally we can rename the columns.
daca_countries<-daca_countries %>% rename(Country=V1,Population=V2,Percent=V3)
head(daca_countries)
## # A tibble: 6 x 3
## Country Population Percent
## <chr> <int> <dbl>
## 1 NA NA
## 2 Total 689800 100.0
## 3 Mexico 548000 79.4
## 4 El Salvador 25900 3.7
## 5 Guatemala 17700 2.6
## 6 Honduras 16100 2.3
tail(daca_countries)
## # A tibble: 6 x 3
## Country Population Percent
## <chr> <int> <dbl>
## 1 Afghanistan 10 0.0
## 2 Azerbaijan 10 0.0
## 3 Moldova 10 0.0
## 4 Namibia 10 0.0
## 5 Other Countries 200 0.0
## 6 Not Available 1800 0.3
There are so many countries that the data are hard to visualize, so taking a subset may be useful.
par(las=2)
barplot(daca_countries$Percent,names.arg =daca_countries$Country,horiz = TRUE,cex.names = 0.1)
summary(daca_countries$Population[-2])
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 10 35 110 4570 350 548000 1
The summary of the Population data shows that the data are highly right-skewed with a median of 110 but a mean of 4570.
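A log-scale histogram makes that skew visible; here is a quick sketch (the total row is dropped as in the summary above, and hist() ignores the NA row):
#Sketch: histogram of log10 population across countries of origin.
hist(log10(daca_countries$Population[-2]), main = "", xlab = "log10(Population)")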
daca_highest_countries <- daca_countries %>% filter(Percent >= 0.3, Percent < 100.0)
## Warning: package 'bindrcpp' was built under R version 3.4.1
daca_highest_countries
## # A tibble: 18 x 3
## Country Population Percent
## <chr> <int> <dbl>
## 1 Mexico 548000 79.4
## 2 El Salvador 25900 3.7
## 3 Guatemala 17700 2.6
## 4 Honduras 16100 2.3
## 5 Peru 7420 1.1
## 6 Korea, South 7310 1.1
## 7 Brazil 5780 0.8
## 8 Ecuador 5460 0.8
## 9 Colombia 5020 0.7
## 10 Argentina 3970 0.6
## 11 Philippines 3880 0.6
## 12 India 2640 0.4
## 13 Jamaica 2640 0.4
## 14 Venezuela 2480 0.4
## 15 Dominican Republic 2430 0.4
## 16 Uruguay 1930 0.3
## 17 Trinidad & Tobago 1930 0.3
## 18 Not Available 1800 0.3
par(las=2)
barplot(daca_highest_countries$Percent,names.arg =daca_highest_countries$Country,horiz = TRUE,cex.names = 0.4)
daca_countries_no_outliers <- daca_countries %>% filter(Population <= 1000)
boxplot(daca_countries_no_outliers$Population)
We see from the bar plot that Mexico is by far the largest contributor to the DACA program. Most of the other top contributors are in the Western Hemisphere, with South Korea, India, and the Philippines being the only Eastern Hemisphere countries that contribute more than 0.3% of the DACA population. The box plot shows that the majority of countries of origin in the DACA program account for fewer than 1000 people each.
We will now examine the States of Residence of DACA recipients.
daca_states <- "daca_population_data2.csv" %>% read.table(sep = ",", skip=174, nrows=59, quote = "\"") %>% tbl_df()
#Now that we've read the data we can put the commas back in.
daca_states$V1<-daca_states$V1 %>% str_replace_all(";", ",")
#We can also remove the commas in the numbers and set them as integers.
daca_states$V2<-daca_states$V2 %>% str_replace_all(",", "")%>% as.integer()
#Finally we can rename the columns.
daca_states<-daca_states %>% rename(State=V1,Population=V2,Percent=V3)
head(daca_states)
## # A tibble: 6 x 3
## State Population Percent
## <chr> <int> <dbl>
## 1 California 197900 28.7
## 2 Texas 113000 16.4
## 3 Illinois 35600 5.2
## 4 New York 32900 4.8
## 5 Florida 27000 3.9
## 6 Arizona 25500 3.7
tail(daca_states)
## # A tibble: 6 x 3
## State Population Percent
## <chr> <int> <dbl>
## 1 Guam 20 0.0
## 2 Northern Mariana Islands 10 0.0
## 3 Vermont 10 0.0
## 4 Other territories 10 0.0
## 5 Not Available 4180 0.6
## 6 NA NA
summary(daca_states$Population)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 10.0 352.5 4090.0 11892.9 8975.0 197900.0 1
Again, the median and mean are offset, meaning that the data are right-skewed, with the mean higher than the median. This means that a few states are home to a much larger pool of DACA recipients than the rest. You can see several outliers with population > 20,000 in the boxplot, with the top two states omitted for clarity.
daca_states_no_outliers <- daca_states %>% filter(Population < 50000)
boxplot(daca_states_no_outliers$Population)
par(las=2)
daca_top_states <- daca_states %>% filter(Percent >= 2.4)
barplot(daca_top_states$Percent,names.arg =daca_top_states$State,horiz = TRUE,cex.names = 0.5)
We see that California and Texas have the largest populations of DACA recipients. Both states were once part of Mexico, share a border with Mexico, and as a result have large Hispanic communities; Florida and Arizona also have large Hispanic communities. The list further contains states with large cities, such as Illinois and New York. This is not a surprising result, given that the majority of DACA recipients come from Latin American countries, particularly Mexico: immigrants tend to settle where there is economic opportunity and in communities that share their native language and culture.
daca_cities <- "daca_population_data2.csv"%>% read.table(sep = ",", skip=248, nrows=91, quote = "\"")%>% tbl_df()
#Now that we've read the data we can put the commas back in.
daca_cities$V1<-daca_cities$V1 %>% str_replace_all(";", ",")
#We can also remove the commas in the numbers and set them as integers.
daca_cities$V2<-daca_cities$V2 %>% str_replace_all(",", "") %>% as.integer()
#Finally we can rename the columns.
daca_cities<-daca_cities %>% rename(Metro=V1,Population=V2,Percent=V3)
head(daca_cities)
## # A tibble: 6 x 3
## Metro Population Percent
## <chr> <int> <dbl>
## 1 Los Angeles-Long Beach-Anaheim,CA 89900 13.0
## 2 New York-Newark-Jersey City,NY-NJ-PA 47200 6.8
## 3 Dallas-Fort Worth-Arlington,TX 36700 5.3
## 4 Houston-The Woodlands-Sugar Land,TX 35800 5.2
## 5 Chicago-Naperville-Elgin,IL-IN-WI 34100 5.0
## 6 Riverside-San Bernardino-Ontario,CA 22300 3.2
tail(daca_cities)
## # A tibble: 6 x 3
## Metro Population Percent
## <chr> <int> <dbl>
## 1 Cincinnati,OH-KY-IN 1000 0.2
## 2 St. Louis,MO-IL 1000 0.2
## 3 Lexington-Fayette,KY 1000 0.2
## 4 Other CBSA 86700 12.6
## 5 Non-CBSA 12300 1.8
## 6 Not Available 6600 1.0
summary(daca_cities$Population)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 1600 2800 7580 6050 89900
Metropolitan areas of residence follow the same pattern as above. The difference between the mean and median suggests that a few metropolitan areas are home to a much larger share of DACA recipients. These metro areas lie in the states with the larger DACA populations, reflecting those states' histories and proximity to the largest countries of origin.
par(las=2)
daca_top_cities <- daca_cities %>% filter(Percent >= 2.2)
barplot(daca_top_cities$Percent,names.arg =daca_top_cities$Metro,horiz = TRUE,cex.names = 0.2)
daca_top_cities
## # A tibble: 10 x 3
## Metro Population Percent
## <chr> <int> <dbl>
## 1 Los Angeles-Long Beach-Anaheim,CA 89900 13.0
## 2 New York-Newark-Jersey City,NY-NJ-PA 47200 6.8
## 3 Dallas-Fort Worth-Arlington,TX 36700 5.3
## 4 Houston-The Woodlands-Sugar Land,TX 35800 5.2
## 5 Chicago-Naperville-Elgin,IL-IN-WI 34100 5.0
## 6 Riverside-San Bernardino-Ontario,CA 22300 3.2
## 7 Phoenix-Mesa-Scottsdale,AZ 22000 3.2
## 8 Atlanta-Sandy Springs-Roswell,GA 15700 2.3
## 9 San Francisco-Oakland-Hayward,CA 15500 2.3
## 10 Other CBSA 86700 12.6
Once more we see a pattern consistent with those above: the metro areas with the most DACA recipients are in the states with the most DACA recipients. The interesting feature of these data is that nearly as many DACA recipients live in smaller metropolitan areas, here lumped together as “Other CBSA” (Core Based Statistical Area), as in the largest single metro area, Los Angeles.
daca_sex <- "daca_population_data2.csv"%>% read.table(sep = ",", skip=356, nrows=3, quote = "\"") %>% tbl_df()
#There are no commas in these category names, so we only need to remove the
#commas in the numbers and set them as integers.
daca_sex$V2<-daca_sex$V2 %>% str_replace_all(",", "")%>% as.integer()
#Finally we can rename the columns.
daca_sex<-daca_sex %>% rename(Sex=V1,Population=V2,Percent=V3)
daca_sex
## # A tibble: 3 x 3
## Sex Population Percent
## <fctr> <int> <dbl>
## 1 Female 362700 52.6
## 2 Male 326900 47.4
## 3 Not available 200 0.0
par(las=2)
barplot(daca_sex$Population,names.arg =daca_sex$Sex,horiz = TRUE,cex.names = 0.5)
It appears that there are noticeably more female DACA recipients than male, and I'm curious as to why. For comparison, according to US census data (https://www.census.gov/prod/cen2010/briefs/c2010br-03.pdf), males outnumber females in every age bracket up through 20-24, and as we will see shortly the median age of the DACA population is 23. Females account for roughly 48.9% of the US population in those brackets (computed below). We can test the difference with a two-tailed test of a population proportion, following http://www.r-tutor.com/elementary-statistics/hypothesis-testing/two-tailed-test-population-proportion
#Counts by sex from the 2010 Census brief linked above, summed over the
#five-year age brackets up through 20-24.
males <- 10319427+10389638+10579862+11303666+11014176
females <- 9881935+9959019+10097332+10736677+10571823
n = sum(daca_sex$Population)
p = daca_sex$Population[1]/n
p
## [1] 0.5258046
p0 = females/(males+females)
p0
## [1] 0.4887463
z = (p-p0)/sqrt(p0*(1-p0)/n)
z
## [1] 61.57253
The typical Z-score for significance at the 95% level is \(\pm 1.96\):
alpha = 0.05
z_alpha_half = qnorm(1-alpha/2)
z_alpha_half
## [1] 1.959964
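Since \(z = 61.6\) is far beyond \(\pm 1.96\), we reject the null hypothesis that the DACA female share matches the census baseline. As a cross-check (a sketch, not part of the original analysis), base R's prop.test runs the same one-sample proportion test:
#Sketch: one-sample proportion test against the census baseline p0.
#correct = FALSE matches the plain z-test computed above.
prop.test(x = daca_sex$Population[1], n = n, p = p0,
          alternative = "two.sided", correct = FALSE)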
There is a significant gender imbalance in the DACA program. My hypothesis as to why is that young Latin American men are more likely to be arrested, which matters because of the DACA FAQ requirement: “have not been convicted of a felony offense, a significant misdemeanor, or more than three misdemeanors of any kind;” http://www.immigrationequality.org/get-legal-help/our-legal-resources/path-to-status-in-the-u-s/daca-deferred-action-for-childhood-arrivals/
Hispanics make up 27.5% of NYC's population (1), for example, but 36.1% of arrests (2), and men are nearly 3 times more likely to be arrested than women (3).
(1)https://en.wikipedia.org/wiki/Demographics_of_New_York_City
daca_age <- "daca_population_data2.csv" %>% read.table(sep = ",", skip=363, nrows=6, quote = "\"")%>% tbl_df()
#There are no commas in these category names, so we only need to remove the
#commas in the numbers and set them as integers.
daca_age$V2<-daca_age$V2 %>% str_replace_all(",", "") %>% as.integer()
#Finally we can rename the columns.
daca_age<-daca_age %>% rename(Age=V1,Population=V2,Percent=V3)
daca_age
## # A tibble: 6 x 3
## Age Population Percent
## <fctr> <int> <dbl>
## 1 Under 16 2000 0.3
## 2 16-20 196500 28.5
## 3 21-25 253100 36.7
## 4 26-30 163400 23.7
## 5 31-36 74900 10.9
## 6 Not available 5 0.0
daca_age_stats <- "daca_population_data2.csv" %>% read.table(sep = ",", skip=370, nrows=3, quote = "\"")%>%tbl_df()
daca_age_stats<-daca_age_stats %>% rename(Stat=V1,Age=V2)
#We need to drop an empty column from this table
daca_age_stats <- daca_age_stats %>% select(-V3)
daca_age_stats
## # A tibble: 3 x 2
## Stat Age
## <fctr> <fctr>
## 1 Average Age 23.8
## 2 Median Age 23
## 3 Interquartile Range 20 to 27
par(las=2)
barplot(daca_age$Population,names.arg =daca_age$Age,horiz = TRUE,cex.names = 0.5)
The age requirements for DACA are:
You were under the age of 31 as of June 15, 2012;
You entered the United States prior to your 16th birthday;
You have resided in the United States since June 15, 2007 and currently are present in the U.S.;
You were in the United States on June 15, 2012 and must be physically in the U.S. at the time of filing for your request for deferred action.
These data are as of September 4th, 2017. The lack of DACA recipients under 16 is striking. This may be due to parents being reluctant to sign their minor children up, out of fear of immigration action.
daca_mrd <- "daca_population_data2.csv" %>% read.table(sep = ",", skip=390, nrows=5, quote = "\"") %>% tbl_df()
#There are no commas in these category names, so we only need to remove the
#commas in the numbers and set them as integers.
daca_mrd$V2<-daca_mrd$V2 %>% str_replace_all(",", "")%>% as.integer()
#Finally we can rename the columns.
daca_mrd<-daca_mrd %>% rename(Status=V1,Population=V2,Percent=V3)
daca_mrd
## # A tibble: 5 x 3
## Status Population Percent
## <fctr> <int> <dbl>
## 1 Single 571600 82.9
## 2 Married 105700 15.3
## 3 Divorced 7800 1.1
## 4 Widowed 300 0.1
## 5 Not available 4400 0.6
par(las=2)
barplot(daca_mrd$Percent,names.arg =daca_mrd$Status,horiz = TRUE,cex.names = 0.5)
DACA is targeted at young people, so the low marriage rate is not surprising. Given that those who are married here are young, the low divorce rate is somewhat surprising. This may be a result of fairly new marriages and the large share of Latinos in the DACA population, who tend to have lower divorce rates overall: http://www.healthymarriageinfo.org/research-and-policy/marriage-facts/culture/Hispanics-and-Latinos/index.aspx
My impression of these data is that the DACA program has been very successful at giving young undocumented immigrants an opportunity to live and work legally in the United States. The imbalance in the sex distribution does warrant further investigation, as it may be due to institutional racism and sexism external to the program itself. Otherwise there is no evidence in these data that the program merited suspension by the current administration.
For the next set of data I'll be scraping a webpage, posted by me, as this was my research group when I was at Purdue: http://www.physics.purdue.edu/astro/MOJAVE/velocitytable.html. It is a sample of Active Galactic Nuclei (AGN) called blazars. A better-known subclass is the quasars; in fact some blazars are quasars, albeit more compact and more luminous. What I am after is just the objects' names and their jet feature speeds in units of the speed of light (c). What makes blazars a special type of AGN is that the central supermassive black holes eject plasma clouds (the jet features) at close to the speed of light. Because we are looking nearly straight down the jet, relativistic light-travel-time effects make these features appear to move faster than light in our frame of reference.
Please note that I am very familiar with loading data into a program using SQL databases and .csv files, as that was part of my job at Purdue; I have some Perl scripts posted to my GitHub written for just that purpose. I really wanted to give web scraping a try, as I had never done it before.
First we will use rvest:
library(rvest)
## Warning: package 'rvest' was built under R version 3.4.2
## Loading required package: xml2
## Warning: package 'xml2' was built under R version 3.4.1
Next, using https://blog.rstudio.com/2014/11/24/rvest-easy-web-scraping-with-r/ as a guide, we will set up the web scraping:
url <- "http://www.physics.purdue.edu/astro/MOJAVE/velocitytable.html"
mojave <- url %>% read_html() %>% html_node(xpath='//*[@id="myTable"]') %>% html_table()
head(mojave)
## B1950\nName ID Epochs Flux Density (mJy) Distance (mas) Distance (pc)
## 1 0003+380 1 9 5 4.24 15.40
## 2 0003+380 2 7 17 1.81 6.57
## 3 0003+380 4 6 17 1.25 4.54
## 4 0003+380 5 9 39 0.76 2.77
## 5 0003+380 6 10 98 0.39 1.43
## 6 0003-066 2 5 222 1.05 5.12
## P.A. (deg.) Velocity P.A.(deg.) Vector Motion Offset (deg.)
## 1 120.5 97 ± 16 24 ± 16
## 2 112.6 119.7 ± 2.6 7.1 ± 2.7
## 3 114.8 208 ± 12 93 ± 12c
## 4 117.5 331 ± 119 146 ± 119
## 5 115.4 335 ± 41 141 ± 41
## 6 322.9 226.3 ± 4.8 96.6 ± 5.0c
## Speed (muas/y) Speed (c) Perp. Accel (muas/y2) Par. Accel (muas/y2)
## 1 157 ± 42 2.29±0.61
## 2 319 ± 22 4.64±0.32
## 3 36.3 ± 9.2 0.53±0.13
## 4 4.8 ± 7.2e 0.07±0.10
## 5 12.7 ± 8.3e 0.19±0.12
## 6 191 ± 15 4.09±0.33
## Ejection Epoch Reference epoch t_R.A. t_dec.
## 1 2008.805 1985.09 1888.98
## 2 2007.705 2001.68 2003.32
## 3 2009.535 2075.66 1992.93
## 4 2010.262 2300.95 2092.52
## 5 2009.897 2073.84 2023.64
## 6 1997.798 1993.56 2004.48
Now we want to get just the names, the Flux Density of the features (how bright they are at the telescope over a band of frequencies), and their distances in parsecs (how far each feature sits from the core; note 1 pc ≈ 3.26 light-years). Maybe there is a relationship between how bright a feature is and how far it is from the core.
mojave_flux <- mojave %>% select(`B1950
Name`, `Flux Density (mJy)`,`Distance (mas)`,`Distance (pc)`)
#We need to drop the NA's, since some sources do not have reliable distance-to-source measurements.
mojave_flux <- mojave_flux %>% drop_na()
head(mojave_flux)
## B1950\nName Flux Density (mJy) Distance (mas) Distance (pc)
## 1 0003+380 5 4.24 15.40
## 2 0003+380 17 1.81 6.57
## 3 0003+380 17 1.25 4.54
## 4 0003+380 39 0.76 2.77
## 5 0003+380 98 0.39 1.43
## 6 0003-066 222 1.05 5.12
#The first test looks at the linear distance of the feature from the core vs. brightness at the telescope.
fit <- lm(log(mojave_flux$`Distance (pc)`, base = 10)~log(mojave_flux$`Flux Density (mJy)`, base =10))
#Log-log transformations were used, as the data did not appear linear untransformed.
summary(fit)
##
## Call:
## lm(formula = log(mojave_flux$`Distance (pc)`, base = 10) ~ log(mojave_flux$`Flux Density (mJy)`,
## base = 10))
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.36953 -0.30004 0.02323 0.34426 1.77375
##
## Coefficients:
## Estimate Std. Error
## (Intercept) 1.41846 0.05687
## log(mojave_flux$`Flux Density (mJy)`, base = 10) -0.26923 0.02851
## t value Pr(>|t|)
## (Intercept) 24.944 <2e-16 ***
## log(mojave_flux$`Flux Density (mJy)`, base = 10) -9.445 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5762 on 1235 degrees of freedom
## Multiple R-squared: 0.06737, Adjusted R-squared: 0.06661
## F-statistic: 89.21 on 1 and 1235 DF, p-value: < 2.2e-16
plot(log(mojave_flux$`Flux Density (mJy)`,base=10), log(mojave_flux$`Distance (pc)`,base=10), col="blue")
abline(fit, col="red")
#Calculating linear distance from angular size is model dependent, so it may be better to use angular size instead.
#Both Flux Density and Distance in milli-arc-seconds (1/3,600,000th of a degree) are direct measurements.
fit2 <- lm(log(mojave_flux$`Distance (mas)`,base = 10)~log(mojave_flux$`Flux Density (mJy)`,base=10))
summary(fit2)
##
## Call:
## lm(formula = log(mojave_flux$`Distance (mas)`, base = 10) ~ log(mojave_flux$`Flux Density (mJy)`,
## base = 10))
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.0045 -0.3087 -0.0217 0.2668 1.5226
##
## Coefficients:
## Estimate Std. Error
## (Intercept) 0.81006 0.04302
## log(mojave_flux$`Flux Density (mJy)`, base = 10) -0.28279 0.02156
## t value Pr(>|t|)
## (Intercept) 18.83 <2e-16 ***
## log(mojave_flux$`Flux Density (mJy)`, base = 10) -13.12 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4359 on 1235 degrees of freedom
## Multiple R-squared: 0.1223, Adjusted R-squared: 0.1215
## F-statistic: 172 on 1 and 1235 DF, p-value: < 2.2e-16
plot(log(mojave_flux$`Flux Density (mJy)`,base=10), log(mojave_flux$`Distance (mas)`,base=10), col="violet")
abline(fit2, col="green")
There is a statistically significant tendency for features closer to the core to be brighter. If we were preparing these data for publication, my next step would be to use non-parametric correlation coefficients and partial correlation coefficients to factor out redshift, which is a proxy for distance to the source, since the equations relating these measurements are likely to be non-linear. Additionally, this could just reflect our ability to resolve smaller separations only in closer sources.
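As a sketch of that next step (Spearman's rank correlation only; the redshifts needed for a partial correlation are not in this table):
#Sketch: non-parametric check of the distance-brightness relation.
#Spearman's rho assumes only a monotonic relationship, not a (log-)linear one.
cor.test(mojave_flux$`Flux Density (mJy)`, mojave_flux$`Distance (pc)`,
         method = "spearman")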
I now want to look at the speeds of individual jet features.
mojave_speeds <- mojave %>% select(`B1950
Name`, `Flux Density (mJy)` , `Speed (c)`)
head(mojave_speeds)
## B1950\nName Flux Density (mJy) Speed (c)
## 1 0003+380 5 2.29±0.61
## 2 0003+380 17 4.64±0.32
## 3 0003+380 17 0.53±0.13
## 4 0003+380 39 0.07±0.10
## 5 0003+380 98 0.19±0.12
## 6 0003-066 222 4.09±0.33
We need to use the \(\pm\) symbol to separate out the uncertainty, and drop the missing data.
mjv_sep_speeds <- mojave_speeds %>% separate('Speed (c)', into= c("Speed (c)" , "Error (c)"), sep="±", remove=TRUE) %>% drop_na()
## Warning: Too few values at 58 locations: 14, 15, 66, 67, 68, 94, 95, 114,
## 115, 116, 117, 145, 146, 147, 148, 149, 150, 151, 152, 208, ...
mjv_sep_speeds$`Speed (c)` <- as.numeric(mjv_sep_speeds$`Speed (c)`)
mjv_sep_speeds$`Error (c)` <- as.numeric(mjv_sep_speeds$`Error (c)`)
mjv_sep_speeds$`Speed (c)` %>% summary()
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.140 4.890 6.537 9.800 39.700
head(mjv_sep_speeds)
## B1950\nName Flux Density (mJy) Speed (c) Error (c)
## 1 0003+380 5 2.29 0.61
## 2 0003+380 17 4.64 0.32
## 3 0003+380 17 0.53 0.13
## 4 0003+380 39 0.07 0.10
## 5 0003+380 98 0.19 0.12
## 6 0003-066 222 4.09 0.33
fit3 <- lm(mjv_sep_speeds$`Speed (c)`~ log(mjv_sep_speeds$`Flux Density (mJy)`,base=10))
summary(fit3)
##
## Call:
## lm(formula = mjv_sep_speeds$`Speed (c)` ~ log(mjv_sep_speeds$`Flux Density (mJy)`,
## base = 10))
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.709 -5.370 -1.681 3.317 33.191
##
## Coefficients:
## Estimate Std. Error
## (Intercept) 6.1684 0.6386
## log(mjv_sep_speeds$`Flux Density (mJy)`, base = 10) 0.1931 0.3201
## t value Pr(>|t|)
## (Intercept) 9.659 <2e-16 ***
## log(mjv_sep_speeds$`Flux Density (mJy)`, base = 10) 0.603 0.546
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.471 on 1235 degrees of freedom
## Multiple R-squared: 0.0002946, Adjusted R-squared: -0.0005149
## F-statistic: 0.3639 on 1 and 1235 DF, p-value: 0.5465
plot(log(mjv_sep_speeds$`Flux Density (mJy)`,base=10), mjv_sep_speeds$`Speed (c)`, col="cyan")
abline(fit3, col = "magenta")
Since there are jet features with zero speed, we had to use a semi-log plot, because \(\lim_{x \rightarrow 0} \log{x} = -\infty\). These data are widely scattered, the trend line is flat, and there is no statistically significant relationship between jet feature speed and feature brightness. In other words, there is no connection between the radiative output of the plasma (its brightness) and its bulk motion (jet speed).
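If a fully log-log plot were wanted despite the zero speeds, one option (a sketch) is log1p(), which is finite at zero:
#Sketch: log1p(x) = log(1 + x), so zero-speed features stay on the plot.
plot(log(mjv_sep_speeds$`Flux Density (mJy)`,base=10), log1p(mjv_sep_speeds$`Speed (c)`), col="cyan", ylab="log(1 + Speed (c))")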
For the last data set of this project we will analyze NYPD traffic data. This was posted to the discussion board by Yuen Chun Wong, along with several questions Yuen asked us to look into.
For the specific data that Yuen posted, I see nothing that measures the time of collision; in fact these data are specific to September 2017, so the first and last items are not measurable from these data. I am, however, interested in risk to cyclists and pedestrians, so I will compare cyclist injuries/deaths to motorist injuries/deaths, and then pedestrian injuries/deaths to motorist injuries/deaths. The tables that record vehicle type and location, and collision cause and location, are organized very similarly; I will focus on vehicle type and location, as the tidyr and dplyr steps to organize the data would be identical.
#The first rows are the column titles, so I skip them and use rename() instead.
nypd_traffic_coll <- "cityacc-en-us.csv" %>% read.table(sep = ",", skip=4, nrows=83, quote = "\"") %>% tbl_df()
nypd_traffic_coll
## # A tibble: 83 x 14
## V1 V2 V3 V4 V5 V6 V7 V8 V9
## <fctr> <fctr> <int> <int> <int> <int> <int> <int> <int>
## 1 C CITYWIDE 18727 37086 3701 2325 9 1648 2
## 2 M MANHATTAN 3916 7571 579 249 1 205 0
## 3 001 1st Precinct 288 561 42 24 0 12 0
## 4 005 5th Precinct 203 376 40 12 0 14 0
## 5 006 6th Precinct 130 249 18 6 0 2 0
## 6 007 7th Precinct 102 195 22 9 0 8 0
## 7 009 9th Precinct 126 241 21 9 0 6 0
## 8 010 10th Precinct 246 479 20 7 0 7 0
## 9 013 13th Precinct 220 414 47 16 0 11 0
## 10 014 Midtown South Precinct 323 630 28 9 0 2 0
## # ... with 73 more rows, and 5 more variables: V10 <int>, V11 <int>,
## # V12 <int>, V13 <int>, V14 <int>
nypd_traffic_coll <- nypd_traffic_coll %>% rename(GeoCode=V1, GeoCodeLabel=V2, Number_of_Motor_Vehicle_Collisions=V3,Vehicles_or_Motorists_Involved=V4, Injury_or_Fatal_Collisions=V5, MotoristsInjured=V6, MotoristsKilled=V7, PassengInjured=V8,PassengKilled=V9, CyclistsInjured=V10, CyclistsKilled=V11, PedestrInjured=V12,PedestrKilled=V13, Bicycle=V14)
head(nypd_traffic_coll)
## # A tibble: 6 x 14
## GeoCode GeoCodeLabel Number_of_Motor_Vehicle_Collisions
## <fctr> <fctr> <int>
## 1 C CITYWIDE 18727
## 2 M MANHATTAN 3916
## 3 001 1st Precinct 288
## 4 005 5th Precinct 203
## 5 006 6th Precinct 130
## 6 007 7th Precinct 102
## # ... with 11 more variables: Vehicles_or_Motorists_Involved <int>,
## # Injury_or_Fatal_Collisions <int>, MotoristsInjured <int>,
## # MotoristsKilled <int>, PassengInjured <int>, PassengKilled <int>,
## # CyclistsInjured <int>, CyclistsKilled <int>, PedestrInjured <int>,
## # PedestrKilled <int>, Bicycle <int>
We will compare risk by analyzing, for each road-user type, the share of collisions that result in death or injury, and by normalizing counts by borough population.
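As a quick sketch of the first measure, using the citywide row shown above:
#Sketch: share of citywide collisions that caused injury or death
#(about 0.198, or roughly 1 in 5).
nypd_traffic_coll$Injury_or_Fatal_Collisions[1]/nypd_traffic_coll$Number_of_Motor_Vehicle_Collisions[1]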
First, a general overview of the whole city and the five boroughs:
city_burr <- nypd_traffic_coll %>% filter(GeoCode=="C"|GeoCode=="M"|GeoCode=="Q"|GeoCode=="K"|GeoCode=="B"|GeoCode=="S")
city_burr
## # A tibble: 6 x 14
## GeoCode GeoCodeLabel Number_of_Motor_Vehicle_Collisions
## <fctr> <fctr> <int>
## 1 C CITYWIDE 18727
## 2 M MANHATTAN 3916
## 3 B BRONX 2979
## 4 K BROOKLYN 5367
## 5 Q QUEENS 5421
## 6 S STATEN ISLAND 1044
## # ... with 11 more variables: Vehicles_or_Motorists_Involved <int>,
## # Injury_or_Fatal_Collisions <int>, MotoristsInjured <int>,
## # MotoristsKilled <int>, PassengInjured <int>, PassengKilled <int>,
## # CyclistsInjured <int>, CyclistsKilled <int>, PedestrInjured <int>,
## # PedestrKilled <int>, Bicycle <int>
First we create a subset of the data, and then create a bar plot.
city_burr_col <- city_burr %>% select(GeoCode, Number_of_Motor_Vehicle_Collisions)
city_burr_col$Number_of_Motor_Vehicle_Collisions %>% barplot(col=c("black", "red", "orange", "yellow", "green", "blue"), names.arg=city_burr_col$GeoCode)
According to this plot, Brooklyn and Queens have the most collisions, but they are also the most populous boroughs. We will bring in some additional data, NYC population by borough, to help with this analysis:
url2 <- "https://www.citypopulation.de/php/usa-newyorkcity.php"
city_pop <- url2 %>% read_html() %>% html_node(xpath='//*[@id="ts"]') %>% html_table()
city_pop
## Name Status PopulationCensus1990-04-01
## 1 Bronx Borough 1,203,789
## 2 Brooklyn (Kings County) Borough 2,300,664
## 3 Manhattan (New York County) Borough 1,487,536
## 4 Queens Borough 1,951,598
## 5 Staten Island (Richmond County) Borough 378,977
## 6 New York City City 7,322,564
## PopulationCensus2000-04-01 PopulationCensus2010-04-01
## 1 1,332,244 1,385,107
## 2 2,465,689 2,504,706
## 3 1,538,096 1,585,874
## 4 2,229,394 2,230,545
## 5 443,762 468,730
## 6 8,009,185 8,174,962
## PopulationEstimate2016-07-01
## 1 1,455,720 <U+2192>
## 2 2,629,150 <U+2192>
## 3 1,643,734 <U+2192>
## 4 2,333,054 <U+2192>
## 5 476,015 <U+2192>
## 6 8,537,673
#I did the part below by hand, as the two tables are ordered differently and it wasn't obvious to me how to sort them into the same order.
city_burr$Pop <- c(city_pop[6,6], city_pop[3,6], city_pop[1,6] ,city_pop[2,6], city_pop[4,6], city_pop[5,6])
city_burr$Pop<-city_burr$Pop %>% str_replace_all(",", "")%>% as.integer()
city_burr<- city_burr %>% mutate(Collisions_per_Pop = Number_of_Motor_Vehicle_Collisions/Pop)
city_burr
## # A tibble: 6 x 16
## GeoCode GeoCodeLabel Number_of_Motor_Vehicle_Collisions
## <fctr> <fctr> <int>
## 1 C CITYWIDE 18727
## 2 M MANHATTAN 3916
## 3 B BRONX 2979
## 4 K BROOKLYN 5367
## 5 Q QUEENS 5421
## 6 S STATEN ISLAND 1044
## # ... with 13 more variables: Vehicles_or_Motorists_Involved <int>,
## # Injury_or_Fatal_Collisions <int>, MotoristsInjured <int>,
## # MotoristsKilled <int>, PassengInjured <int>, PassengKilled <int>,
## # CyclistsInjured <int>, CyclistsKilled <int>, PedestrInjured <int>,
## # PedestrKilled <int>, Bicycle <int>, Pop <int>,
## # Collisions_per_Pop <dbl>
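For reference, a programmatic alternative to the hand matching above would look like the sketch below; the name map is typed from the two tables and is my assumption about how the labels correspond.
#Sketch: align the two tables with an explicit name map instead of by hand.
name_map <- c("CITYWIDE"="New York City", "MANHATTAN"="Manhattan (New York County)",
              "BRONX"="Bronx", "BROOKLYN"="Brooklyn (Kings County)",
              "QUEENS"="Queens", "STATEN ISLAND"="Staten Island (Richmond County)")
idx <- match(name_map[as.character(city_burr$GeoCodeLabel)], city_pop$Name)
#Equivalent to the hand assignment above:
#city_burr$Pop <- city_pop[idx, 6] %>% str_replace_all(",", "") %>% as.integer()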
city_burr$Collisions_per_Pop %>% barplot(col=c("black", "red", "orange", "yellow", "green", "blue"), names.arg=city_burr_col$GeoCode)
When we factor in the population of each borough, the collision numbers are very similar, all of them around 2 per 1,000 residents (2 parts per mille).
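A quick sketch expressing those figures directly in parts per mille:
#Sketch: collisions per 1,000 residents for the city and each borough.
round(city_burr$Collisions_per_Pop * 1000, 2)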
Not all car collisions result in an injury or death. We will now look at the car crashes that do.
car_injury <- city_burr %>% select(GeoCode, Injury_or_Fatal_Collisions, MotoristsInjured, PassengInjured, MotoristsKilled, PassengKilled, Pop) %>% mutate(Total_per_pop = Injury_or_Fatal_Collisions/Pop, Total_injured = MotoristsInjured+PassengInjured, Total_injured_per_pop = Total_injured/Pop, Total_killed = MotoristsKilled + PassengKilled, Total_killed_per_pop = Total_killed/Pop, Killed_per_Injury = Total_killed/Total_injured)
car_injury
## # A tibble: 6 x 13
## GeoCode Injury_or_Fatal_Collisions MotoristsInjured PassengInjured
## <fctr> <int> <int> <int>
## 1 C 3701 2325 1648
## 2 M 579 249 205
## 3 B 663 448 335
## 4 K 1174 679 544
## 5 Q 1106 811 492
## 6 S 179 138 72
## # ... with 9 more variables: MotoristsKilled <int>, PassengKilled <int>,
## # Pop <int>, Total_per_pop <dbl>, Total_injured <int>,
## # Total_injured_per_pop <dbl>, Total_killed <int>,
## # Total_killed_per_pop <dbl>, Killed_per_Injury <dbl>
car_injury$Total_per_pop %>% barplot(col=c("black", "red", "orange", "yellow", "green", "blue"), names.arg=car_injury$GeoCode ,ylab = "Injuries and Deaths per Population")
car_injury$Total_injured_per_pop %>% barplot(col=c("black", "red", "orange", "yellow", "green", "blue"), names.arg=car_injury$GeoCode ,ylab = "Injuries per Population")
car_injury$Total_killed_per_pop %>% barplot(col=c("black", "red", "orange", "yellow", "green", "blue"), names.arg=car_injury$GeoCode ,ylab = "Deaths per Population")
car_injury$Killed_per_Injury %>% barplot(col=c("black", "red", "orange", "yellow", "green", "blue"), names.arg=car_injury$GeoCode ,ylab = "Deaths per Injury")
According to these data, about 1 in 5 car collisions results in injury or death, and about 6 in 10,000 result in the death of a motorist or passenger.
#Averaged over all boroughs; the citywide totals (row 1) are excluded from the vectors.
#Note: length(city_burr[2:6]) counts the 5 selected columns, which happens to equal
#the 5 boroughs; nrow(city_burr) - 1 would state the intent more directly.
length(city_burr[2:6])
## [1] 5
sum(city_burr$Injury_or_Fatal_Collisions[2:6]/city_burr$Number_of_Motor_Vehicle_Collisions[2:6])/length(city_burr[2:6])
## [1] 0.1929269
sum(car_injury$Total_killed[2:6]/city_burr$Number_of_Motor_Vehicle_Collisions[2:6])/length(city_burr[2:6])
## [1] 0.0005719659
Our first step again is to create an appropriate subset of the data and create a bar plot.
city_cyc_col <- city_burr %>% select(GeoCode, CyclistsInjured, CyclistsKilled, Bicycle, Pop) %>% mutate(Bicycle_per_Pop=Bicycle/Pop, Total_injured_killed = CyclistsInjured+CyclistsKilled, Injured_killed_per_pop = Total_injured_killed/Pop)
city_cyc_col
## # A tibble: 6 x 8
## GeoCode CyclistsInjured CyclistsKilled Bicycle Pop Bicycle_per_Pop
## <fctr> <int> <int> <int> <int> <dbl>
## 1 C 477 1 606 8537673 7.097953e-05
## 2 M 130 0 183 1643734 1.113319e-04
## 3 B 57 0 68 1455720 4.671228e-05
## 4 K 188 1 235 2629150 8.938250e-05
## 5 Q 90 0 109 2333054 4.671988e-05
## 6 S 12 0 11 476015 2.310852e-05
## # ... with 2 more variables: Total_injured_killed <int>,
## # Injured_killed_per_pop <dbl>
city_cyc_col$Bicycle_per_Pop %>% barplot(col=c("black", "red", "orange", "yellow", "green", "blue"), names.arg=city_cyc_col$GeoCode, ylab="Bicycle Collisions per Population")
city_cyc_col$Injured_killed_per_pop %>% barplot(col=c("black", "red", "orange", "yellow", "green", "blue"), names.arg=city_cyc_col$GeoCode,ylab="Bicycle Collision Injuries and Deaths per Population")
The bicycle collision data differ from the motor vehicle collisions in that, after factoring in population, the chance of a collision is higher in Manhattan and Brooklyn than in the other boroughs. The per-capita bicycle collision rates run from roughly 20 to 110 parts per million, making such collisions roughly 20 to 95 times less likely than motor vehicle collisions, depending on the borough. These data suggest that Manhattan and Brooklyn have more cyclists than the other boroughs.
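A sketch of that comparison, borough by borough (citywide row excluded):
#Sketch: ratio of motor-vehicle to bicycle collision rates in each borough (M, B, K, Q, S).
city_burr$Collisions_per_Pop[2:6] / city_cyc_col$Bicycle_per_Pop[2:6]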
#Average share of bicycle collisions that injured or killed the cyclist, over the 5 boroughs.
sum(city_cyc_col$Total_injured_killed[2:6]/city_cyc_col$Bicycle[2:6]/length(city_cyc_col[2:6]))
## [1] 0.8538941
Unlike motor vehicles, where a collision is relatively unlikely to cause injury, getting hit on a bicycle is very likely to result in injury, at 85.4%. Only one collision resulted in a death.
Finally we look at the proportion of collisions that involve pedestrians.
city_ped_col <- city_burr %>% select(GeoCode, PedestrInjured,PedestrKilled,Pop) %>% mutate(Ped_Total = PedestrInjured+PedestrKilled, Ped_per_pop = Ped_Total/Pop)
city_ped_col
## # A tibble: 6 x 6
## GeoCode PedestrInjured PedestrKilled Pop Ped_Total Ped_per_pop
## <fctr> <int> <int> <int> <int> <dbl>
## 1 C 705 7 8537673 712 8.339509e-05
## 2 M 139 3 1643734 142 8.638867e-05
## 3 B 131 0 1455720 131 8.998983e-05
## 4 K 241 1 2629150 242 9.204496e-05
## 5 Q 168 3 2333054 171 7.329449e-05
## 6 S 26 0 476015 26 5.462013e-05
city_ped_col$Ped_per_pop %>% barplot(col=c("black", "red", "orange", "yellow", "green", "blue"), names.arg=city_ped_col$GeoCode)
You have slightly less risk of getting hit as a pedestrian in Queens and Staten Island. As with cyclists, the per-capita risk is in the range of roughly 55 to 90 parts per million, well below the roughly 2-per-mille motor vehicle collision rate.
Now we will examine the vehicle types involved in collisions, by location.
nypd_type_loc <- "cityacc-en-us2.csv" %>% read.table(sep = ",", skip=4, nrows=898, quote = "\"") %>% tbl_df()
nypd_type_loc <- nypd_type_loc %>% rename(GeoCode=V1, GeoCodeLabel=V2, VehicleTypeCode=V3, VehicleTypeDescription=V4, Number_of_Collisions=V5)
tail(nypd_type_loc)
## # A tibble: 6 x 5
## GeoCode GeoCodeLabel VehicleTypeCode VehicleTypeDescription
## <fctr> <fctr> <fctr> <fctr>
## 1 123 123rd Precinct MCY MOTORCYCLE
## 2 123 123rd Precinct SEDN PASSENGER VEHICLE
## 3 123 123rd Precinct PICK PICK-UP TRUCK
## 4 123 123rd Precinct SCOM SMALL COM VEH(4 TIRES)
## 5 123 123rd Precinct SUBN SPORT UTILITY / STATION WAGON
## 6 123 123rd Precinct VAN VAN
## # ... with 1 more variables: Number_of_Collisions <int>
The vehicle codes and descriptions are redundant, so we can make a key and then drop the descriptions.
key <- nypd_type_loc %>% select(VehicleTypeCode,VehicleTypeDescription)
key <- key[1:15,]
key
## # A tibble: 15 x 2
## VehicleTypeCode VehicleTypeDescription
## <fctr> <fctr>
## 1 ATV ALL-TERRAIN VEHICLE
## 2 AMBU AMBULANCE
## 3 BIKE BICYCLE
## 4 BUS BUS
## 5 FIRE FIRE TRUCK
## 6 LCOM LARGE COM VEH(6 OR MORE TIRES)
## 7 MCY MOTORCYCLE
## 8 SEDN PASSENGER VEHICLE
## 9 PICK PICK-UP TRUCK
## 10 SCOM SMALL COM VEH(4 TIRES)
## 11 SUBN SPORT UTILITY / STATION WAGON
## 12 TAXI TAXI VEHICLE
## 13 VAN VAN
## 14 OTH OTHER
## 15 UNK UNKNOWN
nypd_type_loc <- nypd_type_loc %>% select(GeoCode, GeoCodeLabel, VehicleTypeCode, Number_of_Collisions)
head(nypd_type_loc)
## # A tibble: 6 x 4
## GeoCode GeoCodeLabel VehicleTypeCode Number_of_Collisions
## <fctr> <fctr> <fctr> <int>
## 1 C CITYWIDE ATV 1
## 2 C CITYWIDE AMBU 112
## 3 C CITYWIDE BIKE 606
## 4 C CITYWIDE BUS 464
## 5 C CITYWIDE FIRE 19
## 6 C CITYWIDE LCOM 1444
Now I want to make the vehicle codes the columns, so I can more easily analyze by vehicle type:
nypd_type_loc <- nypd_type_loc %>% spread(VehicleTypeCode,Number_of_Collisions)
#NA's in this case mean no collisions, so we can set them to 0.
nypd_type_loc[is.na(nypd_type_loc)] <- 0
nypd_type_loc
## # A tibble: 83 x 17
## GeoCode GeoCodeLabel AMBU ATV BIKE BUS FIRE LCOM
## * <fctr> <fctr> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 001 1st Precinct 0 0 8 18 0 43
## 2 005 5th Precinct 0 0 17 9 0 19
## 3 006 6th Precinct 1 0 8 5 3 18
## 4 007 7th Precinct 0 0 6 2 0 8
## 5 009 9th Precinct 4 0 8 1 0 14
## 6 010 10th Precinct 1 0 8 21 0 63
## 7 013 13th Precinct 1 0 16 9 0 21
## 8 014 Midtown South Precinct 4 0 10 26 0 57
## 9 017 17th Precinct 1 0 11 14 0 33
## 10 018 Midtown North Precinct 0 0 19 20 1 34
## # ... with 73 more rows, and 9 more variables: MCY <dbl>, OTH <dbl>,
## # PICK <dbl>, SCOM <dbl>, SEDN <int>, SUBN <dbl>, TAXI <dbl>, UNK <dbl>,
## # VAN <dbl>
# The GeoCode and GeoCodeLabel columns are also redundant; we'll keep the labels.
nypd_type_loc <- nypd_type_loc[,-1]
tail(nypd_type_loc)
## # A tibble: 6 x 16
## GeoCodeLabel AMBU ATV BIKE BUS FIRE LCOM MCY OTH PICK
## <fctr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 BRONX 27 0 68 57 3 267 66 1 206
## 2 CITYWIDE 112 1 606 464 19 1444 361 4 1249
## 3 BROOKLYN 42 1 235 135 6 386 108 0 343
## 4 MANHATTAN 28 0 183 180 6 433 76 3 284
## 5 QUEENS 13 0 109 82 3 321 96 0 342
## 6 STATEN ISLAND 2 0 11 10 1 37 15 0 74
## # ... with 6 more variables: SCOM <dbl>, SEDN <int>, SUBN <dbl>,
## # TAXI <dbl>, UNK <dbl>, VAN <dbl>
par(las=2)
barplot(nypd_type_loc$TAXI, names.arg = nypd_type_loc$GeoCodeLabel, cex.names = 0.2)
Now we have the data organized so that we can look at collisions by vehicle type and location. As an example I selected taxis. In NYC, police precincts with lower numbers tend to be in Manhattan, and we see far more taxi collisions in Manhattan than in any other borough.
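A sketch making the same point numerically:
#Sketch: rank rows by taxi collisions; the citywide and borough totals will top
#the list, followed mostly by Manhattan precincts.
nypd_type_loc %>% select(GeoCodeLabel, TAXI) %>% arrange(desc(TAXI)) %>% head(10)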