library(dplyr)
## Warning: package 'dplyr' was built under R version 3.4.1
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
## Warning: package 'tidyr' was built under R version 3.4.1
library(stringr)
## Warning: package 'stringr' was built under R version 3.4.1
First we pull the relevant data from the csv file, which I retrieved from https://catalog.data.gov/ (posted by Durley Torres-Marin). In this case there are several tables in one file, separated by text, so I'll use read.table, which can start and stop on specified lines. Some of the countries have commas in their names, so we need to take care of that before we load.
Again, using Windows PowerShell: >get-content .\daca_population_data.csv|%{$_ -replace ", ", ";"}|set-content .\daca_population_data2.csv
In this case looking at the raw data was important, because the table had wrapped the large numbers (which contain commas) in double quotes. I had to set the quote argument to " to make sure the table read correctly.
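For readers without PowerShell, a minimal R equivalent of that pre-processing step would be the sketch below (assuming the raw file is named daca_population_data.csv and sits in the working directory):
#Sketch: protect the commas inside country names before read.table sees them.
#The quoted numbers use commas with no trailing space, so they are unaffected.
raw_lines <- readLines("daca_population_data.csv")
raw_lines <- gsub(", ", ";", raw_lines, fixed = TRUE)
writeLines(raw_lines, "daca_population_data2.csv")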
We will begin by examining the countries of origin of DACA recipients.
daca_countries <- "daca_population_data2.csv" %>% read.table(sep = ",", skip=5, nrows=153, quote = "\"") %>% tbl_df()
#Now that we've read the data we can put the commas back in.
daca_countries$V1<-daca_countries$V1 %>% str_replace_all(";", ",")
#We can also remove the commas in the numbers and set them as integers.
daca_countries$V2<-daca_countries$V2 %>% str_replace_all( ",", "") %>% as.integer()
#Finally we can rename the columns.
daca_countries<-daca_countries %>% rename(Country=V1,Population=V2,Percent=V3)
head(daca_countries)
## # A tibble: 6 x 3
## Country Population Percent
## <chr> <int> <dbl>
## 1 NA NA
## 2 Total 689800 100.0
## 3 Mexico 548000 79.4
## 4 El Salvador 25900 3.7
## 5 Guatemala 17700 2.6
## 6 Honduras 16100 2.3
tail(daca_countries)
## # A tibble: 6 x 3
## Country Population Percent
## <chr> <int> <dbl>
## 1 Afghanistan 10 0.0
## 2 Azerbaijan 10 0.0
## 3 Moldova 10 0.0
## 4 Namibia 10 0.0
## 5 Other Countries 200 0.0
## 6 Not Available 1800 0.3
There are so many countries that the data are hard to visualize, so taking a subset may be useful.
par(las=2)
barplot(daca_countries$Percent,names.arg =daca_countries$Country,horiz = TRUE,cex.names = 0.1)
summary(daca_countries$Population[-2])
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 10 35 110 4570 350 548000 1
The summary of the Population data shows that the data are highly right-skewed with a median of 110 but a mean of 4570.
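A log-scale histogram makes that skew visible; here is a quick sketch (the total row is dropped as in the summary above, and hist() ignores the NA row):
#Sketch: histogram of log10 population across countries of origin.
hist(log10(daca_countries$Population[-2]), main = "", xlab = "log10(Population)")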
daca_highest_countries <- daca_countries %>% filter(Percent >= 0.3, Percent < 100.0)
## Warning: package 'bindrcpp' was built under R version 3.4.1
daca_highest_countries
## # A tibble: 18 x 3
## Country Population Percent
## <chr> <int> <dbl>
## 1 Mexico 548000 79.4
## 2 El Salvador 25900 3.7
## 3 Guatemala 17700 2.6
## 4 Honduras 16100 2.3
## 5 Peru 7420 1.1
## 6 Korea, South 7310 1.1
## 7 Brazil 5780 0.8
## 8 Ecuador 5460 0.8
## 9 Colombia 5020 0.7
## 10 Argentina 3970 0.6
## 11 Philippines 3880 0.6
## 12 India 2640 0.4
## 13 Jamaica 2640 0.4
## 14 Venezuela 2480 0.4
## 15 Dominican Republic 2430 0.4
## 16 Uruguay 1930 0.3
## 17 Trinidad & Tobago 1930 0.3
## 18 Not Available 1800 0.3
par(las=2)
barplot(daca_highest_countries$Percent,names.arg =daca_highest_countries$Country,horiz = TRUE,cex.names = 0.4)
daca_countries_no_outliers <- daca_countries %>% filter(Population <= 1000)
boxplot(daca_countries_no_outliers$Population)
We see from the bar plot that Mexico is by far the largest contributor to the DACA program. Most of the other top contributors are in the Western Hemisphere, with South Korea, India, and the Philippines being the only Eastern Hemisphere countries that contribute more than 0.3% of the DACA population. The box plot shows that the majority of countries of origin in the DACA program account for fewer than 1000 people each.
We will now examine the States of Residence of DACA recipients.
daca_states <- "daca_population_data2.csv" %>% read.table(sep = ",", skip=174, nrows=59, quote = "\"") %>% tbl_df()
#Now that we've read the data we can put the commas back in.
daca_states$V1<-daca_states$V1 %>% str_replace_all(";", ",")
#We can also remove the commas in the numbers and set them as integers.
daca_states$V2<-daca_states$V2 %>% str_replace_all(",", "")%>% as.integer()
#Finally we can rename the columns.
daca_states<-daca_states %>% rename(State=V1,Population=V2,Percent=V3)
head(daca_states)
## # A tibble: 6 x 3
## State Population Percent
## <chr> <int> <dbl>
## 1 California 197900 28.7
## 2 Texas 113000 16.4
## 3 Illinois 35600 5.2
## 4 New York 32900 4.8
## 5 Florida 27000 3.9
## 6 Arizona 25500 3.7
tail(daca_states)
## # A tibble: 6 x 3
## State Population Percent
## <chr> <int> <dbl>
## 1 Guam 20 0.0
## 2 Northern Mariana Islands 10 0.0
## 3 Vermont 10 0.0
## 4 Other territories 10 0.0
## 5 Not Available 4180 0.6
## 6 NA NA
summary(daca_states$Population)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 10.0 352.5 4090.0 11892.9 8975.0 197900.0 1
Again, the median and mean are offset, meaning that the data are right-skewed, with the mean higher than the median. This means that a few states are home to a much larger pool of DACA recipients than the rest. You can see several outliers with population > 20,000 in the boxplot, with the top two states omitted for clarity.
daca_states_no_outliers <- daca_states %>% filter(Population < 50000)
boxplot(daca_states_no_outliers$Population)
par(las=2)
daca_top_states <- daca_states %>% filter(Percent >= 2.4)
barplot(daca_top_states$Percent,names.arg =daca_top_states$State,horiz = TRUE,cex.names = 0.5)
We see that California and Texas have the largest populations of DACA recipients. Both states were once part of Mexico, share a border with Mexico, and as a result have large Hispanic communities; Florida and Arizona also have large Hispanic communities. The list further contains states with large cities, such as Illinois and New York. This is not a surprising result, given that the majority of DACA recipients come from Latin American countries, particularly Mexico: immigrants tend to settle where there is economic opportunity and in communities that share their native language and culture.
daca_cities <- "daca_population_data2.csv"%>% read.table(sep = ",", skip=248, nrows=91, quote = "\"")%>% tbl_df()
#Now that we've read the data we can put the commas back in.
daca_cities$V1<-daca_cities$V1 %>% str_replace_all(";", ",")
#We can also remove the commas in the numbers and set them as integers.
daca_cities$V2<-daca_cities$V2 %>% str_replace_all(",", "") %>% as.integer()
#Finally we can rename the columns.
daca_cities<-daca_cities %>% rename(Metro=V1,Population=V2,Percent=V3)
head(daca_cities)
## # A tibble: 6 x 3
## Metro Population Percent
## <chr> <int> <dbl>
## 1 Los Angeles-Long Beach-Anaheim,CA 89900 13.0
## 2 New York-Newark-Jersey City,NY-NJ-PA 47200 6.8
## 3 Dallas-Fort Worth-Arlington,TX 36700 5.3
## 4 Houston-The Woodlands-Sugar Land,TX 35800 5.2
## 5 Chicago-Naperville-Elgin,IL-IN-WI 34100 5.0
## 6 Riverside-San Bernardino-Ontario,CA 22300 3.2
tail(daca_cities)
## # A tibble: 6 x 3
## Metro Population Percent
## <chr> <int> <dbl>
## 1 Cincinnati,OH-KY-IN 1000 0.2
## 2 St. Louis,MO-IL 1000 0.2
## 3 Lexington-Fayette,KY 1000 0.2
## 4 Other CBSA 86700 12.6
## 5 Non-CBSA 12300 1.8
## 6 Not Available 6600 1.0
summary(daca_cities$Population)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 1600 2800 7580 6050 89900
Metropolitan areas of residence follow the same pattern as above. The difference between the mean and median suggests that a few metropolitan areas are home to a much larger share of DACA recipients. These metro areas lie in the states with the larger DACA populations, reflecting those states' histories and proximity to the largest countries of origin.
par(las=2)
daca_top_cities <- daca_cities %>% filter(Percent >= 2.2)
barplot(daca_top_cities$Percent,names.arg =daca_top_cities$Metro,horiz = TRUE,cex.names = 0.2)
daca_top_cities
## # A tibble: 10 x 3
## Metro Population Percent
## <chr> <int> <dbl>
## 1 Los Angeles-Long Beach-Anaheim,CA 89900 13.0
## 2 New York-Newark-Jersey City,NY-NJ-PA 47200 6.8
## 3 Dallas-Fort Worth-Arlington,TX 36700 5.3
## 4 Houston-The Woodlands-Sugar Land,TX 35800 5.2
## 5 Chicago-Naperville-Elgin,IL-IN-WI 34100 5.0
## 6 Riverside-San Bernardino-Ontario,CA 22300 3.2
## 7 Phoenix-Mesa-Scottsdale,AZ 22000 3.2
## 8 Atlanta-Sandy Springs-Roswell,GA 15700 2.3
## 9 San Francisco-Oakland-Hayward,CA 15500 2.3
## 10 Other CBSA 86700 12.6
Once more we see a pattern consistent with those above: the metro areas with the most DACA recipients are in the states with the most DACA recipients. The interesting feature of these data is that nearly as many DACA recipients live in smaller metropolitan areas, here lumped together as “Other CBSA” (Core Based Statistical Area), as in the largest single metro area, Los Angeles.
daca_sex <- "daca_population_data2.csv"%>% read.table(sep = ",", skip=356, nrows=3, quote = "\"") %>% tbl_df()
#There are no commas in these category names, so we only need to remove the
#commas in the numbers and set them as integers.
daca_sex$V2<-daca_sex$V2 %>% str_replace_all(",", "")%>% as.integer()
#Finally we can rename the columns.
daca_sex<-daca_sex %>% rename(Sex=V1,Population=V2,Percent=V3)
daca_sex
## # A tibble: 3 x 3
## Sex Population Percent
## <fctr> <int> <dbl>
## 1 Female 362700 52.6
## 2 Male 326900 47.4
## 3 Not available 200 0.0
par(las=2)
barplot(daca_sex$Population,names.arg =daca_sex$Sex,horiz = TRUE,cex.names = 0.5)
It appears that there are noticeably more female DACA recipients than male, and I'm curious as to why. For comparison, according to US census data (https://www.census.gov/prod/cen2010/briefs/c2010br-03.pdf), males outnumber females in every age bracket up through 20-24, and as we will see shortly the median age of the DACA population is 23. Females account for roughly 48.9% of the US population in those brackets (computed below). We can test the difference with a two-tailed test of a population proportion, following http://www.r-tutor.com/elementary-statistics/hypothesis-testing/two-tailed-test-population-proportion
#Counts by sex from the 2010 Census brief linked above, summed over the
#five-year age brackets up through 20-24.
males <- 10319427+10389638+10579862+11303666+11014176
females <- 9881935+9959019+10097332+10736677+10571823
n = sum(daca_sex$Population)
p = daca_sex$Population[1]/n
p
## [1] 0.5258046
p0 = females/(males+females)
p0
## [1] 0.4887463
z = (p-p0)/sqrt(p0*(1-p0)/n)
z
## [1] 61.57253
The typical Z-score for significance at the 95% level is \(\pm 1.96\):
alpha = 0.05
z_alpha_half = qnorm(1-alpha/2)
z_alpha_half
## [1] 1.959964
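Since \(z = 61.6\) is far beyond \(\pm 1.96\), we reject the null hypothesis that the DACA female share matches the census baseline. As a cross-check (a sketch, not part of the original analysis), base R's prop.test runs the same one-sample proportion test:
#Sketch: one-sample proportion test against the census baseline p0.
#correct = FALSE matches the plain z-test computed above.
prop.test(x = daca_sex$Population[1], n = n, p = p0,
          alternative = "two.sided", correct = FALSE)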
There is a significant gender imbalance in the DACA program. My hypothesis as to why is that young Latin American men are more likely to be arrested, which matters because of the DACA FAQ requirement: “have not been convicted of a felony offense, a significant misdemeanor, or more than three misdemeanors of any kind;” http://www.immigrationequality.org/get-legal-help/our-legal-resources/path-to-status-in-the-u-s/daca-deferred-action-for-childhood-arrivals/
Hispanics make up 27.5% of NYC's population (1), for example, but 36.1% of arrests (2), and men are nearly 3 times more likely to be arrested than women (3).
(1)https://en.wikipedia.org/wiki/Demographics_of_New_York_City
daca_age <- "daca_population_data2.csv" %>% read.table(sep = ",", skip=363, nrows=6, quote = "\"")%>% tbl_df()
#There are no commas in these category names, so we only need to remove the
#commas in the numbers and set them as integers.
daca_age$V2<-daca_age$V2 %>% str_replace_all(",", "") %>% as.integer()
#Finally we can rename the columns.
daca_age<-daca_age %>% rename(Age=V1,Population=V2,Percent=V3)
daca_age
## # A tibble: 6 x 3
## Age Population Percent
## <fctr> <int> <dbl>
## 1 Under 16 2000 0.3
## 2 16-20 196500 28.5
## 3 21-25 253100 36.7
## 4 26-30 163400 23.7
## 5 31-36 74900 10.9
## 6 Not available 5 0.0
daca_age_stats <- "daca_population_data2.csv" %>% read.table(sep = ",", skip=370, nrows=3, quote = "\"")%>%tbl_df()
daca_age_stats<-daca_age_stats %>% rename(Stat=V1,Age=V2)
#We need to drop an empty column from this table
daca_age_stats <- daca_age_stats %>% select(-V3)
daca_age_stats
## # A tibble: 3 x 2
## Stat Age
## <fctr> <fctr>
## 1 Average Age 23.8
## 2 Median Age 23
## 3 Interquartile Range 20 to 27
par(las=2)
barplot(daca_age$Population,names.arg =daca_age$Age,horiz = TRUE,cex.names = 0.5)
The age requirements for DACA are:
You were under the age of 31 as of June 15, 2012;
You entered the United States prior to your 16th birthday;
You have resided in the United States since June 15, 2007 and currently are present in the U.S.;
You were in the United States on June 15, 2012 and must be physically in the U.S. at the time of filing for your request for deferred action.
These data are as of September 4th, 2017. The lack of DACA recipients under 16 is striking. This may be due to parents being reluctant to sign their minor children up, out of fear of immigration action.
daca_mrd <- "daca_population_data2.csv" %>% read.table(sep = ",", skip=390, nrows=5, quote = "\"") %>% tbl_df()
#There are no commas in these category names, so we only need to remove the
#commas in the numbers and set them as integers.
daca_mrd$V2<-daca_mrd$V2 %>% str_replace_all(",", "")%>% as.integer()
#Finally we can rename the columns.
daca_mrd<-daca_mrd %>% rename(Status=V1,Population=V2,Percent=V3)
daca_mrd
## # A tibble: 5 x 3
## Status Population Percent
## <fctr> <int> <dbl>
## 1 Single 571600 82.9
## 2 Married 105700 15.3
## 3 Divorced 7800 1.1
## 4 Widowed 300 0.1
## 5 Not available 4400 0.6
par(las=2)
barplot(daca_mrd$Percent,names.arg =daca_mrd$Status,horiz = TRUE,cex.names = 0.5)
DACA is targeted at young people, so the low marriage rate is not surprising. Given that those who are married here are young, the low divorce rate is somewhat surprising. This may be a result of fairly new marriages and the large share of Latinos in the DACA population, who tend to have lower divorce rates overall: http://www.healthymarriageinfo.org/research-and-policy/marriage-facts/culture/Hispanics-and-Latinos/index.aspx
My impression of these data is that the DACA program has been very successful at giving young undocumented immigrants an opportunity to live and work legally in the United States. The imbalance in the sex distribution does warrant further investigation, as it may be due to institutional racism and sexism external to the program itself. Otherwise there is no evidence in these data that the program merited suspension by the current administration.
For the next set of data I'll be scraping a webpage, posted by me, as this was my research group when I was at Purdue: http://www.physics.purdue.edu/astro/MOJAVE/velocitytable.html. It is a sample of Active Galactic Nuclei (AGN) called blazars. A better-known subclass is the quasars; in fact some blazars are quasars, albeit more compact and more luminous. What I am after is just the objects' names and their jet feature speeds in units of the speed of light (c). What makes blazars a special type of AGN is that the central supermassive black holes eject plasma clouds (the jet features) at close to the speed of light. Because we are looking nearly straight down the jet, relativistic light-travel-time effects make these features appear to move faster than light in our frame of reference.
Please note that I am very familiar with loading data into a program using SQL databases and .csv files, as that was part of my job at Purdue; I have some Perl scripts posted to my GitHub written for just that purpose. I really wanted to give web scraping a try, as I had never done it before.
First we will use rvest:
library(rvest)
## Warning: package 'rvest' was built under R version 3.4.2
## Loading required package: xml2
## Warning: package 'xml2' was built under R version 3.4.1
Next, using https://blog.rstudio.com/2014/11/24/rvest-easy-web-scraping-with-r/ as a guide, we will set up the web scraping:
url <- "http://www.physics.purdue.edu/astro/MOJAVE/velocitytable.html"
mojave <- url %>% read_html() %>% html_node(xpath='//*[@id="myTable"]') %>% html_table()
head(mojave)
## B1950\nName ID Epochs Flux Density (mJy) Distance (mas) Distance (pc)
## 1 0003+380 1 9 5 4.24 15.40
## 2 0003+380 2 7 17 1.81 6.57
## 3 0003+380 4 6 17 1.25 4.54
## 4 0003+380 5 9 39 0.76 2.77
## 5 0003+380 6 10 98 0.39 1.43
## 6 0003-066 2 5 222 1.05 5.12
## P.A. (deg.) Velocity P.A.(deg.) Vector Motion Offset (deg.)
## 1 120.5 97 ± 16 24 ± 16
## 2 112.6 119.7 ± 2.6 7.1 ± 2.7
## 3 114.8 208 ± 12 93 ± 12c
## 4 117.5 331 ± 119 146 ± 119
## 5 115.4 335 ± 41 141 ± 41
## 6 322.9 226.3 ± 4.8 96.6 ± 5.0c
## Speed (muas/y) Speed (c) Perp. Accel (muas/y2) Par. Accel (muas/y2)
## 1 157 ± 42 2.29±0.61
## 2 319 ± 22 4.64±0.32
## 3 36.3 ± 9.2 0.53±0.13
## 4 4.8 ± 7.2e 0.07±0.10
## 5 12.7 ± 8.3e 0.19±0.12
## 6 191 ± 15 4.09±0.33
## Ejection Epoch Reference epoch t_R.A. t_dec.
## 1 2008.805 1985.09 1888.98
## 2 2007.705 2001.68 2003.32
## 3 2009.535 2075.66 1992.93
## 4 2010.262 2300.95 2092.52
## 5 2009.897 2073.84 2023.64
## 6 1997.798 1993.56 2004.48
Now we want to get just the names, the Flux Density of the features (how bright they are at the telescope over a band of frequencies), and their distances in parsecs (how far each feature sits from the core; note 1 pc ≈ 3.26 light-years). Maybe there is a relationship between how bright a feature is and how far it is from the core.
mojave_flux <- mojave %>% select(`B1950
Name`, `Flux Density (mJy)`,`Distance (mas)`,`Distance (pc)`)
#We need to drop the NA's, since some sources do not have reliable distance-to-source measurements.
mojave_flux <- mojave_flux %>% drop_na()
head(mojave_flux)
## B1950\nName Flux Density (mJy) Distance (mas) Distance (pc)
## 1 0003+380 5 4.24 15.40
## 2 0003+380 17 1.81 6.57
## 3 0003+380 17 1.25 4.54
## 4 0003+380 39 0.76 2.77
## 5 0003+380 98 0.39 1.43
## 6 0003-066 222 1.05 5.12
#The first test looks at the linear distance of the feature from the core vs. brightness at the telescope.
fit <- lm(log(mojave_flux$`Distance (pc)`, base = 10)~log(mojave_flux$`Flux Density (mJy)`, base =10))
#Log-log transformations were used, as the data did not appear linear untransformed.
summary(fit)
##
## Call:
## lm(formula = log(mojave_flux$`Distance (pc)`, base = 10) ~ log(mojave_flux$`Flux Density (mJy)`,
## base = 10))
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.36953 -0.30004 0.02323 0.34426 1.77375
##
## Coefficients:
## Estimate Std. Error
## (Intercept) 1.41846 0.05687
## log(mojave_flux$`Flux Density (mJy)`, base = 10) -0.26923 0.02851
## t value Pr(>|t|)
## (Intercept) 24.944 <2e-16 ***
## log(mojave_flux$`Flux Density (mJy)`, base = 10) -9.445 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5762 on 1235 degrees of freedom
## Multiple R-squared: 0.06737, Adjusted R-squared: 0.06661
## F-statistic: 89.21 on 1 and 1235 DF, p-value: < 2.2e-16
plot(log(mojave_flux$`Flux Density (mJy)`,base=10), log(mojave_flux$`Distance (pc)`,base=10), col="blue")
abline(fit, col="red")
#Calculating linear distance from angular size is model dependent, so it may be better to use angular size instead.
#Both Flux Density and Distance in milli-arc-seconds (1/3,600,000th of a degree) are direct measurements.
fit2 <- lm(log(mojave_flux$`Distance (mas)`,base = 10)~log(mojave_flux$`Flux Density (mJy)`,base=10))
summary(fit2)
##
## Call:
## lm(formula = log(mojave_flux$`Distance (mas)`, base = 10) ~ log(mojave_flux$`Flux Density (mJy)`,
## base = 10))
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.0045 -0.3087 -0.0217 0.2668 1.5226
##
## Coefficients:
## Estimate Std. Error
## (Intercept) 0.81006 0.04302
## log(mojave_flux$`Flux Density (mJy)`, base = 10) -0.28279 0.02156
## t value Pr(>|t|)
## (Intercept) 18.83 <2e-16 ***
## log(mojave_flux$`Flux Density (mJy)`, base = 10) -13.12 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4359 on 1235 degrees of freedom
## Multiple R-squared: 0.1223, Adjusted R-squared: 0.1215
## F-statistic: 172 on 1 and 1235 DF, p-value: < 2.2e-16
plot(log(mojave_flux$`Flux Density (mJy)`,base=10), log(mojave_flux$`Distance (mas)`,base=10), col="violet")
abline(fit2, col="green")
There is a statistically significant tendency for features closer to the core to be brighter. If we were preparing these data for publication, my next step would be to use non-parametric correlation coefficients and partial correlation coefficients to factor out redshift, which is a proxy for distance to the source, since the equations relating these measurements are likely to be non-linear. Additionally, this could just reflect our ability to resolve smaller separations only in closer sources.
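As a sketch of that next step (Spearman's rank correlation only; the redshifts needed for a partial correlation are not in this table):
#Sketch: non-parametric check of the distance-brightness relation.
#Spearman's rho assumes only a monotonic relationship, not a (log-)linear one.
cor.test(mojave_flux$`Flux Density (mJy)`, mojave_flux$`Distance (pc)`,
         method = "spearman")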
I now want to look at the speeds of individual jet features.
mojave_speeds <- mojave %>% select(`B1950
Name`, `Flux Density (mJy)` , `Speed (c)`)
head(mojave_speeds)
## B1950\nName Flux Density (mJy) Speed (c)
## 1 0003+380 5 2.29±0.61
## 2 0003+380 17 4.64±0.32
## 3 0003+380 17 0.53±0.13
## 4 0003+380 39 0.07±0.10
## 5 0003+380 98 0.19±0.12
## 6 0003-066 222 4.09±0.33
We need to use the \(\pm\) symbol to separate out the uncertainty, and drop the missing data.
mjv_sep_speeds <- mojave_speeds %>% separate('Speed (c)', into= c("Speed (c)" , "Error (c)"), sep="±", remove=TRUE) %>% drop_na()
## Warning: Too few values at 58 locations: 14, 15, 66, 67, 68, 94, 95, 114,
## 115, 116, 117, 145, 146, 147, 148, 149, 150, 151, 152, 208, ...
mjv_sep_speeds$`Speed (c)` <- as.numeric(mjv_sep_speeds$`Speed (c)`)
mjv_sep_speeds$`Error (c)` <- as.numeric(mjv_sep_speeds$`Error (c)`)
mjv_sep_speeds$`Speed (c)` %>% summary()
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.140 4.890 6.537 9.800 39.700
head(mjv_sep_speeds)
## B1950\nName Flux Density (mJy) Speed (c) Error (c)
## 1 0003+380 5 2.29 0.61
## 2 0003+380 17 4.64 0.32
## 3 0003+380 17 0.53 0.13
## 4 0003+380 39 0.07 0.10
## 5 0003+380 98 0.19 0.12
## 6 0003-066 222 4.09 0.33
fit3 <- lm(mjv_sep_speeds$`Speed (c)`~ log(mjv_sep_speeds$`Flux Density (mJy)`,base=10))
summary(fit3)
##
## Call:
## lm(formula = mjv_sep_speeds$`Speed (c)` ~ log(mjv_sep_speeds$`Flux Density (mJy)`,
## base = 10))
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.709 -5.370 -1.681 3.317 33.191
##
## Coefficients:
## Estimate Std. Error
## (Intercept) 6.1684 0.6386
## log(mjv_sep_speeds$`Flux Density (mJy)`, base = 10) 0.1931 0.3201
## t value Pr(>|t|)
## (Intercept) 9.659 <2e-16 ***
## log(mjv_sep_speeds$`Flux Density (mJy)`, base = 10) 0.603 0.546
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.471 on 1235 degrees of freedom
## Multiple R-squared: 0.0002946, Adjusted R-squared: -0.0005149
## F-statistic: 0.3639 on 1 and 1235 DF, p-value: 0.5465
plot(log(mjv_sep_speeds$`Flux Density (mJy)`,base=10), mjv_sep_speeds$`Speed (c)`, col="cyan")
abline(fit3, col = "magenta")
Since there are jet features with zero speed, we had to use a semi-log plot, because \(\lim_{x \rightarrow 0} \log{x} = -\infty\). These data are widely scattered, the trend line is flat, and there is no statistically significant relationship between jet feature speed and feature brightness. In other words, there is no connection between the radiative output of the plasma (its brightness) and its bulk motion (jet speed).
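If a fully log-log plot were wanted despite the zero speeds, one option (a sketch) is log1p(), which is finite at zero:
#Sketch: log1p(x) = log(1 + x), so zero-speed features stay on the plot.
plot(log(mjv_sep_speeds$`Flux Density (mJy)`,base=10), log1p(mjv_sep_speeds$`Speed (c)`), col="cyan", ylab="log(1 + Speed (c))")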
For the last data set of this project we will analyze NYPD traffic data. This was posted to the discussion board by Yuen Chun Wong, along with several questions Yuen asked us to look into.
For the specific data that Yuen posted, I see nothing that measures the time of collision; in fact these data are specific to September 2017, so the first and last items are not measurable from these data. I am, however, interested in risk to cyclists and pedestrians, so I will compare cyclist injuries/deaths to motorist injuries/deaths, and then pedestrian injuries/deaths to motorist injuries/deaths. The tables that record vehicle type and location, and collision cause and location, are organized very similarly; I will focus on vehicle type and location, as the tidyr and dplyr steps to organize the data would be identical.
#The first rows are the column titles, so I skip them and use rename() instead.
nypd_traffic_coll <- "cityacc-en-us.csv" %>% read.table(sep = ",", skip=4, nrows=83, quote = "\"") %>% tbl_df()
nypd_traffic_coll
## # A tibble: 83 x 14
## V1 V2 V3 V4 V5 V6 V7 V8 V9
## <fctr> <fctr> <int> <int> <int> <int> <int> <int> <int>
## 1 C CITYWIDE 18727 37086 3701 2325 9 1648 2
## 2 M MANHATTAN 3916 7571 579 249 1 205 0
## 3 001 1st Precinct 288 561 42 24 0 12 0
## 4 005 5th Precinct 203 376 40 12 0 14 0
## 5 006 6th Precinct 130 249 18 6 0 2 0
## 6 007 7th Precinct 102 195 22 9 0 8 0
## 7 009 9th Precinct 126 241 21 9 0 6 0
## 8 010 10th Precinct 246 479 20 7 0 7 0
## 9 013 13th Precinct 220 414 47 16 0 11 0
## 10 014 Midtown South Precinct 323 630 28 9 0 2 0
## # ... with 73 more rows, and 5 more variables: V10 <int>, V11 <int>,
## # V12 <int>, V13 <int>, V14 <int>
nypd_traffic_coll <- nypd_traffic_coll %>% rename(GeoCode=V1, GeoCodeLabel=V2, Number_of_Motor_Vehicle_Collisions=V3,Vehicles_or_Motorists_Involved=V4, Injury_or_Fatal_Collisions=V5, MotoristsInjured=V6, MotoristsKilled=V7, PassengInjured=V8,PassengKilled=V9, CyclistsInjured=V10, CyclistsKilled=V11, PedestrInjured=V12,PedestrKilled=V13, Bicycle=V14)
head(nypd_traffic_coll)
## # A tibble: 6 x 14
## GeoCode GeoCodeLabel Number_of_Motor_Vehicle_Collisions
## <fctr> <fctr> <int>
## 1 C CITYWIDE 18727
## 2 M MANHATTAN 3916
## 3 001 1st Precinct 288
## 4 005 5th Precinct 203
## 5 006 6th Precinct 130
## 6 007 7th Precinct 102
## # ... with 11 more variables: Vehicles_or_Motorists_Involved <int>,
## # Injury_or_Fatal_Collisions <int>, MotoristsInjured <int>,
## # MotoristsKilled <int>, PassengInjured <int>, PassengKilled <int>,
## # CyclistsInjured <int>, CyclistsKilled <int>, PedestrInjured <int>,
## # PedestrKilled <int>, Bicycle <int>
We will compare risk by analyzing, for each road-user type, the share of collisions that result in death or injury, and by normalizing counts by borough population.
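As a quick sketch of the first measure, using the citywide row shown above:
#Sketch: share of citywide collisions that caused injury or death
#(about 0.198, or roughly 1 in 5).
nypd_traffic_coll$Injury_or_Fatal_Collisions[1]/nypd_traffic_coll$Number_of_Motor_Vehicle_Collisions[1]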
First, a general overview of the whole city and the five boroughs:
city_burr <- nypd_traffic_coll %>% filter(GeoCode=="C"|GeoCode=="M"|GeoCode=="Q"|GeoCode=="K"|GeoCode=="B"|GeoCode=="S")
city_burr
## # A tibble: 6 x 14
## GeoCode GeoCodeLabel Number_of_Motor_Vehicle_Collisions
## <fctr> <fctr> <int>
## 1 C CITYWIDE 18727
## 2 M MANHATTAN 3916
## 3 B BRONX 2979
## 4 K BROOKLYN 5367
## 5 Q QUEENS 5421
## 6 S STATEN ISLAND 1044
## # ... with 11 more variables: Vehicles_or_Motorists_Involved <int>,
## # Injury_or_Fatal_Collisions <int>, MotoristsInjured <int>,
## # MotoristsKilled <int>, PassengInjured <int>, PassengKilled <int>,
## # CyclistsInjured <int>, CyclistsKilled <int>, PedestrInjured <int>,
## # PedestrKilled <int>, Bicycle <int>
First we create a subset of the data, and then create a bar plot.
city_burr_col <- city_burr %>% select(GeoCode, Number_of_Motor_Vehicle_Collisions)
city_burr_col$Number_of_Motor_Vehicle_Collisions %>% barplot(col=c("black", "red", "orange", "yellow", "green", "blue"), names.arg=city_burr_col$GeoCode)
According to this plot, Brooklyn and Queens have the most collisions, but they are also the most populous boroughs. We will bring in some additional data, NYC population by borough, to help with this analysis:
url2 <- "https://www.citypopulation.de/php/usa-newyorkcity.php"
city_pop <- url2 %>% read_html() %>% html_node(xpath='//*[@id="ts"]') %>% html_table()
city_pop
## Name Status PopulationCensus1990-04-01
## 1 Bronx Borough 1,203,789
## 2 Brooklyn (Kings County) Borough 2,300,664
## 3 Manhattan (New York County) Borough 1,487,536
## 4 Queens Borough 1,951,598
## 5 Staten Island (Richmond County) Borough 378,977
## 6 New York City City 7,322,564
## PopulationCensus2000-04-01 PopulationCensus2010-04-01
## 1 1,332,244 1,385,107
## 2 2,465,689 2,504,706
## 3 1,538,096 1,585,874
## 4 2,229,394 2,230,545
## 5 443,762 468,730
## 6 8,009,185 8,174,962
## PopulationEstimate2016-07-01
## 1 1,455,720 <U+2192>
## 2 2,629,150 <U+2192>
## 3 1,643,734 <U+2192>
## 4 2,333,054 <U+2192>
## 5 476,015 <U+2192>
## 6 8,537,673
#I did the part below by hand, as the two tables are ordered differently and it wasn't obvious to me how to sort them into the same order.
city_burr$Pop <- c(city_pop[6,6], city_pop[3,6], city_pop[1,6] ,city_pop[2,6], city_pop[4,6], city_pop[5,6])
city_burr$Pop<-city_burr$Pop %>% str_replace_all(",", "")%>% as.integer()
city_burr<- city_burr %>% mutate(Collisions_per_Pop = Number_of_Motor_Vehicle_Collisions/Pop)
city_burr
## # A tibble: 6 x 16
## GeoCode GeoCodeLabel Number_of_Motor_Vehicle_Collisions
## <fctr> <fctr> <int>
## 1 C CITYWIDE 18727
## 2 M MANHATTAN 3916
## 3 B BRONX 2979
## 4 K BROOKLYN 5367
## 5 Q QUEENS 5421
## 6 S STATEN ISLAND 1044
## # ... with 13 more variables: Vehicles_or_Motorists_Involved <int>,
## # Injury_or_Fatal_Collisions <int>, MotoristsInjured <int>,
## # MotoristsKilled <int>, PassengInjured <int>, PassengKilled <int>,
## # CyclistsInjured <int>, CyclistsKilled <int>, PedestrInjured <int>,
## # PedestrKilled <int>, Bicycle <int>, Pop <int>,
## # Collisions_per_Pop <dbl>
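For reference, a programmatic alternative to the hand matching above would look like the sketch below; the name map is typed from the two tables and is my assumption about how the labels correspond.
#Sketch: align the two tables with an explicit name map instead of by hand.
name_map <- c("CITYWIDE"="New York City", "MANHATTAN"="Manhattan (New York County)",
              "BRONX"="Bronx", "BROOKLYN"="Brooklyn (Kings County)",
              "QUEENS"="Queens", "STATEN ISLAND"="Staten Island (Richmond County)")
idx <- match(name_map[as.character(city_burr$GeoCodeLabel)], city_pop$Name)
#Equivalent to the hand assignment above:
#city_burr$Pop <- city_pop[idx, 6] %>% str_replace_all(",", "") %>% as.integer()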
city_burr$Collisions_per_Pop %>% barplot(col=c("black", "red", "orange", "yellow", "green", "blue"), names.arg=city_burr_col$GeoCode)
When we factor in the population of each borough, the collision numbers are very similar, all of them around 2 per 1,000 residents (2 parts per mille).
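A quick sketch expressing those figures directly in parts per mille:
#Sketch: collisions per 1,000 residents for the city and each borough.
round(city_burr$Collisions_per_Pop * 1000, 2)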
Not all car collisions result in an injury or death. We will now look at the car crashes that do.
car_injury <- city_burr %>% select(GeoCode, Injury_or_Fatal_Collisions, MotoristsInjured, PassengInjured, MotoristsKilled, PassengKilled, Pop) %>% mutate(Total_per_pop = Injury_or_Fatal_Collisions/Pop, Total_injured = MotoristsInjured+PassengInjured, Total_injured_per_pop = Total_injured/Pop, Total_killed = MotoristsKilled + PassengKilled, Total_killed_per_pop = Total_killed/Pop, Killed_per_Injury = Total_killed/Total_injured)
car_injury
## # A tibble: 6 x 13
## GeoCode Injury_or_Fatal_Collisions MotoristsInjured PassengInjured
## <fctr> <int> <int> <int>
## 1 C 3701 2325 1648
## 2 M 579 249 205
## 3 B 663 448 335
## 4 K 1174 679 544
## 5 Q 1106 811 492
## 6 S 179 138 72
## # ... with 9 more variables: MotoristsKilled <int>, PassengKilled <int>,
## # Pop <int>, Total_per_pop <dbl>, Total_injured <int>,
## # Total_injured_per_pop <dbl>, Total_killed <int>,
## # Total_killed_per_pop <dbl>, Killed_per_Injury <dbl>
car_injury$Total_per_pop %>% barplot(col=c("black", "red", "orange", "yellow", "green", "blue"), names.arg=car_injury$GeoCode ,ylab = "Injuries and Deaths per Population")
car_injury$Total_injured_per_pop %>% barplot(col=c("black", "red", "orange", "yellow", "green", "blue"), names.arg=car_injury$GeoCode ,ylab = "Injuries per Population")
car_injury$Total_killed_per_pop %>% barplot(col=c("black", "red", "orange", "yellow", "green", "blue"), names.arg=car_injury$GeoCode ,ylab = "Deaths per Population")
car_injury$Killed_per_Injury %>% barplot(col=c("black", "red", "orange", "yellow", "green", "blue"), names.arg=car_injury$GeoCode ,ylab = "Deaths per Injury")
According to these data, about 1 in 5 car collisions results in injury or death, and about 6 in 10,000 result in the death of a motorist or passenger.
#Averaged over all boroughs; the citywide totals (row 1) are excluded from the vectors.
#Note: length(city_burr[2:6]) counts the 5 selected columns, which happens to equal
#the 5 boroughs; nrow(city_burr) - 1 would state the intent more directly.
length(city_burr[2:6])
## [1] 5
sum(city_burr$Injury_or_Fatal_Collisions[2:6]/city_burr$Number_of_Motor_Vehicle_Collisions[2:6])/length(city_burr[2:6])
## [1] 0.1929269
sum(car_injury$Total_killed[2:6]/city_burr$Number_of_Motor_Vehicle_Collisions[2:6])/length(city_burr[2:6])
## [1] 0.0005719659
Our first step again is to create an appropriate subset of the data and create a bar plot.
city_cyc_col <- city_burr %>% select(GeoCode, CyclistsInjured, CyclistsKilled, Bicycle, Pop) %>% mutate(Bicycle_per_Pop=Bicycle/Pop, Total_injured_killed = CyclistsInjured+CyclistsKilled, Injured_killed_per_pop = Total_injured_killed/Pop)
city_cyc_col
## # A tibble: 6 x 8
## GeoCode CyclistsInjured CyclistsKilled Bicycle Pop Bicycle_per_Pop
## <fctr> <int> <int> <int> <int> <dbl>
## 1 C 477 1 606 8537673 7.097953e-05
## 2 M 130 0 183 1643734 1.113319e-04
## 3 B 57 0 68 1455720 4.671228e-05
## 4 K 188 1 235 2629150 8.938250e-05
## 5 Q 90 0 109 2333054 4.671988e-05
## 6 S 12 0 11 476015 2.310852e-05
## # ... with 2 more variables: Total_injured_killed <int>,
## # Injured_killed_per_pop <dbl>
city_cyc_col$Bicycle_per_Pop %>% barplot(col=c("black", "red", "orange", "yellow", "green", "blue"), names.arg=city_cyc_col$GeoCode, ylab="Bicycle Collisions per Population")
city_cyc_col$Injured_killed_per_pop %>% barplot(col=c("black", "red", "orange", "yellow", "green", "blue"), names.arg=city_cyc_col$GeoCode,ylab="Bicycle Collision Injuries and Deaths per Population")
The bicycle collision data differ from the motor vehicle collisions in that, after factoring in population, the chance of a collision is higher in Manhattan and Brooklyn than in the other boroughs. The per-capita bicycle collision rates run from roughly 20 to 110 parts per million, making such collisions roughly 20 to 95 times less likely than motor vehicle collisions, depending on the borough. These data suggest that Manhattan and Brooklyn have more cyclists than the other boroughs.
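A sketch of that comparison, borough by borough (citywide row excluded):
#Sketch: ratio of motor-vehicle to bicycle collision rates in each borough (M, B, K, Q, S).
city_burr$Collisions_per_Pop[2:6] / city_cyc_col$Bicycle_per_Pop[2:6]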
#Average share of bicycle collisions that injured or killed the cyclist, over the 5 boroughs.
sum(city_cyc_col$Total_injured_killed[2:6]/city_cyc_col$Bicycle[2:6]/length(city_cyc_col[2:6]))
## [1] 0.8538941
Unlike motor vehicles, where a collision is relatively unlikely to cause injury, getting hit on a bicycle is very likely to result in injury, at 85.4%. Only one collision resulted in a death.
Finally we look at the proportion of collisions that involve pedestrians.
city_ped_col <- city_burr %>% select(GeoCode, PedestrInjured,PedestrKilled,Pop) %>% mutate(Ped_Total = PedestrInjured+PedestrKilled, Ped_per_pop = Ped_Total/Pop)
city_ped_col
## # A tibble: 6 x 6
## GeoCode PedestrInjured PedestrKilled Pop Ped_Total Ped_per_pop
## <fctr> <int> <int> <int> <int> <dbl>
## 1 C 705 7 8537673 712 8.339509e-05
## 2 M 139 3 1643734 142 8.638867e-05
## 3 B 131 0 1455720 131 8.998983e-05
## 4 K 241 1 2629150 242 9.204496e-05
## 5 Q 168 3 2333054 171 7.329449e-05
## 6 S 26 0 476015 26 5.462013e-05
city_ped_col$Ped_per_pop %>% barplot(col=c("black", "red", "orange", "yellow", "green", "blue"), names.arg=city_ped_col$GeoCode)
You have slightly less risk of getting hit as a pedestrian in Queens and Staten Island. As with cyclists, the per-capita risk is in the range of roughly 55 to 90 parts per million, well below the roughly 2-per-mille motor vehicle collision rate.
Now we will examine the vehicle types involved in collisions, by location.
nypd_type_loc <- "cityacc-en-us2.csv" %>% read.table(sep = ",", skip=4, nrows=898, quote = "\"") %>% tbl_df()
nypd_type_loc <- nypd_type_loc %>% rename(GeoCode=V1, GeoCodeLabel=V2, VehicleTypeCode=V3, VehicleTypeDescription=V4, Number_of_Collisions=V5)
tail(nypd_type_loc)
## # A tibble: 6 x 5
## GeoCode GeoCodeLabel VehicleTypeCode VehicleTypeDescription
## <fctr> <fctr> <fctr> <fctr>
## 1 123 123rd Precinct MCY MOTORCYCLE
## 2 123 123rd Precinct SEDN PASSENGER VEHICLE
## 3 123 123rd Precinct PICK PICK-UP TRUCK
## 4 123 123rd Precinct SCOM SMALL COM VEH(4 TIRES)
## 5 123 123rd Precinct SUBN SPORT UTILITY / STATION WAGON
## 6 123 123rd Precinct VAN VAN
## # ... with 1 more variables: Number_of_Collisions <int>
The vehicle codes and descriptions are redundant, so we can make a key and then drop the descriptions.
key <- nypd_type_loc %>% select(VehicleTypeCode,VehicleTypeDescription)
key <- key[1:15,]
key
## # A tibble: 15 x 2
## VehicleTypeCode VehicleTypeDescription
## <fctr> <fctr>
## 1 ATV ALL-TERRAIN VEHICLE
## 2 AMBU AMBULANCE
## 3 BIKE BICYCLE
## 4 BUS BUS
## 5 FIRE FIRE TRUCK
## 6 LCOM LARGE COM VEH(6 OR MORE TIRES)
## 7 MCY MOTORCYCLE
## 8 SEDN PASSENGER VEHICLE
## 9 PICK PICK-UP TRUCK
## 10 SCOM SMALL COM VEH(4 TIRES)
## 11 SUBN SPORT UTILITY / STATION WAGON
## 12 TAXI TAXI VEHICLE
## 13 VAN VAN
## 14 OTH OTHER
## 15 UNK UNKNOWN
nypd_type_loc <- nypd_type_loc %>% select(GeoCode, GeoCodeLabel, VehicleTypeCode, Number_of_Collisions)
head(nypd_type_loc)
## # A tibble: 6 x 4
## GeoCode GeoCodeLabel VehicleTypeCode Number_of_Collisions
## <fctr> <fctr> <fctr> <int>
## 1 C CITYWIDE ATV 1
## 2 C CITYWIDE AMBU 112
## 3 C CITYWIDE BIKE 606
## 4 C CITYWIDE BUS 464
## 5 C CITYWIDE FIRE 19
## 6 C CITYWIDE LCOM 1444
Now I want to make the vehicle codes the columns, so I can more easily analyze by vehicle type:
nypd_type_loc <- nypd_type_loc %>% spread(VehicleTypeCode,Number_of_Collisions)
#NA's in this case mean no collisions, so we can set them to 0.
nypd_type_loc[is.na(nypd_type_loc)] <- 0
nypd_type_loc
## # A tibble: 83 x 17
## GeoCode GeoCodeLabel AMBU ATV BIKE BUS FIRE LCOM
## * <fctr> <fctr> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 001 1st Precinct 0 0 8 18 0 43
## 2 005 5th Precinct 0 0 17 9 0 19
## 3 006 6th Precinct 1 0 8 5 3 18
## 4 007 7th Precinct 0 0 6 2 0 8
## 5 009 9th Precinct 4 0 8 1 0 14
## 6 010 10th Precinct 1 0 8 21 0 63
## 7 013 13th Precinct 1 0 16 9 0 21
## 8 014 Midtown South Precinct 4 0 10 26 0 57
## 9 017 17th Precinct 1 0 11 14 0 33
## 10 018 Midtown North Precinct 0 0 19 20 1 34
## # ... with 73 more rows, and 9 more variables: MCY <dbl>, OTH <dbl>,
## # PICK <dbl>, SCOM <dbl>, SEDN <int>, SUBN <dbl>, TAXI <dbl>, UNK <dbl>,
## # VAN <dbl>
# The GeoCode and GeoCodeLabel columns are also redundant; we'll keep the labels.
nypd_type_loc <- nypd_type_loc[,-1]
tail(nypd_type_loc)
## # A tibble: 6 x 16
## GeoCodeLabel AMBU ATV BIKE BUS FIRE LCOM MCY OTH PICK
## <fctr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 BRONX 27 0 68 57 3 267 66 1 206
## 2 CITYWIDE 112 1 606 464 19 1444 361 4 1249
## 3 BROOKLYN 42 1 235 135 6 386 108 0 343
## 4 MANHATTAN 28 0 183 180 6 433 76 3 284
## 5 QUEENS 13 0 109 82 3 321 96 0 342
## 6 STATEN ISLAND 2 0 11 10 1 37 15 0 74
## # ... with 6 more variables: SCOM <dbl>, SEDN <int>, SUBN <dbl>,
## # TAXI <dbl>, UNK <dbl>, VAN <dbl>
par(las=2)
barplot(nypd_type_loc$TAXI, names.arg = nypd_type_loc$GeoCodeLabel, cex.names = 0.2)
Now we have the data organized so that we can look at collisions by vehicle type and location. As an example I selected taxis. In NYC, police precincts with lower numbers tend to be in Manhattan, and we see far more taxi collisions in Manhattan than in any other borough.
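A sketch making the same point numerically:
#Sketch: rank rows by taxi collisions; the citywide and borough totals will top
#the list, followed mostly by Manhattan precincts.
nypd_type_loc %>% select(GeoCodeLabel, TAXI) %>% arrange(desc(TAXI)) %>% head(10)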