I will start by running Python in my R notebook. The main drawback of using Python within R is the lack of interactive feedback from the interpreter. This setup is ideal when there are specific tasks that the user prefers to do in Python.
x = 'hello, python world! from R notebook'
print(x)
hello, python world! from R notebook
I'm currently running Python 3.6, but a specific engine path can be set to point the notebook at a different interpreter.
import sys
print(sys.version)
3.6.3 |Anaconda, Inc.| (default, Oct 15 2017, 03:27:45) [MSC v.1900 64 bit (AMD64)]
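When several Python installations are present, it can help to confirm which interpreter the chunk actually ran. The lines below are a minimal sketch using only the standard library; the variable names are illustrative and not part of the original notebook.

```python
import sys

# Absolute path of the interpreter running this chunk; handy for checking
# that the notebook picked up the intended Python installation.
interpreter_path = sys.executable

# Major and minor version as integers, e.g. 3 and 6 for Python 3.6.
major, minor = sys.version_info[:2]

print(interpreter_path)
print("Python {}.{}".format(major, minor))
```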
I find web scraping in Python to be far superior to the methods available in R.
The Python code below requests a URL and extracts all paragraphs from the online article, then attempts to strip HTML tags and stray characters. The final output can feed sentiment analysis or text mining. I prefer to do the analysis with R packages and the extraction in Python.
from bs4 import BeautifulSoup
import requests
import re
url = "http://www.recode.net/2017/2/24/14727106/faceebook-mark-zuckerberg-manifesto-government-news-media-power"
#urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
response = requests.get(url,headers= {'User-Agent' : 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'})
soup = BeautifulSoup(response.content, 'html.parser')
information = soup.find_all('p', {'id': re.compile("^")})
# I force the results into a string in order to run regex and remove HTML tags etc.
info = str(information)
cleanData = re.sub(r"<.*?>","",info)
cleanData = re.sub(r"\\u\d*\D", " ", cleanData)
cleanData = re.sub(r"\s", " ", cleanData)
print(cleanData)
[A version of this essay was originally published at Tech.pinions, a website dedicated to informed opinions, insight and perspective on the tech industry., Last week, Mark Zuckerberg posted on Facebook a combination of a personal and company manifesto. He also spoke to a number of reporters regarding it. The manifesto is long, and it covers a ton of ground, some of it about the state of the world, but much of it, at least indirectly and directly, about Facebook and its role in such a world. The manifesto is notable for its concession that Facebook has enormous power and has, in some ways, contributed to some big problems plaguing the world. But, more worryingly, it seems to think the solution is more Facebook., There has been rising concern about Facebook's power over many facets of our lives for years now, and the concern is especially strong when it comes to news and media consumption, where Facebook is becoming an ever more important channel. Because Facebook's algorithms determine which things users could be shown, Facebook bears a primary responsibility for making decisions about the media world its users live in., Facebook's incentives are to show people the things they're most likely to enjoy, engage with and share with their friends. But the assumption is that this means showing them things that fit with their existing views, rather than challenging them. It means it often ends up creating so-called "filter bubbles" in which people are only ever exposed to media that confirms their existing views, and only rarely to contradictory views., Zuckerberg's manifesto acknowledges all of this, but proposes solutions that are focused on Facebook itself, rather than on weaning people off their reliance on Facebook. 
That's understandable - his job is to get people to use Facebook more rather than less but, of course, this approach merely reinforces Facebook's power and potentially even increases it as it takes a more active role in showing people a range of content. This is a theme that flows throughout the post, talking about all the things Facebook can do to take an even bigger and stronger role in the lives of its users., Nowhere is this more striking than when he starts talking about participation in the democratic process:, The second is establishing a new process for citizens worldwide to participate in collective decision-making. Our world is more connected than ever, and we face global problems that span national boundaries. As the largest global community, Facebook can explore examples of how community governance might work at scale., That, to me, sounds like Zuckerberg envisions a world in which Facebook itself becomes the medium through which communities (i.e., cities, states, countries) would govern themselves. Given existing concerns about Facebook's power to shape media consumption, the idea that it would take a direct role in governance (rather than merely allowing people to vote or connect with their elected representatives as it has done in the past) should be terrifying., It's arguable that even Facebook's "Get Out the Vote" efforts have potential to distort the democratic process, given that usage skews younger than the overall population. But at least it doesn't give Facebook a direct role in the democratic process itself. 
If I were a local government, I'd be extremely wary of allowing Facebook a deeper role in any of these processes - I think it's time for both individuals and organizations to push back against Facebook's enormous power rather than embracing an expansion of it., But this concern should go beyond just the democratic process and institutions - we should all be thinking about how much power we want Facebook to have over our lives. A line that was removed from the manifesto between when a draft was sent out to reporters and when the final version was published on Facebook hints at some other dangers. That line concerned the use of AI to detect terrorism:, The long term promise of AI is that in addition to identifying risks more quickly and accurately than would have already happened, it may also identify risks that nobody would have flagged at all - including terrorists planning attacks using private channels, people bullying someone too afraid to report it themselves, and other issues both local and global. It will take many years to develop these systems., On the face of it, this seems great - Facebook would be helping to identify those who would hurt others while they're still in the planning stages. But it refers to terrorists using private channels, which implies Facebook looking into the contents of private messages shared between users on Facebook's various platforms. This is yet another area where Facebook's power is already considerable - not only does it control much of our media consumption, but it also hosts and carries much of our communication via four huge platforms: Facebook itself, Messenger, WhatsApp and Instagram., Facebook's instincts here are understandable, but also worrying. 
It finally recognizes its power and the ways in which that power has caused problems in the world, but its instinct is to wield that power even more, rather than back off. Given that Facebook seems unlikely to police itself, it's up to its users and other organizations to start to exert pressure for it to do so., Jan Dawson is founder and chief analyst at Jackdaw, a technology research and consulting firm focused on the confluence of consumer devices, software, services and connectivity. During his 13 years as a technology analyst, Dawson has covered everything from DSL to LTE, and from policy and regulation to smartphones and tablets. Prior to founding Jackdaw, Dawson worked at Ovum for a number of years, most recently as chief telecoms analyst, responsible for Ovum's telecoms research agenda globally. Reach him @jandawson.]
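Because the list of tags is cast to a single string, the printed result keeps the list brackets and the ", " separators between paragraphs (for example "industry., Last week"). One possible alternative is to collect the text of each paragraph separately. The sketch below uses the standard library's html.parser rather than BeautifulSoup so that it runs without extra packages; the ParagraphExtractor class, the extract_paragraphs helper and the sample HTML are all hypothetical and not part of the original code.

```python
from html.parser import HTMLParser
import re


class ParagraphExtractor(HTMLParser):
    """Collect the visible text of every <p> element (illustrative helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = []
        self._inside_p = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._inside_p = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p" and self._inside_p:
            self._inside_p = False
            # Collapse runs of whitespace, as the original regex step does.
            text = re.sub(r"\s+", " ", "".join(self._buffer)).strip()
            if text:
                self.paragraphs.append(text)

    def handle_data(self, data):
        # Text nested in inline tags such as <b> is still inside the <p>.
        if self._inside_p:
            self._buffer.append(data)


def extract_paragraphs(html):
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


# Made-up HTML standing in for response.content.
sample_html = """
<html><body>
<p id="a">Facebook&rsquo;s power is <b>considerable</b>.</p>
<div>not a paragraph</div>
<p id="b">It hosts   four
platforms.</p>
</body></html>
"""

paragraphs = extract_paragraphs(sample_html)
for paragraph in paragraphs:
    print(paragraph)
```

Keeping one string per paragraph also makes it easier to hand the text to a sentence- or document-level analysis later on.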
Feather is a file format, with libraries for both Python and R, that lets data frames move between environments. In this instance, it is used to write the pandas data frame to the project folder and then load the Feather file into the R environment.
import pandas as pd
import feather
#pip install feather-format
url_1 = 'https://raw.githubusercontent.com/chrisestevez/DataAnalytics/master/Data/MedicareOpioidPrescriber2014Reduced.csv'
data = pd.read_csv(url_1, names = ['ZIP', 'State', 'Prov_Specialty','Total_Claims','Opioid_Claims','Pres_Rate'], skiprows=1,converters={'ZIP': lambda x: str(x)})
feather.write_dataframe(data, "Opioid_data")
print(data.head(10))
     ZIP  State                        Prov_Specialty  Total_Claims  Opioid_Claims  Pres_Rate
0  48183     MI                    Emergency Medicine           124             33     0.2661
1  96819     HI                               Urology           936             32     0.0342
2  49546     MI                         Ophthalmology           590             11     0.0186
3  95376     CA                               Dentist            46             11     0.2391
4  62220     IL                       General Surgery            85             45     0.5294
5  95116     CA  Physical Medicine and Rehabilitation           472             46     0.0975
6  55431     MN                    Orthopedic Surgery           142            116     0.8169
7  94599     CA                       Family Practice          2321            104     0.0448
8  62901     IL                    Emergency Medicine           199             61     0.3065
9  53098     WI                 Obstetrics/Gynecology           152             12     0.0789
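The converter on ZIP in the read_csv call keeps that column as text, presumably so that ZIP codes beginning with zero are not truncated by numeric parsing. The sketch below illustrates the difference with the standard library's csv module instead of pandas. The first data row is made up for illustration; the second reuses the first row of the output above. The Pres_Rate calculation assumes the rate is Opioid_Claims divided by Total_Claims, which is consistent with the printed values (33 / 124 rounds to 0.2661).

```python
import csv
import io

# Inline CSV standing in for the downloaded file (reduced column set).
# The 02134 row is hypothetical; the 48183 row matches the printed output.
raw = (
    "ZIP,State,Total_Claims,Opioid_Claims\n"
    "02134,MA,200,50\n"
    "48183,MI,124,33\n"
)

rows = []
for record in csv.DictReader(io.StringIO(raw)):
    total = int(record["Total_Claims"])
    opioid = int(record["Opioid_Claims"])
    rows.append({
        "ZIP": record["ZIP"],              # kept as text, leading zero intact
        "ZIP_as_int": int(record["ZIP"]),  # numeric parsing drops the zero
        "State": record["State"],
        "Pres_Rate": round(opioid / total, 4),  # assumed definition of the rate
    })

for row in rows:
    print(row)
```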
After the Feather file has been written, I will read it into my R environment and display the first ten rows.
if (!require("feather")) install.packages("feather")
library("feather")
Py_opioid_data = read_feather("Opioid_data")
head(Py_opioid_data,10)
The code below installs a specific version of the DT package by passing the URL of the archived source tarball to install.packages. Another option to ensure consistency is to use packrat to isolate the packages used in the analysis.
pack_URL= "https://cran.r-project.org/src/contrib/Archive/DT/DT_0.2.tar.gz"
if (!require("DT")) install.packages(pack_URL, repos=NULL, type='source')
library("DT")
The DT package allows for the creation of HTML tables with many options, such as sharing and formatting of the underlying data. Should the report require output in Word or PDF format, I suggest using kableExtra, which allows for the creation of LaTeX-formatted tables.
reduced_df =head(Py_opioid_data,100)
datatable(reduced_df, filter = 'top', options = list(
pageLength = 15, autoWidth = TRUE,columnDefs = list(list(className = 'dt-center', targets="_all"))
), rownames = FALSE) %>%
formatPercentage('Pres_Rate', 2)
Using mlbench and caret, the code below fits two classifiers to predict diabetes on the PimaIndiansDiabetes dataset: an SVM with a radial kernel and a GLM (logistic regression). set.seed is called before each fit so that both models are trained on the same resampling folds. Please note that the example below does not demonstrate the complete process of fitting a logistic regression model.
if (!require("mlbench")) install.packages("mlbench")
if (!require("caret")) install.packages("caret")
library("caret")
library("mlbench")
data(PimaIndiansDiabetes)  # loads the data frame into the workspace; data() itself returns only the dataset name
control = trainControl(method="repeatedcv", number=10, repeats=3,savePredictions = TRUE)
set.seed(7)
modelSvm = train(diabetes~., data=PimaIndiansDiabetes, method="svmRadial", trControl=control)
set.seed(7)
modelglm=train(diabetes~., data=PimaIndiansDiabetes, method="glm", trControl=control)
# combination of both results
results = resamples(list( SVM=modelSvm,GLM=modelglm))
summary(results)
Call:
summary.resamples(object = results)
Models: SVM, GLM
Number of resamples: 30
Accuracy
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
SVM 0.6974 0.7305 0.7662 0.7665 0.7922 0.8442 0
GLM 0.7143 0.7435 0.7778 0.7752 0.8000 0.8442 0
Kappa
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
SVM 0.2517 0.3670 0.4590 0.4500 0.5211 0.6457 0
GLM 0.3267 0.4104 0.4878 0.4787 0.5488 0.6457 0
bwplot(results)
dotplot(results)
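The "Number of resamples: 30" line in the summary follows from the resampling settings: 10 folds repeated 3 times. The Python sketch below illustrates that bookkeeping and why reusing a seed gives both models identical folds. It is a simplified illustration, not caret's actual fold-creation routine (caret stratifies folds by the outcome, which is not done here); the repeated_kfold function is hypothetical, and 768 is the number of rows in the mlbench dataset.

```python
import random


def repeated_kfold(n_rows, k=10, repeats=3, seed=7):
    """Return a list of (train_indices, test_indices), one pair per resample."""
    rng = random.Random(seed)  # private generator, so the seed fully fixes the folds
    resamples = []
    for _ in range(repeats):
        indices = list(range(n_rows))
        rng.shuffle(indices)
        # Deal the shuffled rows into k folds of nearly equal size.
        folds = [indices[i::k] for i in range(k)]
        for held_out in folds:
            held = set(held_out)
            train = [i for i in indices if i not in held]
            resamples.append((train, held_out))
    return resamples


# Same seed before each "model" gives the same 30 train/test splits.
splits_for_svm = repeated_kfold(768)
splits_for_glm = repeated_kfold(768)

print(len(splits_for_svm))
print(splits_for_svm == splits_for_glm)
```

Because both models see the same splits, the paired comparison in the resamples summary reflects differences between the models rather than differences between folds.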
Next, I predict on the PimaIndiansDiabetes dataset with the GLM model. The DT package makes it simple to share data in various formats by adding export buttons to the HTML tables. This allows my predicted results to be quickly shared and compared against other models for benchmarking.
PimaIndiansDiabetes$MyModelPred = predict(modelglm,PimaIndiansDiabetes)
datatable(
PimaIndiansDiabetes, extensions = 'Buttons', options = list(
dom = 'Bfrtip',
buttons = c('csv', 'pdf'),columnDefs = list(list(className = 'dt-center', targets="_all"))
), rownames = FALSE)
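Once the predictions are exported, a benchmark needs a way to score them against the actual labels. The sketch below computes the two metrics reported in the resamples summary, accuracy and Cohen's kappa, in plain Python. The accuracy_and_kappa helper is hypothetical and the short label vectors are made up for illustration; they are not taken from the dataset.

```python
from collections import Counter


def accuracy_and_kappa(actual, predicted):
    """Return (accuracy, Cohen's kappa) for two equal-length label sequences."""
    n = len(actual)
    # Observed agreement: share of rows where prediction matches the label.
    observed = sum(a == p for a, p in zip(actual, predicted)) / n
    # Agreement expected by chance, from the marginal class frequencies.
    actual_counts = Counter(actual)
    predicted_counts = Counter(predicted)
    expected = sum(
        actual_counts[label] * predicted_counts[label] for label in actual_counts
    ) / (n * n)
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


# Made-up labels standing in for the diabetes and MyModelPred columns.
actual = ["pos", "pos", "neg", "neg", "neg", "pos", "neg", "neg"]
predicted = ["pos", "neg", "neg", "neg", "pos", "pos", "neg", "neg"]

accuracy, kappa = accuracy_and_kappa(actual, predicted)
print(round(accuracy, 4), round(kappa, 4))
```

Note that scoring predictions on the same rows used for training, as the notebook does here, tends to give optimistic numbers compared with the cross-validated summary.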
R notebooks allow LaTeX to be blended with text, which helps when writing out complicated mathematical expressions.
Smith is in jail and has 1 dollar; he can get out on bail if he has 8 dollars. A guard agrees to make a series of bets with him. If Smith bets A dollars, he wins A dollars with probability .4 and loses A dollars with probability .6. Find the probability that he wins 8 dollars before losing all of his money if
he bets 1 dollar each time (timid strategy).
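Here p = .4 is the probability of winning a bet, q = .6 is the probability of losing, z = 1 is the starting stake, and M = 8 is the target.
\[q_z = \frac{(q/p)^z-1}{(q/p)^M-1}\]
\[q_z = \frac{(.6/.4)^1-1}{(.6/.4)^8-1} \] \[q_z = \frac{.5}{24.63} \approx .0203 \approx 2.03 \% \]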
((.6/.4)^1-1) / ((.6/.4)^8-1)
[1] 0.02030135
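The same calculation can be written as a small function, which makes it easy to try other stakes or targets. This is a minimal Python sketch of the formula above; the function name win_probability is illustrative, and the formula only applies when p differs from 0.5 (otherwise the ratio q/p is 1 and the denominator is zero).

```python
def win_probability(z, M, p):
    """Probability of reaching M dollars before 0, starting with z dollars
    and betting 1 dollar at a time, with win probability p (p != 0.5)."""
    q = 1 - p
    ratio = q / p
    return (ratio ** z - 1) / (ratio ** M - 1)


probability = win_probability(z=1, M=8, p=0.4)
print(round(probability, 8))  # 0.02030135, matching the R result above
```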
Session info prints the version information of R, the OS, and attached or loaded packages.
sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] caret_6.0-78 ggplot2_2.2.1 lattice_0.20-33 mlbench_2.1-1
[5] DT_0.2 feather_0.3.1
loaded via a namespace (and not attached):
[1] ddalpha_1.3.1.1 tidyr_0.7.1 sfsmisc_1.1-2
[4] jsonlite_1.0 splines_3.3.1 foreach_1.4.3
[7] prodlim_1.5.7 assertthat_0.1 stats4_3.3.1
[10] DRR_0.0.3 yaml_2.1.14 robustbase_0.92-8
[13] ipred_0.9-5 backports_1.0.5 glue_1.1.1
[16] digest_0.6.10 colorspace_1.2-6 recipes_0.1.2
[19] htmltools_0.3.5 Matrix_1.2-6 plyr_1.8.4
[22] psych_1.7.8 timeDate_3042.101 pkgconfig_2.0.1
[25] CVST_0.2-1 broom_0.4.3 purrr_0.2.3
[28] scales_0.4.1 gower_0.1.2 lava_1.4.5
[31] tibble_1.3.4 withr_2.1.1 nnet_7.3-12
[34] lazyeval_0.2.0 mnormt_1.5-5 survival_2.39-4
[37] magrittr_1.5 evaluate_0.10.1 nlme_3.1-128
[40] MASS_7.3-45 dimRed_0.1.0 foreign_0.8-69
[43] class_7.3-14 rsconnect_0.7 tools_3.3.1
[46] hms_0.4.2 stringr_1.2.0 kernlab_0.9-25
[49] munsell_0.4.3 bindrcpp_0.2 compiler_3.3.1
[52] e1071_1.6-7 RcppRoll_0.2.2 rlang_0.1.2
[55] grid_3.3.1 iterators_1.0.8 htmlwidgets_0.9
[58] base64enc_0.1-3 rmarkdown_1.6 gtable_0.2.0
[61] ModelMetrics_1.1.0 codetools_0.2-14 reshape2_1.4.1
[64] R6_2.1.3 lubridate_1.6.0 knitr_1.17
[67] dplyr_0.7.2 bindr_0.1 rprojroot_1.2
[70] stringi_1.1.1 parallel_3.3.1 Rcpp_0.12.12
[73] rpart_4.1-10 DEoptimR_1.0-8 tidyselect_0.2.0
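Since part of this notebook runs in Python, it may be worth recording the Python side of the environment as well. The sketch below is a rough counterpart to sessionInfo() built on the standard library; the session_info function is hypothetical, and it assumes Python 3.8 or later for importlib.metadata, which is newer than the 3.6 interpreter shown earlier. The second package name is deliberately fictitious to show how a missing package is reported.

```python
import platform
from importlib import metadata


def session_info(packages=()):
    """Collect interpreter, OS and package versions into a dictionary."""
    info = {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": {},
    }
    for name in packages:
        try:
            info["packages"][name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            info["packages"][name] = None  # not installed in this environment
    return info


info = session_info(["pandas", "definitely-not-installed-xyz"])
for key, value in info.items():
    print(key, value)
```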
Articles:
https://fivethirtyeight.com/features/as-a-major-retraction-shows-were-all-vulnerable-to-faked-data/
http://web.stanford.edu/~dbroock/broockman_kalla_aronow_lg_irregularities.pdf
https://simpleprogrammer.com/what-makes-code-readable-not-what-you-think/
http://onsnetwork.org/chartgerink/2017/03/30/reproducible-manuscripts-are-the-future/
Sources:
https://rmarkdown.rstudio.com/authoring_knitr_engines.html
https://github.com/wesm/feather/tree/master/python
https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example
https://xiangxing98.github.io/R_Learning/R_Reproducible.nb.html
https://rpubs.com/marschmi/105639
http://kbroman.org/knitr_knutshell/pages/reproducible.html
https://www.r-bloggers.com/reproducible-research-training-wheels-and-knitr/