Introductory Statistics (CRN: 6896)



Objective

Today we learn about the R language and RStudio. We familiarize ourselves with this environment and perform some basic operations.

 

What is R?

R is a programming language that can perform statistical analyses. It is open source, which means that it can be modified and improved by anyone. Of course, any modification has to go through multiple rounds of reviews and only those deemed necessary will be accepted. You can learn about it here. But you rarely need to change R itself. Instead, you are likely to use packages, which are essentially toolboxes that facilitate certain operations. If you analyse a certain type of data and re-use similar functions and data formats, you can place them in a package and make it available publicly. Then other people who find your package useful can contribute to your package as well. On top of these benefits, R is free. You can download it and use it anywhere and anytime.
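
To make the idea of a package concrete, here is a minimal sketch. It loads `tools`, a package that ships with every R installation (chosen here only for illustration; it is not otherwise used in this lab), and calls one of its functions. Packages that do not ship with R are installed once with `install.packages()` and then loaded the same way; the package name in the last comment is a placeholder, not a real package.

```r
# Load a package that ships with R (no download needed).
library(tools)

# file_ext() comes from the tools package: it returns a file's extension.
ext <- file_ext("Film_Permits.csv")
print(ext)

# Packages from CRAN must be installed once before library() can load them.
# "somePackage" is a placeholder name used only for illustration:
# install.packages("somePackage")
```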


What is RStudio?

RStudio is a development environment. Just as you can use the English language in MS Word or Google Docs, you can use the R language in different environments. But you probably type your letters, papers or essays in a Word or Google doc because they provide you with tools that simplify your task, and allow you to do more than just type text. Similarly, RStudio is software that gives us useful tools when we are using R. It’s also free!


Getting started

Throughout the course, you will learn about many of RStudio’s tools and features. But one impressive feature that you’ll learn about in this very first step is that RStudio can be run in the cloud… wow… What does this mean? It means we don’t have to go to https://cran.r-project.org/ to download the R language, and we don’t have to go to https://rstudio.com to download the software and install it on our system. Instead, we simply go to https://rstudio.cloud/ and open an account. Then we can run RStudio in our browser (e.g. Safari, Chrome, Firefox, etc.).

So go to https://rstudio.cloud/ and see this:

Then click on Get started and open an account. You’ll see something like this:

And there you have it: your own workspace. As it says in the right corner,

This is your personal workspace, where you can create a virtually unlimited number of projects.

On the left side you have a few useful things:

The first one is your default workspace. You can create new workspaces to organize your projects. But what is a project? A project bundles all of the data files, R code, packages and whatnot that belong to one piece of work into a single unit.

Not only is this super helpful with organizing multiple projects, it allows you to share your projects with your team members. You can do this by clicking on Members where you’ll see this:

You can invite people via email, and set their access levels:

So here is what each of the options can do:

  • Admin: can manage membership and can view, edit and manage all projects in the space.
  • Moderator: can view, edit and manage all projects in the space.
  • Contributor: can create, edit and manage their own projects. This is the default.
  • Viewer: can view projects shared with everyone in the space.

Ok, so now let’s create a new project by, guess what, clicking on the New Project button!

Here is an incredible feature: over time, you become used to a setting that you like to start your projects with, meaning you like certain packages and files to be present for new projects in a workspace. It can be excruciatingly painful to have to load these packages each time (not really but still). So what can we do in the face of such an ordeal?

RStudio has a solution. Once you create a project, load your packages and files, but before you write any code, click on the cog icon on the upper right side of the screen. This will take you to the space Settings. Here you can set this empty but pre-loaded project as the base project (see below). Now, every time you create a new project in this space, it starts from the base project.


Rmarkdown

Ok, so now that we are familiar with the rstudio.cloud environment, let’s use it for something.

Let’s first talk a bit about statistical analysis, and creating reports in R. Oftentimes, your job is to analyze a dataset, answer a few questions, and write a report where you talk about what you found. Reports can’t be just a bunch of numbers and plots. You need to describe things in lay terms. Other people should be able to read your report and make sense of it. Doing this is not easy, but as with anything else, you get better over time.

Fortunately, we have RStudio to help us. The document you’re reading was generated in RStudio. This document is in a format called R Markdown. This format allows you to create reports that contain the code you wrote for your analysis, the results of the analysis, and the narrative that explains what you did. This could come in the form of introduction, findings, comments, strategy, etc. All of this will be in one place. A huge benefit of this is that it allows you to analyse and write at the same time.

There is a ton of material on Rmarkdown. You can see this one for reference: (https://bookdown.org/yihui/rmarkdown/). You can also use cheatsheets that show you how to create proper formatting. Rstudio.cloud has one here.

So a report in Rmarkdown has three types of content:

  • text
  • verbatim code
  • results

Text is obvious. You’re reading it now. You can make it look italic, bold, or strikethrough. You can add equations in TeX that look nice: \(A = \pi*r^{2}\). Pretty much anything you can do in a text editor like MS Word or Google Docs is doable here.
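
The same formula can also be evaluated in R rather than just typeset. A minimal sketch, assuming a radius of 2 picked arbitrarily for illustration:

```r
# Area of a circle, A = pi * r^2, for an example radius (arbitrary choice).
r <- 2
A <- pi * r^2
print(A)
```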

The code you use in your analysis can also be part of your report. Look at the cute single-line code below:

plot(cars)

It is in a different format, in a box, and you might be able to hide it. You might say, well, why not just write code in text form? The reason is that the code you write is executable. It can be run to produce results. You decide whether to show the code, its output, or both in your report.

If we execute the code above, and show the results in the report but not the code, it’ll look like this:
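
Whether a chunk’s code and output appear in the report is controlled by chunk options in the R Markdown source. As a rough sketch (the chunk header is written as a comment here because it lives in the .Rmd file, not in the R code itself), setting `echo=FALSE` hides the code and keeps the output:

```r
# In the .Rmd source, this chunk would start with the header
#   {r, echo=FALSE}
# echo=FALSE hides the code but still shows its output (here, the plot).
# Other common knitr chunk options:
#   eval=FALSE     show the code but do not run it
#   include=FALSE  run the code but show neither code nor output
plot(cars)
```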

Some tips:

  • You can run a chunk by clicking the Run button within the chunk or by placing your cursor inside it and pressing Cmd+Shift+Enter.
  • When you execute code within the notebook, the results appear beneath the code.
  • Add a new chunk by clicking the Insert Chunk button on the toolbar or by pressing Cmd+Option+I.
  • When you save the notebook, an HTML file containing the code and output will be saved alongside it (click the Preview button or press Cmd+Shift+K to preview the HTML file).
  • The preview shows you a rendered HTML copy of the contents of the editor. Consequently, unlike Knit, Preview does not run any R code chunks. Instead, the output of the chunk when it was last run in the editor is displayed.


Lab Exercise

Let’s load a dataset and run some commands. There is a dataset in the Lab 1 folder called Film_Permits.csv.

We can load it using the code below:

# Read the CSV into a data frame; the first row holds the column names
films <- read.csv("../Lab 1/Film_Permits.csv", header = TRUE, sep = ",")

Note that there is no output. How do I know I opened the data file then? Well, on the top right window, under the Environment tab, there is a new line that says, films and 50728 obs. of 14 variables.
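
You can also confirm the load from code instead of the Environment tab. The sketch below is self-contained: it writes a tiny made-up CSV to a temporary file (the `id` column and all values are invented for illustration and are not the real Film_Permits.csv contents), reads it back with read.csv, and checks the result.

```r
# Write a small, made-up CSV to a temporary file (illustrative data only).
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,Borough",
             "1,Brooklyn",
             "2,Queens",
             "3,Brooklyn"), tmp)

# Read it the same way as in the lab.
toy <- read.csv(tmp, header = TRUE, sep = ",")

# Confirm that the object exists and is a data frame, and list its columns.
print(exists("toy"))
print(is.data.frame(toy))
print(names(toy))
```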

How about we take a sneak peek at what the dataset looks like? Run the code below:

View(films)

Whoa, see what happened? You should see something like this:

This is what the dataset looks like in memory. Spend some time getting familiar with each variable. What do you see? What do you think each variable means? How is it measured?

There are other ways to look at our dataset. For example, run the command below:

str(films)

Note that we can see the code’s output in the text. Have a look at what changed. I made a small change and now the output is showing in the text. Can you spot the difference? We’ll get back to this later.

Ok, how about a summary?

summary(films)

summary is a very useful command. It tells you what each column in your dataset holds and in what format: numeric columns get their minimum, quartiles, mean and maximum, and categorical (factor) columns get the number of times each value occurs.
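
What summary reports depends on the type of each column. A minimal sketch on a made-up data frame (the `Days` column and all values are invented for illustration):

```r
# Made-up data for illustration only.
toy <- data.frame(
  Borough = factor(c("Brooklyn", "Queens", "Brooklyn", "Bronx")),
  Days    = c(1, 3, 2, 6)
)

# Numeric column: minimum, quartiles, median, mean, maximum.
days_summary <- summary(toy$Days)
print(days_summary)

# Factor column: how many times each level occurs.
borough_summary <- summary(toy$Borough)
print(borough_summary)
```

Note that a plain character column (not a factor) is summarised only by its length and class, so the counts appear only when the column is a factor.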

One other way to check your dataset is to look at the first few rows. Run this:

head(films)

The head command shows you the first six rows by default. You can ask for more. Run this:

head(films, 10)
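
The second argument of head sets how many rows you get back. A small sketch on a made-up 20-row data frame (the data are invented for illustration; `tail` is not used elsewhere in this lab and is shown only as the counterpart of `head`):

```r
# Made-up data frame with 20 rows (illustration only).
toy <- data.frame(id = 1:20, value = (1:20)^2)

first_rows <- head(toy)      # default number of rows
first_ten  <- head(toy, 10)  # first 10 rows
last_three <- tail(toy, 3)   # tail() returns rows from the end instead

print(nrow(first_rows))
print(nrow(first_ten))
print(last_three)
```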

Now, let’s run the commands below, and write down what each of them does:

dim(films)
nrow(films)
ncol(films)
print(films$Borough)
table(films$Borough)
