Chapter 1 Project - Intro to R & RStudio

Instructions

Update the author line at the top to have your name in it.
You must knit this document to an html file and publish it to RPubs. Once you have published your project to the web, you must submit the web url link into the appropriate Course Project assignment in Blackboard before 11:59pm on the due date.
Answer all the following questions completely. Some may ask for written responses.
Use R chunks for code to be evaluated where needed and always comment all of your code so the reader can understand what your code aims to accomplish.
Proofread your knitted document before publishing it to ensure it looks the way you want it to. Tip: Use double spaces at the end of a line to create a line break and make sure normal text does not appear as a header.

Purpose

This first project will introduce you to creating RMarkdown files and doing basic data manipulation with R and RStudio.

Question 1

In a single r chunk, perform two operations: first add the numbers 9 and 23, then divide the number 42 by the sum of 83 and 101. State both answers in complete sentences beneath your r chunk, with your second answer rounded to three decimal places.

#This code adds 9 plus 23.
9 + 23

## [1] 32

#This code divides 42 by the sum of 83 and 101.
42/(83 + 101)

## [1] 0.2282609

The sum of 9 and 23 is 32.

The number 42 divided by the sum of 83 and 101 is 0.228 when rounded to three decimal places.
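The rounding can also be done in code rather than by hand. This is a minimal sketch using base R's round() function; the object names `quotient` and `rounded_quotient` are illustrative and not part of the assignment.

```r
# Divide 42 by the sum of 83 and 101, then round to three decimal places.
quotient <- 42 / (83 + 101)
rounded_quotient <- round(quotient, digits = 3)

# Print the rounded value.
rounded_quotient
```

Rounding in code keeps the reported number tied to the calculation, so it updates automatically if the inputs change.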

Question 2

In an r chunk below, store the dataframe mtcars into an object called mtcars and use the head() function to look at the first 6 rows of that object.

#This code saves the dataframe 'mtcars' into our environment as an object called "mtcars". 
mtcars <- mtcars

#This code allows us to look at the first 6 rows of "mtcars".
head(mtcars, n = 6)

##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Use the ?mtcars command in the RStudio Console to pull up the help documentation to find out what the qsec variable describes in the mtcars dataset, then write your answer in a complete sentence. (No r chunk is required here).

The ‘qsec’ variable of the mtcars data set describes the time, in seconds, that it takes for the car to travel a distance of 1/4 mile.

Question 3

Using a package in R requires two functions: install.packages() to install it and library() to load it. Type install.packages("openintro") into the RStudio Console to install that package onto your computer. The install.packages() function requires a character input, so quotation marks (" ") are needed around the input.

The library() function takes in an object name, so quotation marks are not needed around the package name.

Use an r chunk below to load the library openintro, which contains some data sets we will use in subsequent problems.

#This code loads the "openintro" package so that we can use the data sets within it.

library(openintro)

## Loading required package: airports

## Loading required package: cherryblossom

## Loading required package: usdata

With the openintro library loaded, you can now access some datasets from it. In another r chunk, find the dimensions of the email data set by using the dim() function on it.

#This code allows us to find the dimensions of the 'email' data set. 
dim(email)

## [1] 3921   21

The dimensions of the email data set are 3921 x 21.

This means that there are 3,921 observations/rows and 21 variables/columns.
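The two numbers returned by dim() can also be retrieved one at a time. The sketch below uses the built-in mtcars data frame as a stand-in so that it runs without the openintro package; the object names are illustrative.

```r
# dim() returns c(rows, columns); nrow() and ncol() return each piece separately.
# mtcars (built into R) stands in for email so the sketch runs without extra packages.
dims <- dim(mtcars)
n_rows <- nrow(mtcars)
n_cols <- ncol(mtcars)

# Print the dimensions: rows first, then columns.
dims
```

With openintro loaded, the same calls would work on email in place of mtcars.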

Question 4

The $ sign is used to specify a certain variable of a dataframe. For example, the following will give me the sum of the cylinders variable in the mtcars dataframe.

# Sum of the cyl variable in mtcars dataframe
sum(mtcars$cyl)

## [1] 198

In the email dataframe, what is the sum of the values in the line_breaks variable? State your answer in a complete sentence beneath the chunk.

#This code allows us to find the sum of the values of the line_breaks variable. 
sum(email$line_breaks)

## [1] 904412

The sum of the values of the line_breaks variable in the email data set is 904,412.
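The $ sign is one of several equivalent ways to pull a single variable out of a dataframe. As a small sketch, the three calls below all extract the cyl variable from the built-in mtcars dataframe and give the same sum; the object names are illustrative.

```r
# Three equivalent ways to extract the cyl column from mtcars and sum it.
sum_dollar  <- sum(mtcars$cyl)        # by name with $
sum_bracket <- sum(mtcars[["cyl"]])   # by name with [[ ]]
sum_index   <- sum(mtcars[, "cyl"])   # by column name with [row, column] indexing

# All three results match.
c(sum_dollar, sum_bracket, sum_index)
```

The [[ ]] form is handy when the variable name is stored in a character string rather than typed directly.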

Question 5

In the email dataset, the first variable called spam has a value of 0 if the email was not spam and a value of 1 if the email was spam. You can use the table() function on the spam variable to see how many emails fell into the spam and non-spam categories.

Using an r chunk, calculate what percent of the emails in this dataframe were considered spam. Then state your answer beneath the chunk in a complete sentence, rounded to the nearest tenth of a percent.

#This code allows us to see how many emails were spam and how many were non-spam. 
table(email$spam)

## 
##    0    1 
## 3554  367

#This code allows us to calculate what percentage of emails were spam in the 'email' data set. 
(367/(3554 + 367)) * 100

## [1] 9.359857

In the email data set, it was found that 3,554 emails were not spam and 367 were spam.

This means that 9.4% of the emails in the ‘email’ data set were considered spam.
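The percentage can also be computed without typing the counts by hand. The sketch below rebuilds a stand-in vector from the counts reported by table() so that it runs without the openintro package; `spam_flag`, `percent_spam`, and `spam_share` are illustrative names.

```r
# Stand-in vector rebuilt from the counts in the table above:
# 3554 zeros (not spam) and 367 ones (spam). With openintro loaded,
# email$spam could be used in its place.
spam_flag <- c(rep(0, 3554), rep(1, 367))

# The mean of a logical vector is the proportion of TRUE values.
percent_spam <- mean(spam_flag == 1) * 100
round(percent_spam, 1)

# prop.table() converts the counts from table() into proportions.
spam_share <- prop.table(table(spam_flag)) * 100
spam_share
```

Computing the percentage from the data directly avoids transcription mistakes when copying counts out of the table() output.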

Question 6

We can use the subset() function to get subsets of a dataframe based on a given characteristic. For example, the code below will store a new dataframe that only contains the observations in which the word “dollar” or a $ sign was found in the email.

# Storing emails with "dollar" in them to an object called money
money <- subset(email, email$dollar > 0)
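Inside subset(), the variables of the dataframe can be referenced by their bare names, so subset(email, dollar > 0) should behave the same as the call above. The sketch below shows the same pattern on the built-in mtcars dataframe so that it runs without openintro; `four_cyl` is an illustrative name.

```r
# Keep only the rows of mtcars where the cyl variable equals 4.
# The column is referenced by its bare name, without the mtcars$ prefix.
four_cyl <- subset(mtcars, cyl == 4)

# Count how many cars remain in the subset.
nrow(four_cyl)
```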

Use an r chunk to find the percent of emails in the money object that are spam.

#This code allows us to see how many emails that mention the word "dollar" or a dollar sign were spam or not. 
table(money$spam)

## 
##   0   1 
## 668  78

#This code allows us to calculate the percentage of how many "money" emails are spam. 
78/(668 + 78) * 100

## [1] 10.45576

Of all the emails that mentioned the word “dollar” or “$” from the email data set, 78 were spam and 668 were not spam.

This means that 10.5% of the emails mentioning money were spam emails.

Comparing your answers from Questions 5 and 6, do dollar signs or the word “dollar” seem to be a key characteristic of emails that are spam? Explain why or why not.

About 9.4% of all emails in the data set were spam. Among the emails that mentioned money, 10.5% were spam, which is only about one percentage point higher than the overall rate. If the mention of money were a key characteristic of spam emails, we would expect a much higher percentage. Instead, the large majority of emails that mention “dollar” or a dollar sign are not spam. Therefore, the mention of “dollar” or “$” in emails does not seem to be a key characteristic of spam emails.
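The comparison can be made explicit by placing the two percentages side by side. This sketch reuses the counts reported in the table() outputs above; the object names are illustrative.

```r
# Percent of all emails that were spam (counts from the first table).
overall_pct <- 367 / (3554 + 367) * 100

# Percent of "money" emails that were spam (counts from the second table).
money_pct <- 78 / (668 + 78) * 100

# Difference between the two rates, in percentage points.
pct_point_gap <- money_pct - overall_pct
round(pct_point_gap, 1)
```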

Question 7

Type View(airquality) into your RStudio console (not in an r chunk) and you will see a tab open up with the dataset in it for viewing. The dataframe gives air quality measurements in NYC over a 153-day period in the year 1973. In the chunk below, I have stored the airquality dataframe into an object called airquality and returned the names of the variables.

# Store airquality dataframe in object called airquality
airquality <- airquality

# Find the names of the variables in the dataframe
names(airquality)

## [1] "Ozone"   "Solar.R" "Wind"    "Temp"    "Month"   "Day"

Type ?airquality into the RStudio console to look at the documentation for the dataset and find out what the variable names in the dataframe represent.
The chunk below tries to find the sum of the solar radiation in langleys (a unit of solar radiation), but there is a problem: it returns NA.

# Find sum of Solar.R variable
sum(airquality$Solar.R)

## [1] NA

If you type View(airquality) into your RStudio console again, you can see that there are some values in the Solar.R variable, but also some NA values, which the sum function cannot handle.

Type ?sum into your RStudio console and look at the help documentation for that function. You’ll see sum(..., na.rm = FALSE) in the Usage section. The na.rm argument stands for “remove NAs”, and it defaults to FALSE, meaning the NAs are not removed. If we want the sum function to ignore the NAs in a vector, we just need to set na.rm = TRUE as an additional argument inside the function.

Use an r chunk to find the sum of the solar radiation (the Solar.R variable) by ignoring NAs.

#This code allows us to find the sum of the solar radiation by ignoring NAs. 
sum(airquality$Solar.R, na.rm=TRUE)

## [1] 27146
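Before dropping the NAs, it can help to know how many there are. The sketch below counts the missing values with is.na() and then shows that other summary functions, such as mean(), accept the same na.rm argument; the object names are illustrative.

```r
# Count the missing values in Solar.R, then sum while skipping them.
n_missing <- sum(is.na(airquality$Solar.R))
solar_total <- sum(airquality$Solar.R, na.rm = TRUE)

# Other summary functions, such as mean(), accept na.rm as well.
solar_mean <- mean(airquality$Solar.R, na.rm = TRUE)

# Print the number of NAs that were skipped.
n_missing
```

Reporting the number of skipped values alongside the sum makes clear how much of the data the result is based on.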

Question 8

Often we will need to load our own data into R for analysis. A common file type for storing data is the “.csv” file, which stands for “comma separated values”. The usual way to load a .csv file in R is to read it with the read.csv() function and store the result in an object.

Example: new_data <- read.csv("newdata.csv")
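To try read.csv() without downloading anything, a small file can be written to a temporary location and read back. This is only a sketch: the data frame, its column names, and the object names `csv_path` and `example_data` are made up for illustration.

```r
# Write a small data frame to a temporary .csv file, then read it back.
# The file contents and column names here are made up for illustration.
csv_path <- tempfile(fileext = ".csv")
example_data <- data.frame(id = 1:3, height = c(70.8, 66.2, 71.7))
write.csv(example_data, csv_path, row.names = FALSE)

# read.csv() returns a data frame, which is stored in new_data.
new_data <- read.csv(csv_path)
str(new_data)
```

Setting row.names = FALSE keeps write.csv() from adding an extra column of row numbers to the file.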

For this question:

Download the mens_health csv file from the “Additional Datasets” folder in Blackboard and save it to your computer.
Load the mens_health file into your R Environment. One way to do this is to use the Import Dataset button to load the file. You will then need to copy the line of code from the RStudio Console into an r chunk below in this RMarkdown document to ensure it runs in your final knitted project.

#This code reads the "mens_health" csv file and stores it in an object called mens_health so that it loads when the document is knitted.
mens_health <- read.csv("C:/Users/aless/Downloads/mens_health.csv")

Look at the structure of the dataframe by using the str() function.

#This code allows us to look at the structure of the dataframe. 
str(mens_health)

## 'data.frame':    40 obs. of  14 variables:
##  $ MALE : int  1391 2129 2489 2490 2738 2988 2989 3346 3606 3607 ...
##  $ AGE  : int  58 22 32 31 28 46 41 56 20 54 ...
##  $ HT   : num  70.8 66.2 71.7 68.7 67.6 69.2 66.5 67.2 68.3 65.6 ...
##  $ WT   : num  169 144 179 176 153 ...
##  $ WAIST: num  90.6 78.1 96.5 87.7 87.1 ...
##  $ PULSE: int  68 64 88 72 64 72 60 88 76 60 ...
##  $ SYS  : int  125 107 126 110 110 107 113 126 137 110 ...
##  $ DIAS : int  78 54 81 68 66 83 71 72 85 71 ...
##  $ CHOL : int  522 127 740 49 230 316 590 466 121 578 ...
##  $ BMI  : num  23.8 23.2 24.6 26.2 23.5 24.5 21.5 31.4 26.4 22.7 ...
##  $ LEG  : num  42.5 40.2 44.4 42.8 40 47.3 43.4 40.1 42.1 36 ...
##  $ ELBOW: num  7.7 7.6 7.3 7.5 7.1 7.1 6.5 7.5 7.5 6.9 ...
##  $ WRIST: num  6.4 6.2 5.8 5.9 6 5.8 5.2 5.6 5.5 5.5 ...
##  $ ARM  : num  31.9 31 32.7 33.4 30.1 30.5 27.6 38 32 29.3 ...

In an r chunk below, find the sum of any numeric variable in that dataset.

#This code allows us to find the sum of height (inches) within the 'mens_health' data set. 
sum(mens_health$HT)

## [1] 2733.4
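To see the sums of every numeric variable at once, the numeric columns can be selected first and then passed to colSums(). The sketch below uses the built-in mtcars dataframe as a stand-in for mens_health so that it runs without the downloaded file; the object names are illustrative.

```r
# Flag which columns are numeric, then sum each of those columns.
# mtcars stands in for mens_health so the sketch runs anywhere.
numeric_cols <- sapply(mtcars, is.numeric)
column_sums <- colSums(mtcars[, numeric_cols])

# Look up the sum for a single variable by name.
column_sums["cyl"]
```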