Instructions

  1. Update the author line at the top to have your name in it.
  2. You must knit this document to an HTML file and publish it to RPubs. Once you have published your project to the web, you must copy the URL into the appropriate Course Project assignment in MyOpenMath before 11:59pm on the due date.
  3. Answer all the following questions completely. Some may ask for written responses.
  4. Use R chunks for code to be evaluated where needed and always comment all of your code so the reader can understand what your code aims to accomplish.
  5. Proofread your knitted document before publishing it to ensure it looks the way you want it to. Tip: Use double spaces at the end of a line to create a line break, and make sure no text has a header label it isn’t supposed to have.

Purpose

This first project will introduce you to creating RMarkdown files and doing basic data manipulation with R and RStudio.


Question 1

In a single r chunk, perform two operations: first add the numbers 9 and 23, then divide the number 42 by the sum of 83 and 101. Round your second answer to three decimal places. State both answers in complete sentences beneath your r chunk.

# Find the sum of 9 and 23.
sum(c(9,23))
## [1] 32
sum(9,23)
## [1] 32
9+23
## [1] 32
# Divide number 42 by the sum of 83 and 101.
42/(83+101)
## [1] 0.2282609
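# Round the division result to three decimal places.
round(42/(83+101), 3)
## [1] 0.228

The sum of 9 and 23 is 32. Dividing 42 by the sum of 83 and 101 gives approximately 0.228.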

Question 2

In an r chunk, store the dataframe mtcars into an object called mtcars and use the head() function to look at the first 6 rows of that object.

# Store the dataframe "mtcars" into an object called "mtcars"
mtcars <- mtcars

# Using the "head()" function look at the first 6 rows of the mtcars object
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
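
As a small aside, head() also accepts an n argument that controls how many rows are returned. The sketch below uses only the built-in mtcars data; the object names default_rows and first_three are illustrative choices, not part of the assignment.

```r
# head() returns the first 6 rows by default
default_rows <- head(mtcars)
nrow(default_rows)
## [1] 6

# The n argument changes how many rows are returned
first_three <- head(mtcars, n = 3)
nrow(first_three)
## [1] 3
```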

Question 3

Use the ?mtcars command in the RStudio Console to pull up the help documentation to find out what the qsec variable describes in the mtcars dataset. No r chunk is required here, just a complete sentence answer.

The qsec variable describes the “1/4 mile time” of each car in the “mtcars” dataset.

Question 4

Loading a package into R requires the install.packages() and library() functions. Type install.packages("openintro") into the RStudio Console to install that package onto your computer. The install.packages() function requires a character input, so " " are needed around the package name.

The library() function takes in an object name, so do not use " " around the object name.
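
The difference between the two functions can be sketched as follows. This is a minimal sketch and not part of the assignment: the helper name load_or_install is hypothetical, and the install step assumes a CRAN mirror is reachable. When the package name is held in a character variable, library() needs character.only = TRUE.

```r
# Hypothetical helper: install a package only if it is missing, then load it.
load_or_install <- function(pkg) {
  # requireNamespace() returns FALSE when the package is not installed
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)  # character input, so quotes are needed by the caller
  }
  # character.only = TRUE tells library() that pkg holds the name as a string
  library(pkg, character.only = TRUE)
}

# "stats" ships with R, so this call loads it without installing anything
load_or_install("stats")
```

For this project, the same helper could be called as load_or_install("openintro"), which would skip the install step on later runs.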

For this question, simply use an r chunk to load the openintro library, which contains some datasets we will use in subsequent problems.

# Load the library "openintro" using the library function.
library(openintro)
## Please visit openintro.org for free statistics materials
## 
## Attaching package: 'openintro'
## The following objects are masked from 'package:datasets':
## 
##     cars, trees

Question 5

With the openintro library loaded, you can now access some datasets from it. In an r chunk, find the dimensions of the email dataset by using the dim() function on it.

# First, store the "email" dataframe in the Global Environment.
email <- email

# Find the dimensions of the "email" dataset by using the "dim()" function.
dim(email)
## [1] 3921   21
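
The two numbers returned by dim() are the row count followed by the column count. As a small sketch on the built-in mtcars data (so it runs without openintro), nrow() and ncol() return each piece separately; the name mtcars_dim is an illustrative choice.

```r
# dim() returns the number of rows, then the number of columns
mtcars_dim <- dim(mtcars)
mtcars_dim
## [1] 32 11

# nrow() and ncol() return each piece on its own
nrow(mtcars)
## [1] 32
ncol(mtcars)
## [1] 11
```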

Question 6

The $ sign is used to specify a certain variable of a dataframe. For example, the following will give me the sum of the cylinders variable in the mtcars dataframe.

# Sum of the cyl variable in mtcars dataframe
sum(mtcars$cyl)
## [1] 198
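
As a brief illustration of what $ does, the sketch below shows that it extracts the same vector as double-bracket indexing by name. The object names cyl_dollar and cyl_bracket are illustrative.

```r
# Both lines extract the same column as a numeric vector
cyl_dollar  <- mtcars$cyl
cyl_bracket <- mtcars[["cyl"]]

identical(cyl_dollar, cyl_bracket)
## [1] TRUE
sum(cyl_bracket)
## [1] 198
```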

In the email dataframe, what is the sum of the values in the line_breaks variable? State your answer in a complete sentence beneath the chunk.

# Find the sum of the values in the line_breaks variable in the "email" dataframe.
sum(email$line_breaks)
## [1] 904412

The sum of values in the line_breaks variable of the “email” dataframe equals 904412 line breaks.

Question 7

In the email dataset, the first variable called spam has a value of 0 if the email was not spam and a value of 1 if the email was spam. You can use the table() function on the spam variable to see how many emails fell into the spam and non-spam categories.

In an r chunk, calculate what percent of the emails in this dataframe were considered spam. Then state your answer beneath the chunk in a complete sentence, rounded to the nearest tenth of a percent.

# Using function "table()" find how many emails in the dataframe "email" fell into the spam and non-spam categories.
table(email$spam)
## 
##    0    1 
## 3554  367
# Calculate the percentage of emails in this dataframe that are considered spam. We know that there are 3921 observations in the dataframe. To find the percentage of spam emails, we divide the number of spam emails by the number of observations.

table(email$spam)[2]/dim(email)[1]
##          1 
## 0.09359857
#or 

367/3921
## [1] 0.09359857
# To convert a decimal number to a percentage, multiply it by 100.
0.09359857*100
## [1] 9.359857

The percentage of spam emails rounded to the nearest tenth of a percent in the dataframe “email” is 9.4%.
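
The same percentage can also be computed and rounded in one step. The sketch below rebuilds a 0/1 vector from the counts printed by table() above so that it runs without the openintro package; the names spam_flags and spam_pct are illustrative. With the package loaded, email$spam could take the place of spam_flags.

```r
# Rebuild a 0/1 vector from the counts shown above (3554 non-spam, 367 spam)
spam_flags <- rep(c(0, 1), times = c(3554, 367))

# The mean of a TRUE/FALSE comparison is the proportion of TRUE values
spam_pct <- mean(spam_flags == 1) * 100

# Round to the nearest tenth of a percent
round(spam_pct, 1)
## [1] 9.4
```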

Question 8

We can use the subset() function to get subsets of a dataframe based on a given characteristic.

For example, the code below will store a new dataframe that only contains the observations in which the word “dollar” or a $ sign was found in the email.

# Storing emails with "dollar" in them to an object called money
money <- subset(email, dollar > 0)
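
To illustrate how subset() filters rows, here is a small sketch on the built-in mtcars data, so it runs without openintro. It also shows the equivalent logical indexing; the names four_cyl and four_cyl_alt are illustrative.

```r
# Keep only the cars with 4 cylinders
four_cyl <- subset(mtcars, cyl == 4)
nrow(four_cyl)
## [1] 11

# The same rows selected with logical indexing
four_cyl_alt <- mtcars[mtcars$cyl == 4, ]
identical(rownames(four_cyl), rownames(four_cyl_alt))
## [1] TRUE
```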

Use an r chunk to find the percent of emails in the money object that are spam.

# Find how many emails in the dataframe "money" are spam.

emails_with_spam <- subset(money, spam == 1)

#or

table(money$spam)
## 
##   0   1 
## 668  78
# Find the percent of emails in the "money" object that are spam. There are 78 observations in the dataframe "emails_with_spam" (table() confirms 78 spam emails) and 746 observations in the dataframe "money", so we divide 78 by 746 and multiply the result by 100.

dim(emails_with_spam)[1]/dim(money)[1]*100
## [1] 10.45576
#or 

78/746*100
## [1] 10.45576

Rounded to the nearest hundredth of a percent, 10.46% of the emails in the “money” dataframe are spam.

Comparing your answers from #7 and #8, do dollar signs or the word “dollar” seem to be a key characteristic of emails that are spam? Explain why or why not.

Dollar signs and the word “dollar” do not appear to be a key characteristic of spam emails. Only 10.46% of the emails that contain a dollar sign or the word “dollar” are spam, which is only slightly higher than the overall spam rate of 9.36% in the full “email” dataframe. Because the two percentages are so close, knowing that an email mentions dollars tells us little about whether it is spam.
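
The comparison can be made explicit with a short calculation. This sketch uses only the counts printed by table() in Questions 7 and 8; the names spam_rate_all and spam_rate_money are illustrative.

```r
# Counts taken from the table() output in Questions 7 and 8
spam_rate_all   <- 367 / 3921 * 100   # spam rate among all emails
spam_rate_money <- 78 / 746 * 100     # spam rate among emails mentioning dollars

# Difference between the two rates, in percentage points
round(spam_rate_money - spam_rate_all, 1)
## [1] 1.1
```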

Question 9

Type View(airquality) into your RStudio console (not in an r chunk) and you will see a tab open up with the dataset in it for viewing. The dataframe gives air quality measurements in NYC over a 153 day period in the year 1973. In the chunk below, I have stored the airquality dataframe into an object called airquality and returned the names of the variables.

# Store airquality dataframe in object called airquality
airquality <- airquality

# Find the names of the variables in the dataframe
names(airquality)
## [1] "Ozone"   "Solar.R" "Wind"    "Temp"    "Month"   "Day"

Type ?airquality into the RStudio console to find out what the variable names in the dataframe represent.

A data frame with 153 observations on 6 variables.

[,1] Ozone    numeric  Ozone (ppb)
[,2] Solar.R  numeric  Solar R (lang)
[,3] Wind     numeric  Wind (mph)
[,4] Temp     numeric  Temperature (degrees F)
[,5] Month    numeric  Month (1–12)
[,6] Day      numeric  Day of month (1–31)

The chunk below tries to find the sum of the solar radiation in langleys (a unit of solar radiation), but there is a problem: it returns NA.

# Find sum of Solar.R variable
sum(airquality$Solar.R)
## [1] NA

If you type View(airquality) into your RStudio console again, you can see that there are some values in the Solar.R variable, but also some NA values, which the sum function cannot handle.

Type ?sum into your RStudio console and look at the help documentation for that function. You’ll see in the Usage section sum(..., na.rm = FALSE). The second argument, na.rm, stands for “remove NAs” and this function defaults to FALSE, or no, do not remove NAs. If we want the sum function to ignore the NAs in a vector, we just need to set na.rm = TRUE as a second argument inside the function.
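
The effect of na.rm can be seen on a tiny vector. This is a minimal sketch: the vector toy is made up for illustration, while the last line counts the missing values in the real Solar.R variable.

```r
# A toy vector with one missing value (illustrative, not from the dataset)
toy <- c(10, NA, 5)

sum(toy)                # NA, because one element is missing
## [1] NA
sum(toy, na.rm = TRUE)  # NAs are dropped before summing
## [1] 15

# Count how many values are missing in the Solar.R variable
sum(is.na(airquality$Solar.R))
## [1] 7
```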

Use an r chunk to find the sum of the solar radiation (the Solar.R variable) by ignoring NAs.

#Find the sum of the solar radiation using sum() function; remove NA values.
sum(airquality$Solar.R, na.rm = TRUE)
## [1] 27146

Question 10

Often we will need to load our own data into R for analysis. A common file type for storing data is the “.csv” (“comma separated values”) file. The usual way to load a .csv file in R is to call the read.csv() function and store the result in an object.

Example:
new_data <- read.csv("newdata.csv")
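
For a version of this example that runs anywhere, the sketch below first writes a small made-up dataframe to a temporary .csv file and then reads it back. The names toy, csv_path, and toy_data and the height column are hypothetical. Building the path with file.path() is one alternative to calling setwd().

```r
# Made-up data standing in for a real file (illustrative values only)
toy <- data.frame(id = 1:3, height = c(70.1, 68.5, 72.0))

# Build a full path to a file inside R's temporary directory
csv_path <- file.path(tempdir(), "toy_data.csv")
write.csv(toy, csv_path, row.names = FALSE)

# Read the file back using its full path, so no setwd() call is needed
toy_data <- read.csv(csv_path)
names(toy_data)
## [1] "id"     "height"

# Sum one variable of the loaded dataframe
sum(toy_data$height)
## [1] 210.6
```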

For this question:

  • Download the mens_health csv file from the Course Projects folder in MyOpenMath and save it to a folder on your computer.
  • Use the read.csv() function to load the file and store it into an object called project_data.
  • The step above will require you to set your working directory at the beginning of the chunk with the setwd() function or to find the full pathway for the file.
  • Find the sum of any variable in that dataset.

# First we need to set our working directory using the setwd() function. To do that, go to the Files tab and find the folder where the files are stored. Clicking "More" and then "Set As Working Directory" prints the path to the new working directory in the console. Copy that path, paste it into setwd(), and run the function to set the new working directory.

setwd("~/Desktop/Statistics")

#Now we can use the read.csv() function to load the mens_health file and to store it into an object called project_data.

project_data <- read.csv("mens_health.csv")

# Find the sum of the HT variable across all observations in the project_data dataframe.

sum(project_data$HT)
## [1] 2733.4