Project #1 - Intro to R & RStudio

This first project will introduce you to creating RMarkdown files and doing basic data manipulation inside R and RStudio.

Question 1

In an r chunk, perform two operations: first add the numbers 9 and 23, then divide the number 42 by the sum of 83 and 101. Round your second answer to three decimal places.

# Adding 9 + 23
9+23

## [1] 32

# Dividing 42 by (83 + 101)
42/(83+101)

## [1] 0.2282609

Answer: The sum of the numbers 9 and 23 is … 32

Answer: The quotient of 42 and (83 + 101) rounded to 3 decimal places is … 0.228
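The rounding step can also be done in code rather than by hand. The following is a minimal sketch using base R's round() function; the object names quotient and rounded are illustrative only and are not part of the assignment.

```r
# Divide 42 by the sum of 83 and 101
quotient <- 42 / (83 + 101)

# Round the result to 3 decimal places
rounded <- round(quotient, 3)
print(rounded)
```

Note that the parentheses around 83 + 101 matter: without them, R would divide 42 by 83 first and then add 101.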

Question 2

In an r chunk, store the dataframe mtcars into an object called mtcars and use the head() function to look at the first 6 rows of that object.

# Store the mtcars dataframe into an object called mtcars
mtcars <- mtcars

# View the first 6 rows of the mtcars dataframe
head(mtcars)

##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
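As a small aside, head() returns 6 rows by default, but the number of rows can be changed with its n argument, and tail() does the same from the bottom of the dataframe. The sketch below assumes only the built-in mtcars dataset; the object names first_three and last_two are illustrative.

```r
# Look at only the first 3 rows instead of the default 6
first_three <- head(mtcars, n = 3)
print(first_three)

# tail() works the same way, starting from the end of the dataframe
last_two <- tail(mtcars, n = 2)
print(last_two)
```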

Question 3

Use the ?mtcars command in the RStudio Console to pull up the help documentation to find out what the qsec variable describes in the mtcars dataset.

?mtcars

Answer: The qsec variable in the mtcars dataset … is the 1/4 mile time.

Question 4

Loading a package into R requires the install.packages() and library() functions. Type install.packages("openintro") into the RStudio Console to install that package onto your computer. The install.packages() function requires a character input, so " " are needed around the input.

The library() function takes in an object name, so do not use " " around the object name.

For this question, simply use an r chunk to load the library openintro, which contains some datasets we will use in subsequent problems.

# Load (openintro)
library(openintro)
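One possible way to avoid an error when a package has not been installed yet is to check for it before loading it. The sketch below is an illustration, not part of the assignment: it uses base R's requireNamespace() to test whether the package is available, and the variable name pkg is an assumption made for the example. It deliberately does not install anything on its own.

```r
# Name of the package to load (stored as a character string)
pkg <- "openintro"

if (!requireNamespace(pkg, quietly = TRUE)) {
  # Package is missing: remind the user to install it from the Console
  message("Package '", pkg, "' is not installed. Run install.packages(\"", pkg, "\") first.")
} else {
  # character.only = TRUE tells library() that pkg holds the name as a string
  library(pkg, character.only = TRUE)
}
```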

Question 5

With the openintro library loaded, you can now access some datasets from it. In an r chunk, find the dimensions of the email dataset by using the dim() function on it.

# Dimensions of the email dataframe
dim(email)

## [1] 3921   21

Answer: The email dataframe contains … 3921 rows and 21 columns.
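The two numbers returned by dim() are the number of rows followed by the number of columns. As a brief sketch, the same values can be pulled out one at a time with nrow() and ncol(); the example uses the built-in mtcars dataframe so that it runs without any extra packages.

```r
# dim() returns rows first, then columns
dims <- dim(mtcars)
print(dims)

# nrow() and ncol() return each value separately
n_rows <- nrow(mtcars)
n_cols <- ncol(mtcars)
print(n_rows)
print(n_cols)
```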

Question 6

The $ sign is used to specify a certain variable of a dataframe. For example, the following will give me the sum of the cylinders variable in the mtcars dataframe.

# Sum of the cyl variable in mtcars dataframe
sum(mtcars$cyl)

## [1] 198
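For comparison, a variable can also be extracted with double square brackets and the variable name in quotes. The sketch below shows that both forms return the same values for the cyl variable; the object names cyl_dollar and cyl_brackets are illustrative.

```r
# Extract the cyl variable with the $ sign
cyl_dollar <- mtcars$cyl

# Extract the same variable with [[ ]] and a quoted name
cyl_brackets <- mtcars[["cyl"]]

# Check that both approaches give identical results
same_values <- identical(cyl_dollar, cyl_brackets)
print(same_values)

# Either version can be passed to sum()
print(sum(cyl_brackets))
```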

In the email dataframe, what is the sum of the values in the line_breaks variable?

# Sum of the line_breaks variable in the email dataframe
sum(email$line_breaks)

## [1] 904412

Answer: The sum of line breaks in the email dataframe is … 904412

Question 7

In the email dataset, the first variable called spam has a value of 0 if the email was not spam and a value of 1 if the email was spam. You can use the table() function on the spam variable to see how many emails fell into the spam and non-spam categories.

In an r chunk, calculate what percent of the emails in this dataframe were considered spam. (Then state your answer rounded to the nearest tenth of a percent.)

# Table of spam/not-spam emails
table(email$spam)
## 
##    0    1 
## 3554  367

# Percent of emails that were spam (spam count divided by the total number of emails)
(367/(3554+367))*100
## [1] 9.359857

Answer: Of the 3921 emails, 3554 were not spam and 367 were spam, so about 9.4% were spam.
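A percent is the count in one category divided by the total count across all categories, not divided by the count in the other category. As a hedged sketch of how R can do this without typing the counts by hand, the example below uses prop.table() on a table, and also mean() on a logical comparison. It uses the am variable from the built-in mtcars dataframe (0 or 1, like spam) so that it runs without openintro; the object names are illustrative.

```r
# Counts of each value of a 0/1 variable
counts <- table(mtcars$am)
print(counts)

# prop.table() divides each count by the total of all counts
proportions <- prop.table(counts)
percent_from_table <- as.numeric(proportions["1"]) * 100

# Equivalent shortcut: the mean of TRUE/FALSE values is the proportion of TRUEs
percent_from_mean <- mean(mtcars$am == 1) * 100

print(percent_from_table)
print(percent_from_mean)
```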

Question 8

We can use the subset() function to get subsets of a dataframe based on a given characteristic.

For example, the code below will store a new dataframe that only contains the observations in which the word “dollar” or a $ sign was found in the email.

# Storing emails with "dollar" in them to an object called money
money <- subset(email, email$dollar > 0)
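Inside subset(), variable names can also be written without the dataframe name and $ sign, because the condition is evaluated within the dataframe. The sketch below illustrates this with the built-in mtcars dataframe; the object name four_cyl is an assumption for the example.

```r
# Keep only the cars with 4 cylinders; cyl is looked up inside mtcars
four_cyl <- subset(mtcars, cyl == 4)

# Number of rows that met the condition
print(nrow(four_cyl))

# Every remaining row has cyl equal to 4
print(unique(four_cyl$cyl))
```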

Find the percent of emails in the money subset that are spam.

# Count the spam (1) and non-spam (0) emails in the money subset
table(money$spam)
## 
##   0   1 
## 668  78
# Divide the spam count by the total number of emails in the subset and convert to a percent
(78/(668+78))*100
## [1] 10.45576

Answer: About 10.5% of the emails in the money subset were spam.

Comparing your answers from #7 and #8, do dollar signs or the word “dollar” seem to be a key characteristic of emails that are spam? Explain why or why not.

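Answer: Not strongly. The percent of spam among emails that contain the word “dollar” or a $ sign is only about one percentage point higher than the percent of spam among all emails, so dollar signs or the word “dollar” do not seem to be a key characteristic of spam emails.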

Question 9

Type View(airquality) into your RStudio console (not in an r chunk) and you will see a tab open up with the dataset in it for viewing. The dataframe gives air quality measurements in NYC over a 153-day period in the year 1973. In the chunk below, I have stored the airquality dataframe into an object called airquality and returned the names of the variables.

# Store airquality dataframe in object called airquality
airquality <- airquality

# Find the names of the variables in the dataframe
names(airquality)

## [1] "Ozone"   "Solar.R" "Wind"    "Temp"    "Month"   "Day"

?airquality

Type ?airquality into the RStudio console to find out what the variable names in the dataframe represent.

Ozone: Mean ozone in parts per billion from 1300 to 1500 hours at Roosevelt Island

Solar.R: Solar radiation in Langleys in the frequency band 4000–7700 Angstroms from 0800 to 1200 hours at Central Park

Wind: Average wind speed in miles per hour at 0700 and 1000 hours at LaGuardia Airport

Temp: Maximum daily temperature in degrees Fahrenheit at La Guardia Airport.

The chunk below tries to find the sum of the solar radiation in langleys (a unit of solar radiation), but there is a problem: it returns NA.

# Find sum of Solar.R variable
sum(airquality$Solar.R)

## [1] NA

?sum

If you type View(airquality) into your RStudio console again, you can see that there are some values in the Solar.R variable, but also some NA values, which the sum function cannot handle.

Type ?sum into your RStudio console and look at the help documentation for that function. You’ll see in the Usage section sum(..., na.rm = FALSE). The na.rm argument stands for “remove NAs” and it defaults to FALSE, or no, do not remove NAs. If we want the sum function to ignore the NAs in a vector, we just need to set na.rm = TRUE as an additional argument inside the function.

Find the sum of the solar radiation (the Solar.R variable) by ignoring NAs.

# Sum of Solar.R variable ignoring the NAs
sum(airquality$Solar.R , na.rm = TRUE)

## [1] 27146

The sum of the Solar.R variable is … 27146
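The effect of na.rm is easiest to see on a small vector. The sketch below uses a made-up vector called readings (not data from airquality) to show the default behavior, the behavior with na.rm = TRUE, and one way to count how many NAs are present using is.na().

```r
# A small made-up vector with one missing value
readings <- c(10, NA, 25, 5)

# Default behavior: any NA makes the whole sum NA
default_sum <- sum(readings)
print(default_sum)

# With na.rm = TRUE the NA is dropped before summing
clean_sum <- sum(readings, na.rm = TRUE)
print(clean_sum)

# is.na() flags missing values; summing the flags counts them
na_count <- sum(is.na(readings))
print(na_count)
```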

Question 10

Often we will need to load our own data into R for analysis. A common file type that data is stored in is called a “.csv” file or “comma separated values”. The common way to load a .csv in R is with the read.csv() function and storing it to an object.

Example:
new_data <- read.csv("newdata.csv")

For this question:

Download the mens_health csv file from the Course Projects folder in MyOpenMath and save it to a folder on your computer.
Set the working directory in an r chunk to the folder where you saved the file.
Use the read.csv() function to load the file and store it into an object called project_data.
Find the sum of any variable in that dataset.
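The steps above can be sketched end to end without the course file. The example below is an illustration under stated assumptions: it writes a small made-up dataframe to a temporary .csv file, then reads it back with read.csv() and sums one variable. The names example_data, csv_path, and loaded, and the id and score variables, are hypothetical and are not part of the mens_health dataset.

```r
# Made-up data standing in for a downloaded file
example_data <- data.frame(id = 1:3, score = c(90, 85, 77))

# Build a path inside R's temporary folder and write the .csv there
csv_path <- file.path(tempdir(), "example_data.csv")
write.csv(example_data, csv_path, row.names = FALSE)

# Read the .csv back in and store it in an object
loaded <- read.csv(csv_path)

# Find the sum of one variable in the loaded data
score_total <- sum(loaded$score)
print(score_total)
```

Passing a full path to read.csv(), as done here, is an alternative to changing the working directory with setwd() and then using only the file name.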

# Set working directory to the folder where file exists
# (getwd() confirms the working directory is already the folder containing the file)
getwd()

## [1] "/Users/nlilly/Downloads"

# Read the file and print its contents
read.csv("mens_health_1.csv", header = TRUE)

##     MALE AGE   HT    WT WAIST PULSE SYS DIAS CHOL  BMI  LEG ELBOW WRIST
## 1   1391  58 70.8 169.1  90.6    68 125   78  522 23.8 42.5   7.7   6.4
## 2   2129  22 66.2 144.2  78.1    64 107   54  127 23.2 40.2   7.6   6.2
## 3   2489  32 71.7 179.3  96.5    88 126   81  740 24.6 44.4   7.3   5.8
## 4   2490  31 68.7 175.8  87.7    72 110   68   49 26.2 42.8   7.5   5.9
## 5   2738  28 67.6 152.6  87.1    64 110   66  230 23.5 40.0   7.1   6.0
## 6   2988  46 69.2 166.8  92.4    72 107   83  316 24.5 47.3   7.1   5.8
## 7   2989  41 66.5 135.0  78.8    60 113   71  590 21.5 43.4   6.5   5.2
## 8   3346  56 67.2 201.5 103.3    88 126   72  466 31.4 40.1   7.5   5.6
## 9   3606  20 68.3 175.2  89.1    76 137   85  121 26.4 42.1   7.5   5.5
## 10  3607  54 65.6 139.0  82.5    60 110   71  578 22.7 36.0   6.9   5.5
## 11  3608  17 63.0 156.3  86.7    96 109   65   78 27.8 44.2   7.1   5.3
## 12  3610  73 68.3 186.6 103.3    72 153   87  265 28.1 36.7   8.1   6.7
## 13  3747  52 73.1 191.1  91.8    56 112   77  250 25.2 48.4   8.0   5.2
## 14  4832  25 67.6 151.3  75.6    64 119   81  265 23.3 41.0   7.0   5.7
## 15  4839  29 68.0 209.4 105.5    60 113   82  273 31.9 39.8   6.9   6.0
## 16  5599  17 71.0 237.1 108.7    64 125   76  272 33.1 45.2   8.3   6.6
## 17  5600  41 61.3 176.7 104.0    84 131   80  972 33.2 40.2   6.7   5.7
## 18  5601  52 76.2 220.6 103.0    76 121   75   75 26.7 46.2   7.9   6.0
## 19  6226  32 66.3 166.1  91.3    84 132   81  138 26.6 39.0   7.5   5.7
## 20  7190  20 69.7 137.4  75.2    88 112   44  139 19.9 44.8   6.9   5.6
## 21  7192  20 65.4 164.2  87.7    72 121   65  638 27.1 40.9   7.0   5.6
## 22  7194  29 70.0 162.4  77.0    56 116   64  613 23.4 43.1   7.5   5.2
## 23  9073  18 62.9 151.8  85.0    68  95   58  762 27.0 38.0   7.4   5.8
## 24  9074  26 68.5 144.1  79.6    64 110   70  303 21.6 41.0   6.8   5.7
## 25 10864  33 68.3 204.6 103.8    60 110   66  690 30.9 46.0   7.4   6.1
## 26 12349  55 69.4 193.8 103.0    68 125   82   31 28.3 41.4   7.2   6.0
## 27 15515  53 69.2 172.9  97.1    60 124   79  189 25.5 42.7   6.6   5.9
## 28 16137  28 68.0 161.9  86.9    60 131   69  957 24.6 40.5   7.3   5.7
## 29 16521  28 71.9 174.8  88.0    56 109   64  339 23.8 44.2   7.8   6.0
## 30 16523  37 66.1 169.8  91.5    84 112   79  416 27.4 41.8   7.0   6.1
## 31 16768  40 72.4 213.3 102.9    72 127   72  120 28.7 47.2   7.5   5.9
## 32 17006  33 73.0 198.0  93.1    84 132   74  702 26.2 48.2   7.8   6.0
## 33 18392  26 68.0 173.3  98.9    88 116   81 1252 26.4 42.9   6.7   5.8
## 34 19017  53 68.7 214.5 107.5    56 125   84  288 32.1 42.8   8.2   5.9
## 35 19381  36 70.3 137.1  81.6    64 112   77  176 19.6 40.8   7.1   5.3
## 36 19635  34 63.7 119.5  75.7    56 125   77  277 20.7 42.6   6.6   5.3
## 37 19991  42 71.1 189.1  95.0    56 120   83  649 26.3 44.9   7.4   6.0
## 38 20518  18 65.6 164.7  91.1    60 118   68  113 26.9 41.1   7.0   6.1
## 39 21135  44 68.3 170.1  94.9    64 115   75  656 25.6 44.5   7.3   5.8
## 40 32230  20 66.3 151.0  79.9    72 115   65  172 24.2 44.0   7.1   5.4
##     ARM
## 1  31.9
## 2  31.0
## 3  32.7
## 4  33.4
## 5  30.1
## 6  30.5
## 7  27.6
## 8  38.0
## 9  32.0
## 10 29.3
## 11 31.7
## 12 30.7
## 13 34.7
## 14 30.6
## 15 34.2
## 16 41.1
## 17 33.1
## 18 32.2
## 19 31.2
## 20 25.9
## 21 33.7
## 22 30.3
## 23 32.8
## 24 31.0
## 25 36.2
## 26 33.6
## 27 31.9
## 28 32.9
## 29 30.9
## 30 34.0
## 31 34.8
## 32 33.6
## 33 31.3
## 34 37.6
## 35 27.9
## 36 26.9
## 37 36.9
## 38 34.5
## 39 32.1
## 40 30.7

# Store the dataset in an object called project_data
project_data <- read.csv("mens_health_1.csv")

# Find the sum of any variable
sum(project_data$AGE)

## [1] 1419

Answer: The sum of the AGE variable is 1419.