This first project will introduce you to creating RMarkdown files and doing basic data manipulation inside R and RStudio.
In an r chunk, perform two operations: first add the numbers 9 and 23, then divide the number 42 by the sum of 83 and 101. Round your second answer to three decimal places.
# Adding 9 + 23
9+23
## [1] 32
# Dividing 42 by (83 + 101), rounded to three decimal places
round(42 / (83 + 101), 3)
## [1] 0.228
Answer: The sum of the numbers 9 and 23 is 32.
Answer: The quotient of 42 and (83 + 101), rounded to 3 decimal places, is 0.228.
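Division depends on the order of its operands, and parentheses decide which operation runs first. The lines below are a minimal sketch of the second calculation broken into steps; the names numerator, denominator, quotient and rounded are illustrative and not part of the assignment.

```r
# Illustrative names; not part of the assignment
numerator <- 42
denominator <- 83 + 101   # the sum is computed first and stored

quotient <- numerator / denominator   # 42 divided by the sum, not the reverse
rounded <- round(quotient, 3)         # round to three decimal places
print(rounded)
```

Storing the sum first removes any doubt about operator precedence, since 42 / 83 + 101 without parentheses would divide before adding.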
In an r chunk, store the dataframe mtcars into an object called mtcars and use the head() function to look at the first 6 rows of that object.
# Store the mtcars dataframe into an object called mtcars
mtcars <- mtcars
# View the first 6 rows of the mtcars dataframe
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
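head() returns 6 rows by default, and its n argument changes that. A short sketch, assuming only the built-in mtcars dataframe; the object names first_three and last_three are illustrative.

```r
# head() and tail() both accept n, the number of rows to return
first_three <- head(mtcars, n = 3)   # first 3 rows
last_three <- tail(mtcars, n = 3)    # last 3 rows

print(first_three)
print(last_three)
```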
Use the ?mtcars command in the RStudio Console to pull up the help documentation to find out what the qsec variable describes in the mtcars dataset.
?mtcars
Answer: The qsec variable in the mtcars dataset is the car's 1/4 mile time.
## Question 4
Installing a package and loading it are separate steps that use the install.packages() and library() functions. Type install.packages("openintro") into the RStudio Console to install that package onto your computer. The install.packages() function requires a character input, so " " are needed around the input.
The library() function accepts an unquoted package name, so " " are not needed around it.
For this question, simply use an r chunk to load the openintro library, which contains some datasets we will use in subsequent problems.
# Load the openintro package
library(openintro)
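Installation only has to happen once per computer, while library() has to be called in every new R session. One way to check whether a package is already installed is sketched below; is_installed is a hypothetical helper written for this illustration, built on base R's requireNamespace().

```r
# Hypothetical helper (not part of the assignment):
# TRUE if the package is installed, FALSE otherwise, without raising an error
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

print(is_installed("stats"))      # stats ships with R
print(is_installed("openintro"))  # TRUE only once install.packages("openintro") has been run
```

If the second line prints FALSE, running install.packages("openintro") in the Console before calling library(openintro) should resolve it.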
With the openintro library loaded, you can now access some datasets from it. In an r chunk, find the dimensions of the email dataset by using the dim() function on it.
# Dimensions of the email dataframe
dim(email)
## [1] 3921 21
Answer: The email dataframe contains 3921 rows and 21 columns.
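dim() returns the number of rows first and the number of columns second. The sketch below shows the same information through nrow() and ncol(), using the built-in mtcars dataframe so that it runs without loading any package; the object names are illustrative.

```r
# mtcars ships with R, so no library() call is needed here
dims <- dim(mtcars)
n_rows <- nrow(mtcars)   # same value as dims[1]
n_cols <- ncol(mtcars)   # same value as dims[2]

cat("mtcars has", n_rows, "rows and", n_cols, "columns\n")
```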
The $ sign is used to specify a certain variable of a dataframe. For example, the following will give me the sum of the cylinders variable in the mtcars dataframe.
# Sum of the cyl variable in mtcars dataframe
sum(mtcars$cyl)
## [1] 198
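The $ operator is one of several ways to pull a single variable out of a dataframe. As a rough comparison, the sketch below gets the same column with double brackets, which take the variable name as a character string; the object names are illustrative.

```r
total_dollar <- sum(mtcars$cyl)         # $ takes the variable name literally
total_bracket <- sum(mtcars[["cyl"]])   # [[ ]] takes the name as a character string

# [[ ]] also works when the variable name is stored in another object
column <- "cyl"
total_variable <- sum(mtcars[[column]])

print(c(total_dollar, total_bracket, total_variable))
```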
In the email dataframe, what is the sum of the values in the line_breaks variable?
# Sum of the line_breaks variable in the email dataframe
sum(email$line_breaks)
## [1] 904412
Answer: The sum of line breaks in the email dataframe is 904412.
## Question 7
In the email dataset, the first variable, called spam, has a value of 0 if the email was not spam and a value of 1 if the email was spam. You can use the table() function on the spam variable to see how many emails fell into the spam and non-spam categories.
In an r chunk, calculate what percent of the emails in this dataframe were considered spam. (Then state your answer rounded to the nearest tenth of a percent.)
# Table of spam/not-spam emails
table(email$spam)
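##
##    0    1
## 3554  367
# Percent of emails that were spam: spam count divided by the total number of emails
(367 / (3554 + 367)) * 100
## [1] 9.359857
Answer: Of the 3921 emails, 3554 were not spam and 367 were spam, so about 9.4% of the emails were spam.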
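A percent of the whole divides one category's count by the total across all categories, not by the other category's count. A minimal sketch using the two counts from the table() output above; the names counts and percents are illustrative.

```r
# Counts copied from the table(email$spam) output
counts <- c(not_spam = 3554, spam = 367)

# Each count divided by the total number of emails, as a percentage
percents <- counts / sum(counts) * 100
print(round(percents, 1))
```

With the openintro package loaded, prop.table(table(email$spam)) should give the same proportions directly from the data.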
We can use the subset() function to get subsets of a dataframe based on a given characteristic.
For example, the code below will store a new dataframe that only contains the observations in which the word “dollar” or a $ sign was found in the email.
# Storing emails with "dollar" in them to an object called money
money <- subset(email, email$dollar > 0)
Find the percent of emails in the money subset that are spam.
# Count the spam (1) and non-spam (0) emails in the money subset
table(money$spam)
##
##   0   1
## 668  78
# Divide the spam count by the total number of emails in the subset to get a percent
(78 / (668 + 78)) * 100
## [1] 10.45576
Answer: About 10.5% of the emails in the money subset were spam.
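When a variable is coded 0/1, its mean is the proportion of 1s, which gives a percent without typing the counts by hand. The sketch below applies the same subset-then-percent pattern to the built-in mtcars dataframe, where am is 1 for a manual transmission; the object names are illustrative.

```r
# Keep only the four-cylinder cars (mtcars ships with R)
four_cyl <- subset(mtcars, cyl == 4)

# am is coded 0/1, so its mean is the proportion of manual transmissions
percent_manual <- mean(four_cyl$am) * 100
print(round(percent_manual, 1))
```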
Comparing your answers from #7 and #8, do dollar signs or the word “dollar” seem to be a key characteristic of emails that are spam? Explain why or why not.
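Answer: Not strongly. The share of spam among emails that contain the word “dollar” or a $ sign is only about one percentage point higher than the share of spam among all emails, so mentioning dollars does not by itself appear to be a key characteristic of spam.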
Type View(airquality) into your RStudio console (not in an r chunk) and you will see a tab open up with the dataset in it for viewing. The dataframe gives air quality measurements in NYC over a 153 day period in the year 1973. In the chunk below, I have stored the airquality dataframe into an object called airquality and returned the names of the variables.
# Store airquality dataframe in object called airquality
airquality <- airquality
# Find the names of the variables in the dataframe
names(airquality)
## [1] "Ozone" "Solar.R" "Wind" "Temp" "Month" "Day"
?airquality
Type ?airquality into the RStudio console to find out what the variable names in the dataframe represent.
Ozone: Mean ozone in parts per billion from 1300 to 1500 hours at Roosevelt Island.
Solar.R: Solar radiation in Langleys in the frequency band 4000–7700 Angstroms from 0800 to 1200 hours at Central Park.
Wind: Average wind speed in miles per hour at 0700 and 1000 hours at LaGuardia Airport.
Temp: Maximum daily temperature in degrees Fahrenheit at LaGuardia Airport.
The chunk below tries to find the sum of the solar radiation in langleys (a unit of solar radiation), but there is a problem: it returns NA.
# Find sum of Solar.R variable
sum(airquality$Solar.R)
## [1] NA
?sum
If you type View(airquality) into your RStudio console again, you can see that there are some values in the Solar.R variable, but also some NA values, which the sum function cannot handle.
Type ?sum into your RStudio console and look at the help documentation for that function. You’ll see in the Usage section sum(..., na.rm = FALSE). The second argument, na.rm, stands for “remove NAs”, and it defaults to FALSE, meaning do not remove NAs. If we want the sum function to ignore the NAs in a vector, we just need to set na.rm = TRUE as a second argument inside the function.
Find the sum of the solar radiation (the Solar.R variable) by ignoring NAs.
# Sum of Solar.R variable ignoring the NAs
sum(airquality$Solar.R, na.rm = TRUE)
## [1] 27146
Answer: The sum of the Solar.R variable, ignoring NAs, is 27146.
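The effect of na.rm is easiest to see on a short vector. The values below are made up for illustration and the object names are not part of the assignment.

```r
readings <- c(190, 118, NA, 313, NA)   # made-up values with two missing entries

total_default <- sum(readings)               # NA: one missing value makes the total unknown
n_missing <- sum(is.na(readings))            # counts the NAs
total_no_na <- sum(readings, na.rm = TRUE)   # adds only the non-missing values

print(total_default)
print(n_missing)
print(total_no_na)
```

Counting the NAs first with is.na() is a quick way to see how much data the total leaves out.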
Often we will need to load our own data into R for analysis. A common file type for storing data is the “.csv” file, short for “comma separated values”. The usual way to load a .csv file in R is to call the read.csv() function and store the result in an object.
Example:
new_data <- read.csv("newdata.csv")
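To try read.csv() without needing a data file on hand, the sketch below first writes a tiny made-up dataset to a temporary file and then reads it back. The file contents and the names csv_path, name and score are invented for illustration.

```r
# Write a tiny made-up dataset to a temporary .csv file
csv_path <- tempfile(fileext = ".csv")
writeLines(c("name,score", "a,10", "b,20", "c,30"), csv_path)

# Read it back into a dataframe; the first line becomes the variable names
new_data <- read.csv(csv_path)
str(new_data)
print(sum(new_data$score))
```

With a real file, the path passed to read.csv() is interpreted relative to the working directory reported by getwd() unless a full path is given.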
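For this question, set the working directory to the folder where the file exists, read the new dataset and store it as project_data, then find the sum of any variable.
# Check that the working directory is the folder where the file exists
getwd()
## [1] "/Users/nlilly/Downloads"
# Read the new dataset and print it
read.csv("mens_health_1.csv", header = TRUE)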
##     MALE AGE   HT    WT WAIST PULSE SYS DIAS CHOL  BMI  LEG ELBOW WRIST  ARM
##  1  1391  58 70.8 169.1  90.6    68 125   78  522 23.8 42.5   7.7   6.4 31.9
##  2  2129  22 66.2 144.2  78.1    64 107   54  127 23.2 40.2   7.6   6.2 31.0
##  3  2489  32 71.7 179.3  96.5    88 126   81  740 24.6 44.4   7.3   5.8 32.7
##  4  2490  31 68.7 175.8  87.7    72 110   68   49 26.2 42.8   7.5   5.9 33.4
##  5  2738  28 67.6 152.6  87.1    64 110   66  230 23.5 40.0   7.1   6.0 30.1
##  6  2988  46 69.2 166.8  92.4    72 107   83  316 24.5 47.3   7.1   5.8 30.5
##  7  2989  41 66.5 135.0  78.8    60 113   71  590 21.5 43.4   6.5   5.2 27.6
##  8  3346  56 67.2 201.5 103.3    88 126   72  466 31.4 40.1   7.5   5.6 38.0
##  9  3606  20 68.3 175.2  89.1    76 137   85  121 26.4 42.1   7.5   5.5 32.0
## 10  3607  54 65.6 139.0  82.5    60 110   71  578 22.7 36.0   6.9   5.5 29.3
## 11  3608  17 63.0 156.3  86.7    96 109   65   78 27.8 44.2   7.1   5.3 31.7
## 12  3610  73 68.3 186.6 103.3    72 153   87  265 28.1 36.7   8.1   6.7 30.7
## 13  3747  52 73.1 191.1  91.8    56 112   77  250 25.2 48.4   8.0   5.2 34.7
## 14  4832  25 67.6 151.3  75.6    64 119   81  265 23.3 41.0   7.0   5.7 30.6
## 15  4839  29 68.0 209.4 105.5    60 113   82  273 31.9 39.8   6.9   6.0 34.2
## 16  5599  17 71.0 237.1 108.7    64 125   76  272 33.1 45.2   8.3   6.6 41.1
## 17  5600  41 61.3 176.7 104.0    84 131   80  972 33.2 40.2   6.7   5.7 33.1
## 18  5601  52 76.2 220.6 103.0    76 121   75   75 26.7 46.2   7.9   6.0 32.2
## 19  6226  32 66.3 166.1  91.3    84 132   81  138 26.6 39.0   7.5   5.7 31.2
## 20  7190  20 69.7 137.4  75.2    88 112   44  139 19.9 44.8   6.9   5.6 25.9
## 21  7192  20 65.4 164.2  87.7    72 121   65  638 27.1 40.9   7.0   5.6 33.7
## 22  7194  29 70.0 162.4  77.0    56 116   64  613 23.4 43.1   7.5   5.2 30.3
## 23  9073  18 62.9 151.8  85.0    68  95   58  762 27.0 38.0   7.4   5.8 32.8
## 24  9074  26 68.5 144.1  79.6    64 110   70  303 21.6 41.0   6.8   5.7 31.0
## 25 10864  33 68.3 204.6 103.8    60 110   66  690 30.9 46.0   7.4   6.1 36.2
## 26 12349  55 69.4 193.8 103.0    68 125   82   31 28.3 41.4   7.2   6.0 33.6
## 27 15515  53 69.2 172.9  97.1    60 124   79  189 25.5 42.7   6.6   5.9 31.9
## 28 16137  28 68.0 161.9  86.9    60 131   69  957 24.6 40.5   7.3   5.7 32.9
## 29 16521  28 71.9 174.8  88.0    56 109   64  339 23.8 44.2   7.8   6.0 30.9
## 30 16523  37 66.1 169.8  91.5    84 112   79  416 27.4 41.8   7.0   6.1 34.0
## 31 16768  40 72.4 213.3 102.9    72 127   72  120 28.7 47.2   7.5   5.9 34.8
## 32 17006  33 73.0 198.0  93.1    84 132   74  702 26.2 48.2   7.8   6.0 33.6
## 33 18392  26 68.0 173.3  98.9    88 116   81 1252 26.4 42.9   6.7   5.8 31.3
## 34 19017  53 68.7 214.5 107.5    56 125   84  288 32.1 42.8   8.2   5.9 37.6
## 35 19381  36 70.3 137.1  81.6    64 112   77  176 19.6 40.8   7.1   5.3 27.9
## 36 19635  34 63.7 119.5  75.7    56 125   77  277 20.7 42.6   6.6   5.3 26.9
## 37 19991  42 71.1 189.1  95.0    56 120   83  649 26.3 44.9   7.4   6.0 36.9
## 38 20518  18 65.6 164.7  91.1    60 118   68  113 26.9 41.1   7.0   6.1 34.5
## 39 21135  44 68.3 170.1  94.9    64 115   75  656 25.6 44.5   7.3   5.8 32.1
## 40 32230  20 66.3 151.0  79.9    72 115   65  172 24.2 44.0   7.1   5.4 30.7
# Store the dataset as project_data
project_data <- read.csv("mens_health_1.csv")
# Find the sum of any variable
sum(project_data$AGE)
## [1] 1419
Answer: The sum of the AGE variable is 1419.