This first project will introduce you to creating RMarkdown files and doing basic data manipulation inside R and RStudio.
In an r chunk, perform two operations: first add the numbers 9 and 23, then divide the number 42 by the sum of 83 and 101. Round your second answer to three decimal places.
# Adding 9 + 23
9+23
## [1] 32
# Dividing 42 by (83 + 101)
round(42/(83+101),3)
## [1] 0.228
Answer: The sum of the numbers 9 and 23 is … 32
Answer: The quotient of 42 and (83 + 101) rounded to 3 decimal places is … 0.228
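The parentheses in the second calculation matter because R performs division before addition. A minimal sketch of the difference follows; the names with_parens and without_parens are chosen here purely for illustration.

```r
# Division is evaluated before addition, so the parentheses change the result
with_parens    <- 42 / (83 + 101)   # 42 divided by the sum 184
without_parens <- 42 / 83 + 101     # (42 / 83), then 101 is added

round(with_parens, 3)
round(without_parens, 3)
```

Without the parentheses the result is larger than 101, which is a quick sign that the sum was not used as the divisor.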
In an r chunk, store the dataframe mtcars into an object called mtcars and use the head() function to look at the first 6 rows of that object.
# Store the mtcars dataframe into an object called mtcars
mtcars <- mtcars
# View the first 6 rows of the mtcars dataframe
head(mtcars,6)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
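Since head() returns six rows by default, the second argument is optional in the call above. The sketch below shows the default alongside a different row count and the companion function tail(); the object names are illustrative only.

```r
# head() returns the first 6 rows by default, so n is optional in that case
first_six   <- head(mtcars)
first_three <- head(mtcars, n = 3)   # any other row count can be requested
last_three  <- tail(mtcars, n = 3)   # tail() works the same way from the bottom

identical(first_six, head(mtcars, 6))
nrow(first_three)
nrow(last_three)
```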
Use the ?mtcars command in the RStudio Console to pull up the help documentation to find out what the qsec variable describes in the mtcars dataset.
Answer: The qsec variable in the mtcars dataset … is the 1/4 mile time.
Using a package in R takes two steps: installing it once with install.packages() and then loading it with library() in each new session. Type install.packages("openintro") into the RStudio Console to install that package onto your computer. The install.packages() function requires a character input, so " " are needed around the package name.
The library() function accepts an unquoted package name, so the " " are not needed there.
For this question, simply use an r chunk to load the openintro library, which contains some datasets we will use in subsequent problems.
# Load openintro library
library(openintro)
## Warning: package 'openintro' was built under R version 3.3.3
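If a file is knitted repeatedly, it can be convenient to install a package only when it is missing. The following is one possible sketch, not a required part of the project. It uses the built-in stats package as a stand-in so that it runs without a network connection; substituting "openintro" is left as an assumption.

```r
# Package name as a character string. "stats" ships with R and is only a
# stand-in here (an assumption); replace it with "openintro" for this project.
pkg <- "stats"

# Install only if the package is not already available
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)
}

# character.only = TRUE tells library() that pkg holds the name as a string
library(pkg, character.only = TRUE)
```

The character.only argument is needed because library() otherwise treats pkg itself as the package name.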
With the openintro library loaded, you can now access some datasets from it. In an r chunk, find the dimensions of the email dataset by using the dim() function on it.
# Dimensions of the email dataframe
dim(email)
## [1] 3921 21
Answer: The email dataframe contains … 3921 rows (observations) and 21 columns (variables).
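The two numbers returned by dim() can also be retrieved one at a time with nrow() and ncol(). The sketch below uses the built-in mtcars dataframe as a stand-in so that it runs without the openintro package.

```r
# dim() returns rows first, then columns
size <- dim(mtcars)
size

# nrow() and ncol() return each dimension separately
nrow(mtcars)
ncol(mtcars)
```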
The $ sign is used to specify a certain variable of a dataframe. For example, the following will give me the sum of the cylinders variable in the mtcars dataframe.
# Sum of the cyl variable in mtcars dataframe
sum(mtcars$cyl)
## [1] 198
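The $ sign is one of several ways to extract a single variable. As a brief sketch, the following shows two equivalent forms that take the variable name as a character string; the object names are illustrative.

```r
# Three equivalent ways to pull the cyl variable out of mtcars as a vector
by_dollar       <- mtcars$cyl
by_brackets     <- mtcars[["cyl"]]
by_matrix_index <- mtcars[, "cyl"]

sum(by_dollar)
identical(by_dollar, by_brackets)
identical(by_dollar, by_matrix_index)
```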
In the email dataframe, what is the sum of the values in the line_breaks variable?
# Sum of the line_breaks variable in the email dataframe
sum(email$line_breaks)
## [1] 904412
Answer: The sum of line breaks in the email dataframe is … 904412.
In the email dataset, the first variable called spam has a value of 0 if the email was not spam and a value of 1 if the email was spam. You can use the table() function on the spam variable to see how many emails fell into the spam and non-spam categories.
In an r chunk, calculate what percent of the emails in this dataframe were considered spam. (Then state your answer rounded to the nearest tenth of a percent.)
# Table of spam/not-spam emails
table(email$spam)
##
## 0 1
## 3554 367
# Percent of emails that were spam
# We can calculate this for the specific case here, where the number of spam emails is 367 and the number of total emails is 3554+367. Then the percentage is 367/(3554+367)*100.
round(367/(3554+367)*100,1)
## [1] 9.4
# A more robust method calculates the percentage based on the parameters of the dataframe. If the data changes, the calculation will still work. Here the percentage is calculated as the sum of the spam column divided by the total number of rows (observations or emails).
round(sum(email$spam)/nrow(email)*100,1)
## [1] 9.4
Answer: 9.4%.
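Because the variable is coded 0 or 1, its mean is the proportion of 1s, which offers a third way to get the percentage. The sketch below uses the am variable of mtcars (0 = automatic, 1 = manual) as a stand-in for email$spam so that it runs without the openintro package.

```r
# am is coded 0 or 1, the same style of coding as email$spam
share_manual <- mean(mtcars$am) * 100
round(share_manual, 1)

# prop.table() converts the counts from table() into proportions
shares <- prop.table(table(mtcars$am)) * 100
round(shares, 1)
```

The prop.table() version reports both categories at once, which makes it easy to check that the percentages add up to 100.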
We can use the subset() function to get subsets of a dataframe based on a given characteristic.
For example, the code below will store a new dataframe that only contains the observations in which the word “dollar” or a $ sign was found in the email.
# Storing emails with "dollar" in them to an object called money
money <- subset(email, email$dollar > 0)
Find the percent of emails in the money subset that are spam.
# Enter your own comments here explaining your code
# I have a couple of ways to find the percentage of emails in the money subset that are spam. The first is to create a subset of money that contains only spam emails, called "money_spam", and then divide the number of rows in money_spam by the number of rows in money.
money_spam <- subset(money,money$spam>0)
round(nrow(money_spam)/nrow(money)*100,1)
## [1] 10.5
#The second method is to divide the sum of the money$spam column by the total number of rows in money. I like this method more, but it doesn't force me to practice creating subsets.
round(sum(money$spam)/nrow(money)*100,1)
## [1] 10.5
Answer: 10.5%.
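The same kind of subset can also be built with logical indexing inside square brackets. The sketch below compares the two approaches on the built-in mtcars dataframe, used here as a stand-in; inside subset(), variable names can be written without the dataframe prefix.

```r
# subset() and logical indexing select the same rows when there are no NAs
four_cyl_subset <- subset(mtcars, cyl == 4)
four_cyl_index  <- mtcars[mtcars$cyl == 4, ]

nrow(four_cyl_subset)
nrow(four_cyl_index)

# Percent of manual cars (am == 1) among the 4-cylinder cars
round(mean(four_cyl_subset$am) * 100, 1)
```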
Comparing your answers from #7 and #8, do dollar signs or the word “dollar” seem to be a key characteristic of emails that are spam? Explain why or why not.
Answer: Dollar signs and the word “dollar” do NOT seem to be a key characteristic of emails that are spam. 9.4% of all emails are spam, while 10.5% of the emails that contain a dollar sign or the word “dollar” are spam. The difference between those percentages isn’t large enough to confidently distinguish between spam and non-spam based on the mention of money.
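One way to make this kind of comparison more direct is to compute the spam rate separately for emails with and without a mention of money. The sketch below does this with tapply() on a small made-up dataframe; the object toy and all of its values are hypothetical and are not taken from the email dataset.

```r
# Made-up values for illustration only, not taken from the email dataset
toy <- data.frame(
  dollar = c(0, 2, 0, 1, 0, 0, 3, 0),   # count of money mentions per email
  spam   = c(0, 1, 0, 0, 1, 0, 0, 0)    # 1 = spam, 0 = not spam
)

# Mean of the 0/1 spam variable within each group gives the group's spam rate
rates <- tapply(toy$spam, toy$dollar > 0, mean)
round(rates * 100, 1)
```

The result has one entry named FALSE (no mention of money) and one named TRUE (at least one mention), so the two rates can be read side by side.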
Type View(airquality) into your RStudio console (not in an r chunk) and you will see a tab open up with the dataset in it for viewing. The dataframe gives air quality measurements in NYC over a 153 day period in the year 1973. In the chunk below, I have stored the airquality dataframe into an object called airquality and returned the names of the variables.
# Store airquality dataframe in object called airquality
airquality <- airquality
# Find the names of the variables in the dataframe
names(airquality)
## [1] "Ozone" "Solar.R" "Wind" "Temp" "Month" "Day"
Type ?airquality into the RStudio console to find out what the variable names in the dataframe represent.
The chunk below tries to find the sum of the solar radiation in langleys (a unit of solar radiation), but there is a problem: it returns NA.
# Find sum of Solar.R variable
sum(airquality$Solar.R)
## [1] NA
If you type View(airquality) into your RStudio console again, you can see that there are some values in the Solar.R variable, but also some NA values, which the sum function cannot handle.
Type ?sum into your RStudio console and look at the help documentation for that function. You’ll see sum(..., na.rm = FALSE) in the Usage section. The na.rm argument stands for “remove NAs” and defaults to FALSE, meaning no, do not remove NAs. If we want the sum function to ignore the NAs in a vector, we just need to set na.rm = TRUE as an additional argument inside the function.
Find the sum of the solar radiation (the Solar.R variable) by ignoring NAs.
# Sum of Solar.R variable ignoring the NAs
sum(airquality$Solar.R,na.rm = TRUE)
## [1] 27146
The sum of the Solar.R variable is … 27146.
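Before ignoring NAs, it can help to know how many there are. The following sketch counts the missing values in Solar.R and confirms that na.rm = TRUE gives the same total as dropping the NAs by hand; the object names are illustrative.

```r
solar <- airquality$Solar.R

# Count how many values are missing
n_missing <- sum(is.na(solar))
n_missing

# na.rm = TRUE gives the same total as removing the NAs manually
total_na_rm  <- sum(solar, na.rm = TRUE)
total_manual <- sum(solar[!is.na(solar)])
identical(total_na_rm, total_manual)

# mean() accepts the na.rm argument as well
mean(solar, na.rm = TRUE)
```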
Often we will need to load our own data into R for analysis. Data is commonly stored in a “.csv” file, short for “comma separated values”. The usual way to load a .csv file in R is to call the read.csv() function and store the result in an object.
Example:
new_data <- read.csv("newdata.csv")
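To try read.csv() without having a data file on hand, one option is to write a small dataframe to a temporary file and read it back. This is only a sketch: the dataframe example_data and its values are made up for illustration.

```r
# Write a small made-up dataframe to a temporary .csv file (hypothetical data)
csv_path <- tempfile(fileext = ".csv")
example_data <- data.frame(id = 1:3, score = c(90.5, 82, 77.25))
write.csv(example_data, csv_path, row.names = FALSE)

# Read the file back; read.csv() returns a dataframe
new_data <- read.csv(csv_path)
str(new_data)
sum(new_data$score)
```

Setting row.names = FALSE keeps write.csv() from adding an extra column of row labels to the file.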
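For this question, read a .csv file into R, store it in an object called project_data, and find the sum of one of its variables.
# Set working directory to the folder where the file exists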
setwd("C:/Users/Henhoag/Desktop/Math143H/Projects")
# Read new dataset and store it as project_data
project_data <- read.csv("mens_health.csv")
# Find the sum of any variable
# Let's find out what is in mens_health first.
str(project_data)
## 'data.frame': 40 obs. of 14 variables:
## $ MALE : int 1391 2129 2489 2490 2738 2988 2989 3346 3606 3607 ...
## $ AGE : int 58 22 32 31 28 46 41 56 20 54 ...
## $ HT : num 70.8 66.2 71.7 68.7 67.6 69.2 66.5 67.2 68.3 65.6 ...
## $ WT : num 169 144 179 176 153 ...
## $ WAIST: num 90.6 78.1 96.5 87.7 87.1 ...
## $ PULSE: int 68 64 88 72 64 72 60 88 76 60 ...
## $ SYS : int 125 107 126 110 110 107 113 126 137 110 ...
## $ DIAS : int 78 54 81 68 66 83 71 72 85 71 ...
## $ CHOL : int 522 127 740 49 230 316 590 466 121 578 ...
## $ BMI : num 23.8 23.2 24.6 26.2 23.5 24.5 21.5 31.4 26.4 22.7 ...
## $ LEG : num 42.5 40.2 44.4 42.8 40 47.3 43.4 40.1 42.1 36 ...
## $ ELBOW: num 7.7 7.6 7.3 7.5 7.1 7.1 6.5 7.5 7.5 6.9 ...
## $ WRIST: num 6.4 6.2 5.8 5.9 6 5.8 5.2 5.6 5.5 5.5 ...
## $ ARM : num 31.9 31 32.7 33.4 30.1 30.5 27.6 38 32 29.3 ...
#The variables in this dataset are all of class integer or numeric, so any of them can be used with the sum function. Let's add up all the ages.
sum(project_data$AGE)
## [1] 1419
The sum of all ages in the mens_health data set is 1419.
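Rather than reading the str() output by eye, the check that every variable is numeric can be done in code, and all of the sums can be computed at once. The sketch below uses the built-in mtcars dataframe as a stand-in, since the mens_health.csv file is not assumed to be available.

```r
# Check which variables are numeric before summing (mtcars is a stand-in here)
numeric_cols <- sapply(mtcars, is.numeric)
numeric_cols

# colSums() adds up every selected variable in one call
totals <- colSums(mtcars[, numeric_cols, drop = FALSE])
totals["cyl"]
```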