This first project will introduce you to creating RMarkdown files and doing basic data manipulation with R and RStudio.
In a single r chunk, perform two operations: first add the numbers 9 and 23, then divide the number 42 by the sum of 83 and 101. State both answers in complete sentences beneath your r chunk, with your second answer rounded to three decimal places.
# 9 plus 23
9 + 23
## [1] 32
# 42 divided by the sum of 83 and 101
42 / (83 + 101)
## [1] 0.2282609
Nine plus twenty-three equals 32. Forty-two divided by the sum of 83 and 101 is 0.228.
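The rounding in the answer above can also be done in R rather than by hand. The sketch below is one way to do it with round(); the object names `quotient` and `rounded` are illustrative and are not part of the assignment.

```r
# Compute the quotient once and store it (the name `quotient` is illustrative).
quotient <- 42 / (83 + 101)

# round() takes the value and the number of decimal places to keep.
rounded <- round(quotient, 3)
rounded
```

Storing the unrounded value first keeps full precision available in case it is needed in a later step.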
In an r chunk below, store the dataframe mtcars into an object called mtcars and use the head() function to look at the first 6 rows of that object.
# Store mtcars dataset into our environment.
mtcars <- mtcars
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
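head() returns 6 rows by default, but it also accepts an `n` argument for a different number of rows, and tail() does the same from the bottom of the dataframe. A small sketch, with `first_three` and `last_three` as illustrative names:

```r
# Ask for only the first 3 rows instead of the default 6.
first_three <- head(mtcars, n = 3)

# tail() works the same way, counting from the end of the dataframe.
last_three <- tail(mtcars, n = 3)

nrow(first_three)
rownames(first_three)
```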
Use the ?mtcars command in the RStudio Console to pull up the help documentation to find out what the qsec variable describes in the mtcars dataset. No r chunk is required here, just a complete sentence answer.
The numeric variable qsec gives each car’s 1/4 mile time, in seconds.
Using a package in R takes two steps: installing it once with install.packages() and then loading it with library(). Type install.packages("openintro") into the RStudio Console to install that package onto your computer. The install.packages() function requires a character input, so " " are needed around the package name.
The library() function accepts an unquoted package name, so you do not need " " around it.
Use an r chunk below to load the openintro package, which contains some datasets we will use in subsequent problems.
# Loading the `openintro` package into R.
library(openintro)
## Please visit openintro.org for free statistics materials
##
## Attaching package: 'openintro'
## The following objects are masked from 'package:datasets':
##
## cars, trees
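Because install.packages() only needs to run once per computer, it can be convenient to install a package only when it is missing. The sketch below assumes a hypothetical helper called `ensure_package` (not part of R or of this assignment) and demonstrates it on `stats`, which ships with R, so nothing is actually downloaded. For this assignment the argument would be "openintro".

```r
# Hypothetical helper: install a package only if it is missing, then load it.
ensure_package <- function(pkg) {
  # requireNamespace() returns FALSE when the package is not installed.
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # character.only = TRUE tells library() that `pkg` holds the name as a string.
  library(pkg, character.only = TRUE)
}

# `stats` is always installed, so this call only loads it.
ensure_package("stats")
```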
With the openintro library loaded, you can now access some datasets from it. In an r chunk, find the dimensions of the email dataset by using the dim() function on it.
# Finding the dimensions of the dataset `email`
dim(email)
## [1] 3921 21
The dataset email has 3921 rows (observations) and 21 columns (variables).
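dim() returns the row count first and the column count second. If only one of the two is needed, nrow() and ncol() return them individually. A brief sketch using the built-in mtcars dataframe, so it runs without any extra packages (the object names are illustrative):

```r
# dim() gives c(rows, columns); mtcars has 32 rows and 11 columns.
dims <- dim(mtcars)

# nrow() and ncol() return each piece on its own.
n_obs <- nrow(mtcars)
n_vars <- ncol(mtcars)

dims
n_obs
n_vars
```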
The $ sign is used to specify a certain variable of a dataframe. For example, the following will give me the sum of the cylinders variable in the mtcars dataframe.
# Sum of the cyl variable in mtcars dataframe
sum(mtcars$cyl)
## [1] 198
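The $ sign is one of two common ways to pull a single variable out of a dataframe. Double square brackets with the variable name in quotes do the same thing, which is handy when the name is stored in another object. A short sketch with illustrative object names:

```r
# Two equivalent ways to extract the cyl variable from mtcars.
by_dollar <- sum(mtcars$cyl)
by_brackets <- sum(mtcars[["cyl"]])

# [[ ]] also works when the variable name is stored as a string.
var_name <- "cyl"
by_name <- sum(mtcars[[var_name]])

by_dollar
by_brackets
by_name
```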
In the email dataframe, what is the sum of the values in the line_breaks variable? State your answer in a complete sentence beneath the chunk.
# Storing the dataset `email` into our environment.
email <- email
# Finding the sum of the values in the variable `line_breaks`
sum(email$line_breaks)
## [1] 904412
The sum of all line breaks in the ‘email’ dataset is 904,412.
In the email dataset, the first variable called spam has a value of 0 if the email was not spam and a value of 1 if the email was spam. You can use the table() function on the spam variable to see how many emails fell into the spam and non-spam categories.
# Finding out how many emails fell into the non-spam (`spam` value of 0) and spam (`spam` value of 1) categories.
table(email$spam)
##
## 0 1
## 3554 367
Of the 3921 observations in the email dataset, 367 were spam and 3554 were not spam.
In an r chunk, calculate what percent of the emails in this dataframe were considered spam. Then state your answer beneath the chunk in a complete sentence, rounded to the nearest tenth of a percent.
# Finding the overall proportion of spam by dividing the number of spam emails (367) by the total number of emails (3921)
367/3921
## [1] 0.09359857
Of all the observations in the dataframe email, 9.4% were spam.
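Typing the counts in by hand works, but the same percent can be computed directly from a 0/1 variable, which avoids copying errors. The sketch below uses a small made-up vector called `flags` (not the email data) to show the idea: the mean of a logical comparison is the proportion of TRUE values.

```r
# Made-up 0/1 indicator for illustration only: 1 = flagged, 0 = not flagged.
flags <- c(0, 0, 1, 0, 1, 0, 0, 0, 0, 1)

# Counts in each category, then the same counts as proportions.
counts <- table(flags)
shares <- prop.table(counts)

# Proportion of 1s, converted to a percent and rounded to one decimal place.
pct_flagged <- round(100 * mean(flags == 1), 1)

counts
shares
pct_flagged
```

Applied to this assignment, the same pattern would use email$spam in place of `flags`.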
We can use the subset() function to get subsets of a dataframe based on a given characteristic. For example, the code below will store a new dataframe that only contains the observations in which the word “dollar” or a $ sign was found in the email.
# Storing emails with "dollar" in them to an object called money
money <- subset(email, email$dollar > 0)
Use an r chunk to find the percent of emails in the money object that are spam.
# Finding how many emails in the `money` subset are spam by finding the value of the `spam` variable within that subset.
table(money$spam)
##
## 0 1
## 668 78
# Calculating the percentage of spam emails in the `money` subset.
78/746
## [1] 0.1045576
Of the 746 emails that contain the word “dollar” or a dollar sign, 78 (10.5%) are spam.
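The subset-then-percent steps can be chained without typing any counts by hand. The sketch below uses a tiny made-up dataframe called `toy` (not the email data) with the same two column names, so it runs without openintro. Inside subset(), a column can be referred to by name alone.

```r
# Made-up data for illustration only; column names mirror the email dataset.
toy <- data.frame(
  spam   = c(0, 1, 0, 0, 1, 0),
  dollar = c(0, 3, 1, 0, 0, 2)
)

# Keep only the rows where dollar is greater than 0.
with_dollar <- subset(toy, dollar > 0)

# Percent of the kept rows that are spam, rounded to one decimal place.
pct_spam <- round(100 * mean(with_dollar$spam == 1), 1)

nrow(with_dollar)
pct_spam
```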
Comparing your answers from #7 and #8, do dollar signs or the word “dollar” seem to be a key characteristic of emails that are spam? Explain why or why not.
Given the above data, the word “dollar” and/or dollar signs do not seem to be a key characteristic of spam emails. The overall incidence of spam among all observations (emails) is 9.4%, while the incidence of spam in emails containing the word “dollar” and/or dollar signs is 10.5%. These two percentages are comparable, so they do not indicate a meaningful association between spam and references to money within an email.
Type View(airquality) into your RStudio console (not in an r chunk) and you will see a tab open up with the dataset in it for viewing. The dataframe gives air quality measurements in NYC over a 153-day period in the year 1973. In the chunk below, I have stored the airquality dataframe into an object called airquality and returned the names of the variables.
# Store airquality dataframe in object called airquality
airquality <- airquality
# Find the names of the variables in the dataframe
names(airquality)
## [1] "Ozone" "Solar.R" "Wind" "Temp" "Month" "Day"
Type ?airquality into the RStudio console to look at the documentation for the dataset and find out what the variable names in the dataframe represent.
The chunk below tries to find the sum of the solar radiation in langleys (a unit of solar radiation), but there is a problem: it returns NA.
# Find sum of Solar.R variable
sum(airquality$Solar.R)
## [1] NA
If you type View(airquality) into your RStudio console again, you can see that there are some values in the Solar.R variable, but also some NA values, which the sum function cannot handle.
Type ?sum into your RStudio console and look at the help documentation for that function. You’ll see in the Usage section sum(..., na.rm = FALSE). The na.rm argument stands for “remove NAs”, and it defaults to FALSE, meaning “no, do not remove NAs”. If we want the sum function to ignore the NAs in a vector, we just need to set na.rm = TRUE as an additional argument inside the function.
Use an r chunk to find the sum of the solar radiation (the Solar.R variable) by ignoring NAs.
# Finding sum of Solar Radiation variable
sum(airquality$Solar.R, na.rm = TRUE)
## [1] 27146
The sum of the solar radiation measurements, ignoring NAs, is 27146 langleys.
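The effect of na.rm is easiest to see on a very small vector. The sketch below uses a made-up vector called `readings` and then counts how many values are missing in Solar.R with is.na(), which is a useful check before deciding to ignore NAs. The object names are illustrative.

```r
# Made-up vector with one missing value, for illustration only.
readings <- c(10, NA, 30)

# With the default na.rm = FALSE, a single NA makes the whole sum NA.
default_total <- sum(readings)

# With na.rm = TRUE, the NA is dropped before summing.
clean_total <- sum(readings, na.rm = TRUE)

# is.na() marks each missing value as TRUE, so summing it counts the NAs.
n_missing <- sum(is.na(airquality$Solar.R))

default_total
clean_total
n_missing
```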
Often we will need to load our own data into R for analysis. Data is commonly stored in a “.csv” file, which stands for “comma-separated values”. The usual way to load a .csv file in R is to read it with the read.csv() function and store the result in an object.
Example:
new_data <- read.csv("newdata.csv")
For this question:
# Storing `mens_health` into my environment
mens_health <- read.csv("C:/Users/jessi/OneDrive/Desktop/Statistics/Statistics/mens_health.csv")
# Finding the sum of the variable `PULSE` in the dataset `mens_health`
sum(mens_health$PULSE)
## [1] 2776
The sum of the variable PULSE is 2776.
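To try read.csv() without needing a particular file on disk, the sketch below first writes a tiny made-up dataframe to a temporary .csv file and then reads it back. The data, the object names, and the temporary file are all illustrative and are not the mens_health data.

```r
# Made-up data for illustration only.
toy <- data.frame(ID = 1:3, PULSE = c(60, 72, 68))

# Write it to a temporary .csv file; row.names = FALSE avoids an extra column.
csv_path <- tempfile(fileext = ".csv")
write.csv(toy, csv_path, row.names = FALSE)

# Read the file back in and store it in an object, as in the example above.
loaded <- read.csv(csv_path)

names(loaded)
sum(loaded$PULSE)
```

Checking names() right after loading is a quick way to confirm that the header row was read as variable names.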