This first project will introduce you to creating RMarkdown files and doing basic data manipulation with R and RStudio.
In a single r chunk, perform two operations: first add the numbers 9 and 23, then divide the number 42 by the sum of 83 and 101. State both answers in complete sentences beneath your r chunk, with your second answer rounded to three decimal places.
9 + 23
## [1] 32
42/( 83 + 101 )
## [1] 0.2282609
Nine plus twenty-three equals thirty-two.
Forty-two divided by the sum of eighty-three and one hundred one equals approximately 0.228.
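The rounding in the second answer can also be done in R rather than by hand. A minimal sketch using base R's round() function; the object names quotient and rounded are only illustrative and do not come from the assignment:

```r
# Divide 42 by the sum of 83 and 101, then round to three decimal places
quotient <- 42 / (83 + 101)
rounded <- round(quotient, 3)
print(rounded)
```

The second argument of round() is the number of decimal places to keep.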
In an r chunk below, store the dataframe mtcars into an object called mtcars and use the head() function to look at the first 6 rows of that object.
mtcars <- mtcars
head(mtcars , n = 6)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
Use the ?mtcars command in the RStudio Console to pull up the help documentation and find out what the qsec variable describes in the mtcars dataset, then write your answer in a complete sentence. (No r chunk is required here.)
In the ‘mtcars’ dataset, the variable ‘qsec’ represents the car's 1/4 mile time.
Loading a package into R requires the install.packages() and library() functions.
Type install.packages("openintro") into the RStudio Console to install that package onto your computer. The install.packages() function requires a character input, so quotation marks are needed around the package name.
The library() function accepts an unquoted package name, so quotation marks are not needed there.
Use an r chunk below to load the openintro library, which contains some datasets we will use in subsequent problems.
library(openintro)
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
With the openintro library loaded, you can now access some datasets from it. In another r chunk, find the dimensions of the email dataset by using the dim() function on it.
dim(email)
## [1] 3921 21
The $ sign is used to specify a certain variable of a dataframe. For example, the following will give me the sum of the cylinders variable in the mtcars dataframe.
# Sum of the cyl variable in mtcars dataframe
sum(mtcars$cyl)
## [1] 198
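As a side note, the $ operator is one of two common ways to pull a single variable out of a dataframe. The sketch below, which uses only the built-in mtcars data, shows that double square brackets return the same vector; the object names cyl_dollar and cyl_bracket are illustrative:

```r
# mtcars ships with base R (datasets package), so no extra packages are needed
cyl_dollar  <- mtcars$cyl        # column access with $
cyl_bracket <- mtcars[["cyl"]]   # equivalent access with [[ ]]

# Both forms return the same numeric vector, so the sums agree
print(identical(cyl_dollar, cyl_bracket))
print(sum(cyl_bracket))
```

The bracket form is handy when the variable name is stored in a character string, since $ needs the name typed out literally.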
In the email dataframe, what is the sum of the values in the line_breaks variable? State your answer in a complete sentence beneath the chunk.
sum(email$line_breaks)
## [1] 904412
The sum of the values in the ‘line_breaks’ variable is 904412.
In the email dataset, the first variable, called spam, has a value of 0 if the email was not spam and a value of 1 if the email was spam. You can use the table() function on the spam variable to see how many emails fell into the spam and non-spam categories.
table(email$spam)
##
## 0 1
## 3554 367
Using an r chunk, calculate what percent of the emails in this dataframe were considered spam. Then state your answer beneath the chunk in a complete sentence, rounded to the nearest tenth of a percent.
3554 + 367
## [1] 3921
(367/3921)*100
## [1] 9.359857
In the email dataframe, about 9.4% of the emails are spam.
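The percentage can also be computed from the table() counts without retyping the total. A minimal sketch, assuming the counts 3554 (not spam, coded 0) and 367 (spam, coded 1) printed by table(email$spam); the names spam_counts and spam_percent are illustrative:

```r
# Counts copied from the table(email$spam) output: 0 = not spam, 1 = spam
spam_counts <- c(not_spam = 3554, spam = 367)

# Convert counts to percentages of the total, rounded to one decimal place
spam_percent <- round(100 * spam_counts / sum(spam_counts), 1)
print(spam_percent)
```

With the openintro package loaded, prop.table(table(email$spam)) returns the same proportions directly from the data.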
We can use the subset() function to get subsets of a dataframe based on a given characteristic. For example, the code below will store a new dataframe that only contains the observations in which the word “dollar” or a $ sign was found in the email.
# Storing emails with "dollar" in them to an object called money
money <- subset(email, email$dollar > 0)
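To see what subset() is doing, it can help to compare it with logical indexing in square brackets. The sketch below uses the built-in mtcars data instead of email so that it runs without any packages; the object names are illustrative:

```r
# Keep only the four-cylinder cars, two equivalent ways
four_cyl_subset  <- subset(mtcars, cyl == 4)       # condition uses column names directly
four_cyl_bracket <- mtcars[mtcars$cyl == 4, ]      # logical row index, all columns

# Both approaches keep the same rows
print(nrow(four_cyl_subset))
print(identical(rownames(four_cyl_subset), rownames(four_cyl_bracket)))
```

One difference worth knowing: subset() drops rows where the condition evaluates to NA, while bracket indexing keeps them as rows of NAs.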
Use an r chunk to find the percent of emails in the money object that are spam.
table(money$spam)
##
## 0 1
## 668 78
668 + 78
## [1] 746
(78/746)*100
## [1] 10.45576
About 10.5% of the emails in the ‘money’ object are spam.
Comparing your answers from #7 and #8, do dollar signs or the word “dollar” seem to be a key characteristic of emails that are spam? Explain why or why not.
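No. Only 78 of the 746 emails in the ‘money’ object (about 10.5%) are spam, which is close to the 367 of 3921 emails (about 9.4%) that are spam in the full dataset, so mentioning dollars does not seem to be a key characteristic of spam.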
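The comparison can be made explicit by putting the two spam rates side by side. A minimal sketch, assuming the counts printed by table(email$spam) and table(money$spam); the object names are illustrative:

```r
# Spam rate in the full email dataframe: 367 spam out of 3921 emails
overall_spam_pct <- 100 * 367 / 3921

# Spam rate in the money subset: 78 spam out of 746 emails
money_spam_pct <- 100 * 78 / 746

# Show both rates rounded to the nearest tenth of a percent
comparison <- round(c(overall = overall_spam_pct, money = money_spam_pct), 1)
print(comparison)
```

The two rates differ by roughly one percentage point.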
Type View(airquality) into your RStudio console (not in an r chunk) and you will see a tab open up with the dataset in it for viewing. The dataframe gives air quality measurements in NYC over a 153-day period in the year 1973. In the chunk below, I have stored the airquality dataframe into an object called airquality and returned the names of the variables.
# Store airquality dataframe in object called airquality
airquality <- airquality
# Find the names of the variables in the dataframe
names(airquality)
## [1] "Ozone" "Solar.R" "Wind" "Temp" "Month" "Day"
Type ?airquality into the RStudio console to look at the
documentation for the dataset and find out what the variable names in
the dataframe represent.
The chunk below tries to find the sum of the solar radiation in langleys (a unit of solar radiation), but there is a problem: it returns NA.
# Find sum of Solar.R variable
sum(airquality$Solar.R)
## [1] NA
If you type View(airquality) into your RStudio console again, you can see that there are some values in the Solar.R variable, but also some NA values, which the sum function cannot handle.
Type ?sum into your RStudio console and look at the help documentation for that function. In the Usage section you'll see sum(..., na.rm = FALSE). The second argument, na.rm, stands for “remove NAs” and defaults to FALSE, meaning NAs are not removed. If we want the sum function to ignore the NAs in a vector, we just need to set na.rm = TRUE as a second argument inside the function.
Use an r chunk to find the sum of the solar radiation (the Solar.R variable) by ignoring NAs.
sum(airquality$Solar.R , na.rm = TRUE)
## [1] 27146
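Before dropping NAs it is often useful to know how many there are. The sketch below, using the built-in airquality data, counts the missing values with is.na() and shows that mean() accepts the same na.rm argument; the object names are illustrative:

```r
solar <- airquality$Solar.R

# Count how many values are missing before deciding to ignore them
n_missing <- sum(is.na(solar))

# Sum and mean with the missing values removed
solar_total <- sum(solar, na.rm = TRUE)
solar_mean  <- mean(solar, na.rm = TRUE)

print(n_missing)
print(solar_total)
print(solar_mean)
```

Note that the mean is taken over the non-missing values only, not over all 153 days.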
Often we will need to load our own data into R for analysis. A common file type for storing data is the “.csv” file, short for “comma separated values”. The common way to load a .csv file in R is to read it with the read.csv() function and store the result in an object.
Example:
new_data <- read.csv("newdata.csv")
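Since newdata.csv is only a placeholder name, here is a self-contained sketch that writes a tiny file to a temporary location and reads it back. The file contents and column names are hypothetical and exist only to make the example runnable:

```r
# Write a tiny example file to a temporary location (hypothetical data)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,height", "1,70.8", "2,66.2", "3,71.7"), csv_path)

# Read it back into a data frame and inspect its structure
new_data <- read.csv(csv_path)
str(new_data)
```

read.csv() treats the first line as column names by default, so the result here has three rows and two columns.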
For this question, use the str() function.
menshealth <- read.csv("C:/Users/xi2yk/Downloads/mens_health (1).csv")
str(menshealth)
## 'data.frame': 40 obs. of 14 variables:
## $ MALE : int 1391 2129 2489 2490 2738 2988 2989 3346 3606 3607 ...
## $ AGE : int 58 22 32 31 28 46 41 56 20 54 ...
## $ HT : num 70.8 66.2 71.7 68.7 67.6 69.2 66.5 67.2 68.3 65.6 ...
## $ WT : num 169 144 179 176 153 ...
## $ WAIST: num 90.6 78.1 96.5 87.7 87.1 ...
## $ PULSE: int 68 64 88 72 64 72 60 88 76 60 ...
## $ SYS : int 125 107 126 110 110 107 113 126 137 110 ...
## $ DIAS : int 78 54 81 68 66 83 71 72 85 71 ...
## $ CHOL : int 522 127 740 49 230 316 590 466 121 578 ...
## $ BMI : num 23.8 23.2 24.6 26.2 23.5 24.5 21.5 31.4 26.4 22.7 ...
## $ LEG : num 42.5 40.2 44.4 42.8 40 47.3 43.4 40.1 42.1 36 ...
## $ ELBOW: num 7.7 7.6 7.3 7.5 7.1 7.1 6.5 7.5 7.5 6.9 ...
## $ WRIST: num 6.4 6.2 5.8 5.9 6 5.8 5.2 5.6 5.5 5.5 ...
## $ ARM : num 31.9 31 32.7 33.4 30.1 30.5 27.6 38 32 29.3 ...
sum(menshealth$num)
## [1] 0
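The 0 returned above deserves a second look: the str() output lists no variable named num (num is the type label R prints for numeric columns), and asking a dataframe for a variable that does not exist returns NULL, whose sum is 0. The sketch below reproduces this with a small hypothetical dataframe standing in for menshealth; which variable the question intended is not stated here, so AGE is only an example:

```r
# Hypothetical two-column dataframe standing in for menshealth
df <- data.frame(AGE = c(58, 22, 32), HT = c(70.8, 66.2, 71.7))

# No column is named num, so $ returns NULL rather than raising an error
print(is.null(df$num))

# The sum of NULL is 0, which hides the mistake
print(sum(df$num))

# Summing a column that actually exists works as expected
print(sum(df$AGE))
```

Checking names() of the dataframe first is a quick way to avoid this kind of silent zero.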