Project #2 - Introduction to Data

Instructions

Update the author line at the top to have your name in it.
You must knit this document to an HTML file and publish it to RPubs. Once you have published your project to the web, you must copy the URL into the appropriate Course Project assignment in MyOpenMath before 9:00am on the due date.
Answer all the following questions completely. Some may ask for written responses.
Use R chunks for code to be evaluated where needed and always comment all of your code so the reader can understand what your code aims to accomplish.
Proofread your knitted document before publishing it to ensure it looks the way you want it to. Use two spaces at the end of a line to create a line break, and make sure no text is formatted as a header unless it is supposed to be one.

Purpose

This project will demonstrate your ability to do exploratory data analysis on single variables of data in R and RStudio. The entire project will use the NSCC Student Dataset, which you will download and load into R in question 1.

Question 1

Download the “nscc_student_data.csv” file from MyOpenMath and use the read.csv() function to store it into an object called “nscc_student_data”. Print the first few lines of the data set using the head() function. Print a summary of the data set using the str() function.

# Load dataset in and store as object "nscc_student_data"
nscc_student_data <- read.csv("nscc_student_data.csv")

# Preview of dataframe
head(nscc_student_data)

##   Gender PulseRate CoinFlip1 CoinFlip2 Height ShoeLength Age Siblings
## 1 Female        64         5         5     62      11.00  19        4
## 2 Female        75         4         6     62      11.00  21        3
## 3 Female        74         6         1     60      10.00  25        2
## 4 Female        65         4         4     62      10.75  19        1
## 5 Female        NA        NA        NA     66         NA  26        6
## 6 Female        72         6         5     67       9.75  21        1
##   RandomNum HoursWorking Credits    Birthday ProfsAge Coffee VoterReg
## 1       797           35      13      July 5       31     No      Yes
## 2       749           25      12 December 27       30    Yes      Yes
## 3        13           30       6  January 31       29    Yes       No
## 4       613           18       9        6-13       31    Yes      Yes
## 5        53           24      15       02-15       32     No      Yes
## 6       836           15       9    april 14       32     No      Yes

# Summary of dataframe with str()
str(nscc_student_data)

## 'data.frame':    40 obs. of  15 variables:
##  $ Gender      : Factor w/ 2 levels "Female","Male": 1 1 1 1 1 1 2 2 1 2 ...
##  $ PulseRate   : int  64 75 74 65 NA 72 72 60 66 60 ...
##  $ CoinFlip1   : int  5 4 6 4 NA 6 6 3 7 6 ...
##  $ CoinFlip2   : int  5 6 1 4 NA 5 6 5 8 5 ...
##  $ Height      : num  62 62 60 62 66 ...
##  $ ShoeLength  : num  11 11 10 10.8 NA ...
##  $ Age         : int  19 21 25 19 26 21 19 24 24 20 ...
##  $ Siblings    : int  4 3 2 1 6 1 2 2 3 1 ...
##  $ RandomNum   : int  797 749 13 613 53 836 423 16 12 543 ...
##  $ HoursWorking: int  35 25 30 18 24 15 20 0 40 30 ...
##  $ Credits     : int  13 12 6 9 15 9 15 15 13 16 ...
##  $ Birthday    : Factor w/ 40 levels "02-15","03.14.1984",..: 32 25 30 18 1 21 19 27 35 31 ...
##  $ ProfsAge    : int  31 30 29 31 32 32 28 28 31 28 ...
##  $ Coffee      : Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 2 2 2 1 ...
##  $ VoterReg    : Factor w/ 2 levels "No","Yes": 2 2 1 2 2 2 2 2 2 1 ...

Question 2

a.) What are the dimensions of the nscc_student_data dataframe?

#Finding the dimensions of nscc_student_data using the dim() function
dim(nscc_student_data)

## [1] 40 15

The nscc_student_data dataframe has 40 rows and 15 columns.

b.) The chunk of code below will tell you how many values in the PulseRate variable exist (FALSE) and how many are NA (TRUE). How many values in the variable are missing?

# How many values in PulseRate variable are missing
table(is.na(nscc_student_data$PulseRate))

## 
## FALSE  TRUE 
##    38     2

There are 2 missing values in PulseRate. These are the values counted under “TRUE” in the table; the other 38 values are present.
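As a minimal sketch of the same idea, the missing values can also be counted directly with sum(), since TRUE counts as 1. The vector `pulse` below is a made-up stand-in for the PulseRate column, not the actual NSCC data.

```r
# Hypothetical stand-in for the PulseRate column (not the real NSCC data)
pulse <- c(64, 75, 74, 65, NA, 72, NA, 60)

# is.na() returns TRUE for each missing value; sum() counts the TRUEs
n_missing <- sum(is.na(pulse))

# Negating the test counts the values that are present
n_present <- sum(!is.na(pulse))

n_missing
n_present
```

This gives the two counts as plain numbers, which is convenient when they are needed in a later calculation.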

Question 3

What is the mean of the pulse rate variable? What is the median of the pulse rate variable? Do they differ by much? If yes, explain why and which would be a better choice as the “center” or “average” of this variable.

#Finding the mean of the pulse rate variable, excluding NA values
mean(nscc_student_data$PulseRate, na.rm=TRUE)

## [1] 73.47368

#Finding the median of the pulse rate variable, excluding NA values
median(nscc_student_data$PulseRate, na.rm = TRUE)

## [1] 70.5

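The mean of the pulse rate variable is approximately 73.47, while the median is 70.50. Both values exclude the two NA values recorded in the data. They do not differ by much: the mean is only about 3 beats higher than the median. The mean being slightly higher suggests that a few high pulse rates pull it upward, but because the gap is small, either measure is a reasonable “center” for this variable.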

Question 4

What is the sample standard deviation of the pulse rate variable?

#Finding the standard deviation of the pulse rate variable
sd(nscc_student_data$PulseRate, na.rm = TRUE)

## [1] 12.51105

The sample standard deviation of the pulse rate variable is approximately 12.51, with the NA values excluded.
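To show what sd() computes, here is a small sketch that calculates the sample standard deviation by hand, dividing by n - 1, and compares it with sd(). The vector `x` holds made-up values, not the NSCC data.

```r
# Hypothetical values, not the NSCC data
x <- c(64, 75, 74, 65, NA, 72)

# Drop missing values first, which is what na.rm = TRUE does inside sd()
x_obs <- x[!is.na(x)]
n <- length(x_obs)

# Sample standard deviation: sum of squared deviations divided by n - 1
sd_by_hand <- sqrt(sum((x_obs - mean(x_obs))^2) / (n - 1))

# Built-in result for comparison
sd_builtin <- sd(x, na.rm = TRUE)

sd_by_hand
sd_builtin
```

The two results should agree, which confirms that sd() returns the sample (n - 1) version rather than the population version.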

Question 5

What is the Five Number Summary and IQR of the pulse rate variable? Based on the definition of outliers being more than 1.5 IQRs above Q3 or below Q1, are there any values in the pulse rate variable that are considered outliers?

# Five Number Summary of the PulseRate variable
summary(nscc_student_data$PulseRate)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   50.00   64.25   70.50   73.47   83.75   98.00       2

The Five Number Summary of the PulseRate variable shows that: the minimum value is 50.00, the first quartile is 64.25, the median is 70.50, the third quartile is 83.75, and the maximum value is 98.00. There are two NA’s in this data.

#Finding IQR of the PulseRate variable
IQR(nscc_student_data$PulseRate, na.rm = TRUE)

## [1] 19.5

The IQR of the PulseRate variable is 19.5.

#Calculating 1.5 IQRs to evaluate outliers
1.5*19.5

## [1] 29.25

#Calculating which values are outliers using 1st and 3rd Quartiles
64.25-29.25

## [1] 35

83.75+29.25

## [1] 113

Any values below 35 or above 113 would be considered outliers. Since the minimum pulse rate is 50 and the maximum is 98, every value falls within those limits, so there are no outliers.
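The outlier limits above were typed in by hand from the summary output. A sketch of the same calculation done programmatically is shown below, so the numbers update if the data change. The vector `pulse` is invented for illustration and deliberately includes one extreme value.

```r
# Hypothetical pulse rates, including one deliberately extreme value (130)
pulse <- c(50, 60, 64, 66, 70, 71, 72, 75, 84, 98, 130, NA)

# Quartiles and IQR, ignoring missing values
q1 <- quantile(pulse, 0.25, na.rm = TRUE)
q3 <- quantile(pulse, 0.75, na.rm = TRUE)
iqr <- IQR(pulse, na.rm = TRUE)

# Limits located 1.5 IQRs below Q1 and above Q3
lower_limit <- q1 - 1.5 * iqr
upper_limit <- q3 + 1.5 * iqr

# Keep the observed values that fall outside either limit
outliers <- pulse[!is.na(pulse) & (pulse < lower_limit | pulse > upper_limit)]
outliers
```

If `outliers` comes back empty, no value meets the 1.5 IQR definition, which is the situation found for the real PulseRate variable.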

Question 6

The Gender variable gives whether students identified as male or female. Create a table and a barplot of that variable.

# Table of results of Gender variable
table(nscc_student_data$Gender)

## 
## Female   Male 
##     27     13

# Barplot of results
barplot(table(nscc_student_data$Gender))

Question 7

Split the dataframe into two subsets – one that has all the males and another that has all the females. Store them into objects called “NSCC_males” and “NSCC_females”. The first one has been done for you as a template.

# Create males subset
NSCC_males <- subset(nscc_student_data, nscc_student_data$Gender == "Male")

# Create females subset
NSCC_females <- subset(nscc_student_data, nscc_student_data$Gender == "Female")

Question 8

What is the Five Number Summary for the pulse rate variable for each of the male and female subsets?

# Five Number Summary of males subset
summary(NSCC_males$PulseRate)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   50.00   60.00   71.00   70.85   80.00   96.00

# Five Number Summary of females subset
summary(NSCC_females$PulseRate)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   56.00   65.00   70.00   74.84   88.00   98.00       2

The Five Number Summaries of the pulse rate variable for each subset are:

Males: Minimum Value=50.00, 1st Quartile=60.00, Median=71.00, 3rd Quartile=80.00, Maximum Value=96.00

Females: Minimum Value=56.00, 1st Quartile=65.00, Median=70.00, 3rd Quartile=88.00, Maximum Value=98.00

The two NA values are both in the female subset; the male subset has no missing pulse rates.
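As an alternative sketch, the summaries for both groups can be produced in one step with tapply(), without creating separate subsets first. The data frame `students` below is a made-up miniature example, not the NSCC data.

```r
# Hypothetical miniature data frame (values are made up)
students <- data.frame(
  Gender = c("Female", "Female", "Female", "Male", "Male", "Male"),
  PulseRate = c(64, 75, NA, 72, 60, 80)
)

# Apply summary() to PulseRate separately for each level of Gender
by_gender <- tapply(students$PulseRate, students$Gender, summary)

by_gender
```

Each element of `by_gender` is the usual summary() output for one group, including an NA count for any group that has missing values.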

Question 9

Create side-by-side boxplots for the pulse rate variable for each of the male and female subsets. Is there any noticeable difference between the spread of the variables?

# Create side-by-side boxplots for each subset
boxplot(NSCC_females$PulseRate, NSCC_males$PulseRate, names = c("Females","Males"))

Yes, but the differences are modest. Judged by the IQR, the female pulse rates are slightly more spread out (88 - 65 = 23) than the male pulse rates (80 - 60 = 20), while the males have the slightly larger overall range (96 - 50 = 46 versus 98 - 56 = 42). The medians are nearly identical (71 for males, 70 for females), but the first and third quartiles are both higher for females, as are the minimum and maximum. Overall, even though the median pulse rate is about the same for both genders, the upper half of the female pulse rates extends higher than that of the males.
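A sketch of another way to draw the comparison is the formula interface of boxplot(), which works from a single data frame, together with tapply() to put numbers on the spread of each group. The data frame `students` is a made-up example; its values were chosen for illustration only.

```r
# Hypothetical miniature data frame (values are made up)
students <- data.frame(
  Gender = c("Female", "Female", "Female", "Female",
             "Male", "Male", "Male", "Male"),
  PulseRate = c(56, 65, 70, 98, 50, 60, 71, 96)
)

# Formula interface: one box per level of Gender, from one data frame
boxplot(PulseRate ~ Gender, data = students, ylab = "Pulse rate")

# IQR of PulseRate within each group
iqr_by_gender <- tapply(students$PulseRate, students$Gender, IQR, na.rm = TRUE)

# Range (maximum minus minimum) of PulseRate within each group
range_by_gender <- tapply(students$PulseRate, students$Gender,
                          function(x) diff(range(x, na.rm = TRUE)))

iqr_by_gender
range_by_gender
```

Reporting the IQR and range alongside the plot makes the comparison of spread less dependent on reading box heights by eye.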

Question 10

Create a frequency distribution for how many males and females answered “Yes” or “No” to the variable “Coffee” by using the table() function. What percent of this sample of NSCC students drink coffee? Is there any noticeable difference in coffee drinking based on gender?

# Male Coffee Drinkers
table(NSCC_males$Coffee)

## 
##  No Yes 
##   3  10

# Females Coffee Drinkers
table(NSCC_females$Coffee)

## 
##  No Yes 
##   7  20

# Percent of Males that Drink Coffee
(10/13)*100

## [1] 76.92308

# Percent of Females that Drink Coffee
(20/27)*100

## [1] 74.07407

#Total percent of students who drink coffee regardless of gender
(30/40)*100

## [1] 75

Among the sample of students at NSCC, about 76.92% of males (10 of 13) drink coffee, and about 74.07% of females (20 of 27) drink coffee. Overall, 75% of students in the sample (30 of 40), regardless of gender, drink coffee. Because the two percentages differ by less than 3 percentage points, there is no noticeable difference in coffee drinking based on gender.
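The percentages above were computed by typing the counts in by hand. As a sketch of a less error-prone approach, a two-way table combined with prop.table() gives the same kind of result directly from the data. The data frame `students` is a made-up example, not the NSCC data.

```r
# Hypothetical miniature data frame (values are made up)
students <- data.frame(
  Gender = c("Female", "Female", "Female", "Female",
             "Male", "Male", "Male", "Male"),
  Coffee = c("Yes", "Yes", "Yes", "No", "Yes", "Yes", "Yes", "No")
)

# Two-way frequency table: rows are Gender, columns are Coffee
coffee_by_gender <- table(students$Gender, students$Coffee)

# margin = 1 converts each row to proportions, so each gender sums to 100%
pct_by_gender <- prop.table(coffee_by_gender, margin = 1) * 100

# Overall percentage of coffee drinkers, ignoring gender
pct_overall <- prop.table(table(students$Coffee)) * 100

coffee_by_gender
pct_by_gender
pct_overall
```

With the real data, the same calls on nscc_student_data$Gender and nscc_student_data$Coffee would replace the hand-typed fractions.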