Project #2 - Introduction to Data

Instructions

Update the author line at the top to have your name in it.
You must knit this document to an html file and publish it to RPubs. Once you have published your project to the web, you must copy the web url link into the appropriate Course Project assignment in MyOpenMath before 11:59pm on the due date.
Answer all the following questions completely. Some may ask for written responses.
Use R chunks for code to be evaluated where needed and always comment all of your code so the reader can understand what your code aims to accomplish.
Proofread your knitted document before publishing it to ensure it looks the way you want it to. Tip: Use double spaces at the end of a line to create a line break, and make sure no text carries a header label it isn’t supposed to have.

Purpose

This project will demonstrate your ability to do exploratory data analysis on single variables of data in R and RStudio. The entire project will use the NSCC Student Dataset, which you will need to load into R in question 1.

Question 1

Download the “nscc_student_data.csv” file from MyOpenMath and use the read.csv() function to store it into an object called “nscc_student_data”. Print the first few lines of the dataset using the head() function. Also print the structure of the dataset using the str() function.

# Load dataset in and store as object "nscc_student_data"
nscc_student_data <- read.csv("C:/Users/jperry23/Desktop/nscc_student_data.csv")

# Preview first 6 lines of dataset
head(nscc_student_data)

##   Gender PulseRate CoinFlip1 CoinFlip2 Height ShoeLength Age Siblings
## 1 Female        64         5         5     62      11.00  19        4
## 2 Female        75         4         6     62      11.00  21        3
## 3 Female        74         6         1     60      10.00  25        2
## 4 Female        65         4         4     62      10.75  19        1
## 5 Female        NA        NA        NA     66         NA  26        6
## 6 Female        72         6         5     67       9.75  21        1
##   RandomNum HoursWorking Credits Birthday ProfsAge Coffee VoterReg
## 1       797           35      13    5-Jul       31     No      Yes
## 2       749           25      12   27-Dec       30    Yes      Yes
## 3        13           30       6   31-Jan       29    Yes       No
## 4       613           18       9   13-Jun       31    Yes      Yes
## 5        53           24      15   15-Feb       32     No      Yes
## 6       836           15       9   14-Apr       32     No      Yes

# Structure of dataset
str(nscc_student_data)

## 'data.frame':    40 obs. of  15 variables:
##  $ Gender      : Factor w/ 2 levels "Female","Male": 1 1 1 1 1 1 2 2 1 2 ...
##  $ PulseRate   : int  64 75 74 65 NA 72 72 60 66 60 ...
##  $ CoinFlip1   : int  5 4 6 4 NA 6 6 3 7 6 ...
##  $ CoinFlip2   : int  5 6 1 4 NA 5 6 5 8 5 ...
##  $ Height      : num  62 62 60 62 66 ...
##  $ ShoeLength  : num  11 11 10 10.8 NA ...
##  $ Age         : int  19 21 25 19 26 21 19 24 24 20 ...
##  $ Siblings    : int  4 3 2 1 6 1 2 2 3 1 ...
##  $ RandomNum   : int  797 749 13 613 53 836 423 16 12 543 ...
##  $ HoursWorking: int  35 25 30 18 24 15 20 0 40 30 ...
##  $ Credits     : int  13 12 6 9 15 9 15 15 13 16 ...
##  $ Birthday    : Factor w/ 39 levels "03.14.1984","11-Jul",..: 28 22 25 5 11 8 15 13 23 19 ...
##  $ ProfsAge    : int  31 30 29 31 32 32 28 28 31 28 ...
##  $ Coffee      : Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 2 2 2 1 ...
##  $ VoterReg    : Factor w/ 2 levels "No","Yes": 2 2 1 2 2 2 2 2 2 1 ...

The two outputs above show the first 6 rows of the dataset and the structure of the dataset itself: 40 observations of 15 variables.
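As a small sketch of the same loading steps, the chunk below reads a made-up three-row excerpt typed inline, so it runs without the CSV file. The object names `csv_text` and `toy` and the values are illustrative assumptions, not part of the NSCC dataset. Since R 4.0.0, read.csv() keeps text columns as character by default, so `stringsAsFactors = TRUE` is passed here to get factor columns like the ones in the str() output above.

```r
# Made-up three-row excerpt typed inline (not the real NSCC file)
csv_text <- "Gender,PulseRate,Coffee
Female,64,No
Female,NA,Yes
Male,72,Yes"

# Read the text as if it were a CSV file; text columns become factors
toy <- read.csv(text = csv_text, stringsAsFactors = TRUE)

# Preview the rows, then inspect column types and dimensions
head(toy)
str(toy)
dim(toy)
```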

Question 2

a.) What are the dimensions of the nscc_student_data dataframe?

# Find the dimensions of the nscc_student_data dataframe
dim(nscc_student_data)

## [1] 40 15

The dimensions of the above dataframe are 40x15, with 40 different observations over 15 different variables.

b.) The chunk of code below will tell you how many values in the PulseRate variable exist (FALSE) and how many are NA (TRUE). How many values in the variable are missing?

# How many values in PulseRate variable are missing
table(is.na(nscc_student_data$PulseRate))

## 
## FALSE  TRUE 
##    38     2

The above table shows that 2 of the 40 PulseRate values are missing (NA). The is.na() function returns TRUE for each missing value and FALSE for each recorded value, so the table counts 38 recorded values and 2 missing ones.
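A minimal sketch of two other ways to count missing values, using made-up numbers rather than the NSCC data (the names `pulse` and `toy` are illustrative assumptions). Because TRUE counts as 1, sum() gives the number of NAs directly, and colSums() does the same for every column of a dataframe at once.

```r
# Made-up pulse rates with two missing values
pulse <- c(64, 75, NA, 65, NA, 72)

# Number of missing values (TRUE is counted as 1)
sum(is.na(pulse))

# Proportion of values that are missing
mean(is.na(pulse))

# Missing-value count for every column of a small made-up dataframe
toy <- data.frame(pulse = pulse, age = c(19, 21, 25, NA, 26, 21))
colSums(is.na(toy))
```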

Question 3

Use an r chunk to calculate the mean, median, and sample standard deviation of the pulse rate variable. Do the mean and median differ by much? If yes, explain why and which would be a better choice as the “center” or “average” of this variable.

# Mean, median, and sample standard deviation of PulseRate, ignoring missing values
mean(nscc_student_data$PulseRate, na.rm = TRUE)

## [1] 73.47368

median(nscc_student_data$PulseRate, na.rm = TRUE)

## [1] 70.5

sd(nscc_student_data$PulseRate, na.rm = TRUE)

## [1] 12.51105

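The above output shows the mean (73.5), the median (70.5), and the sample standard deviation (12.5). The mean and median differ by about 3, which is small compared with the standard deviation. The mean being slightly above the median suggests a mild right skew rather than a perfectly symmetric distribution, but because the variable has no outliers, the mean is still a reasonable choice for the “center” of this variable.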
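To illustrate why outliers matter when choosing between the mean and the median, here is a small sketch with made-up pulse rates (not the NSCC data; `pulse` and `pulse_out` are illustrative names). Adding one extreme value moves the mean a lot and the median only a little.

```r
# Made-up, symmetric pulse rates: mean and median agree
pulse <- c(60, 64, 68, 70, 72, 76, 80)
mean(pulse)    # 70
median(pulse)  # 70

# Add one extreme value and compare again
pulse_out <- c(pulse, 150)
mean(pulse_out)    # 80
median(pulse_out)  # 71
```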

Question 4

Use an r chunk to calculate the Five Number Summary and IQR of the pulse rate variable. Clearly state your answer for each below the r chunk. Based on the definition of outliers being more than 1.5 IQRs below Q1 or above Q3, are there any values in the pulse rate variable that are considered outliers? Clearly state below the thresholds for a data value to be considered an outlier.

# Five Number Summary (plus the mean); summary() reports the NA count itself, so na.rm is not needed
summary(nscc_student_data$PulseRate)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   50.00   64.25   70.50   73.47   83.75   98.00       2

IQR(nscc_student_data$PulseRate, na.rm = TRUE)

## [1] 19.5

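# First and third quartiles of PulseRate, ignoring missing values
lowerq <- quantile(nscc_student_data$PulseRate, 0.25, na.rm = TRUE)
upperq <- quantile(nscc_student_data$PulseRate, 0.75, na.rm = TRUE)
iqr <- upperq - lowerq
# Outlier thresholds: 1.5 IQRs above Q3 and 1.5 IQRs below Q1
mild.threshold.upper <- upperq + (iqr * 1.5)
mild.threshold.lower <- lowerq - (iqr * 1.5)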

mild.threshold.lower

## 25% 
##  35

mild.threshold.upper

## 75% 
## 113

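The Five Number Summary of the PulseRate variable is: minimum 50.00, first quartile 64.25, median 70.50, third quartile 83.75, maximum 98.00. The 73.47 in the summary() output is the mean, which is not part of the Five Number Summary.
The IQR of the PulseRate variable is: 19.5

Any data value below 35 (64.25 - 1.5 x 19.5) or above 113 (83.75 + 1.5 x 19.5) would be considered an outlier. The observed pulse rates range from 50 to 98, so there are no outliers.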
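The threshold arithmetic can also be wrapped in a small helper. The sketch below is one possible way to do it; `flag_outliers` is a hypothetical function written for this illustration (not a base R function), and the pulse rates are made up.

```r
# Hypothetical helper: returns the 1.5 * IQR thresholds and any values outside them
flag_outliers <- function(x, k = 1.5) {
  # Q1 and Q3, ignoring missing values
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  fence <- k * (q[2] - q[1])
  lower <- q[1] - fence
  upper <- q[2] + fence
  list(lower = lower,
       upper = upper,
       outliers = x[!is.na(x) & (x < lower | x > upper)])
}

# Made-up pulse rates with one extreme value and one missing value
pulse <- c(50, 60, 64, 70, 72, 76, 84, 98, 140, NA)
res <- flag_outliers(pulse)
res$lower     # 34
res$upper     # 114
res$outliers  # 140
```

Here Q1 is 64 and Q3 is 84, so the IQR is 20 and the thresholds are 64 - 30 = 34 and 84 + 30 = 114; only 140 falls outside them.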

Question 5

The Gender variable gives whether students identified as male or female. Create a table and a barplot of that variable.

table(nscc_student_data$Gender)

## 
## Female   Male 
##     27     13

# Store the gender counts, then draw a barplot with the y-axis running from 0 to 30
Gender <- table(nscc_student_data$Gender)
barplot(Gender, ylim = c(0, 30))

The above barplot displays how many NSCC students identified as female and male, with 27 females and 13 males. The y-axis is scaled by 5 from 0 to 30.
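Counts can also be expressed as proportions. The sketch below rebuilds the gender variable from the counts reported above (27 female, 13 male) instead of reading the CSV file; the object names `gender`, `counts`, and `props` are illustrative assumptions.

```r
# Rebuild the Gender variable from the reported counts (27 female, 13 male)
gender <- rep(c("Female", "Male"), times = c(27, 13))

# Frequency table and the matching proportions
counts <- table(gender)
props <- prop.table(counts)

# Percentages rounded to one decimal place: Female 67.5, Male 32.5
round(100 * props, 1)

# Barplot of proportions instead of raw counts
barplot(props, ylim = c(0, 1), ylab = "Proportion of students")
```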

Question 6

Split the dataframe into two subsets – one that has all the males and another that has all the females. Store them into objects called “NSCC_males” and “NSCC_females”. The first one has been done for you as a template.

# Create males subset
NSCC_males <- subset(nscc_student_data, nscc_student_data$Gender == "Male")

# Create females subset
NSCC_females <- subset(nscc_student_data, nscc_student_data$Gender == "Female")

Each call to subset() creates a new dataframe that contains only the rows for one gender: NSCC_males holds the 13 male students and NSCC_females holds the 27 female students.

Use an r chunk below to generate the information which will give you the Five Number Summary for the pulse rate variable for each of the male and female subsets.

# Five Number Summary of males subset
summary(NSCC_males$PulseRate)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   50.00   60.00   71.00   70.85   80.00   96.00

# Five Number Summary of females subset
summary(NSCC_females$PulseRate)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   56.00   65.00   70.00   74.84   88.00   98.00       2

The Five Number Summary of each subset is:
Males: 50.00, 60.00, 71.00, 80.00, 96.00
Females: 56.00, 65.00, 70.00, 88.00, 98.00

From left to right, the numbers are the minimum, first quartile, median, third quartile, and maximum of the pulse rate for that group. The mean reported by summary() (70.85 for males, 74.84 for females) is not part of the Five Number Summary, and the 2 missing female values are excluded from the calculation.
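As an alternative to creating separate subsets, a grouped summary can be computed in one step. This is only a sketch on made-up data (`toy` and `by_gender` are illustrative names). Note that fivenum() uses Tukey's hinges and drops NAs by default, so its quartiles can differ slightly from the ones summary() reports.

```r
# Made-up data: three males and four females, one missing pulse rate
toy <- data.frame(
  Gender = c("Male", "Male", "Male", "Female", "Female", "Female", "Female"),
  PulseRate = c(60, 71, 80, 56, 70, NA, 88)
)

# Split PulseRate by Gender, then compute the five numbers for each group
by_gender <- lapply(split(toy$PulseRate, toy$Gender), fivenum)

by_gender[["Male"]]    # 60.0 65.5 71.0 75.5 80.0
by_gender[["Female"]]  # 56 63 70 79 88
```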

Question 7

Create side-by-side boxplots of the pulse rate variable for each of the male and female subsets. Is there any noticeable difference between the two subsets?

# Create side-by-side boxplots for each subset
boxplot(NSCC_males$PulseRate, NSCC_females$PulseRate, names=c("Males", "Females"))
title(xlab = "Gender")
title(ylab="Pulse Rate")

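The above side-by-side boxplot compares the pulse rates of NSCC males and NSCC females. The medians are nearly identical (71 for males, 70 for females), but the upper half of the female distribution sits higher: the third quartile is 88 for females versus 80 for males, and the female mean (74.84) is above the male mean (70.85). The two boxes overlap for most of their range, so the difference is noticeable but modest. No formal test was run, so this comparison says nothing about statistical significance.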
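Side-by-side boxplots can also be drawn straight from one dataframe with the formula interface, without creating subsets first. The sketch below uses made-up data (`toy` is an illustrative name); the object returned by boxplot() holds the statistics used to draw each box.

```r
# Made-up data with a grouping column and a numeric column
toy <- data.frame(
  Gender = c("Male", "Male", "Male", "Female", "Female", "Female", "Female"),
  PulseRate = c(60, 71, 80, 56, 70, NA, 88)
)

# One box per level of Gender; missing values are dropped automatically
bp <- boxplot(PulseRate ~ Gender, data = toy,
              xlab = "Gender", ylab = "Pulse Rate")

# Group labels and number of non-missing values per group
bp$names
bp$n
```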

Question 8

Create a frequency distribution for how many males and females answered “Yes” or “No” to the variable “Coffee” by using the table() function. What percent of this sample of NSCC students drink coffee? Is there any noticeable difference in coffee drinking based on gender?

# Male Coffee Drinkers
table(NSCC_males$Coffee)

## 
##  No Yes 
##   3  10

# Female Coffee Drinkers
table(NSCC_females$Coffee)

## 
##  No Yes 
##   7  20

# Proportion of males who drink coffee (10 of 13 males)
10/13

## [1] 0.7692308

# Proportion of females who drink coffee (20 of 27 females)
20/27

## [1] 0.7407407

# Proportion of all students who drink coffee (30 of 40 students)
30/40

## [1] 0.75

This frequency distribution shows that about 77% (10 out of 13) of NSCC males and about 74% (20 out of 27) of NSCC females in this sample drink coffee.

The percent of NSCC students from this sample who drink coffee is 75% (30 out of 40); the remaining 25% (10 out of 40) reported that they do not drink coffee. More of the coffee drinkers are female (20 versus 10), but that reflects the larger number of females in the sample. The rates within each gender differ by less than 3 percentage points, so there is no noticeable difference in coffee drinking based on gender.
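Comparing coffee drinking across genders means dividing by the size of each gender group. The sketch below rebuilds the two variables from the counts in the tables above (males: 3 No, 10 Yes; females: 7 No, 20 Yes) instead of reading the CSV file; the object names are illustrative assumptions. prop.table() with `margin = 1` divides each row by its own total.

```r
# Rebuild Gender and Coffee from the reported counts
gender <- rep(c("Male", "Female"), times = c(13, 27))
coffee <- c(rep(c("No", "Yes"), times = c(3, 10)),   # males
            rep(c("No", "Yes"), times = c(7, 20)))   # females

# Two-way frequency table of Gender by Coffee
coffee_by_gender <- table(Gender = gender, Coffee = coffee)
coffee_by_gender

# Row proportions: share of each gender answering No or Yes
row_props <- prop.table(coffee_by_gender, margin = 1)
round(100 * row_props, 1)

# Overall share of coffee drinkers in the sample
overall <- prop.table(table(coffee))
overall[["Yes"]]  # 0.75
```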