Practical Examination: R Programming for Statistics

ADDS-M03: Statistics for Data Science

Advanced Diploma in Data Science

National Institute of Business Management


Duration: 1.5 hours

Date: 2020-05-02


Total number of parts : 05

Each part carries 5 marks.


You need to answer 05 parts in total.


This is an R Markdown/HTML document. All questions are R related.

Write your answers/scripts in the cells and save this notebook in .Rmd format, or generate .html using the ‘Knit’ button.


Index Number: [COADDS192P-006]



Answer all questions in this Markdown itself.

Question 01

Using the provided “state” data, please answer the following questions:

Question 01-a

Compute the mean, trimmed mean, and median for the population using R: (A trimmed mean is widely used to avoid the influence of outliers.
For example, trimming the bottom and top 10% (a common choice) of the data will provide protection against outliers in all but the smallest data sets.) (1 Marks)

state <- read.csv("state.csv", header = TRUE)
mean(state$Population)
## [1] 6162876
mean(state$Population, trim = 0.1)
## [1] 4783697
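
The question also asks for the median, which the calls above do not compute; on the exam data that would be `median(state$Population)`. As a minimal sketch of how the three estimates differ, the following uses a small made-up vector (hypothetical values, not taken from state.csv) with two large values at the top:

```r
# Hypothetical values with a heavy upper tail (not from state.csv)
pop <- c(1, 2, 3, 4, 5, 6, 7, 8, 20, 100)

m   <- mean(pop)              # ordinary mean, pulled up by the large values
tm  <- mean(pop, trim = 0.1)  # drops the lowest and highest 10% before averaging
med <- median(pop)            # middle of the sorted values

print(c(mean = m, trimmed = tm, median = med))
# mean = 15.6, trimmed = 6.875, median = 5.5
```

The trimmed mean and the median both sit much closer to the bulk of the data than the ordinary mean, which matches the gap between the two results printed above for the state populations.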

Question 01-b

Display some percentiles of the murder rate, such as:
(10%, 25%, 50%, 75% and 90%) (2 Marks)

quantile(state$Murder.Rate, c(.1, .25, .5, .75, .9))
##   10%   25%   50%   75%   90% 
## 1.890 2.425 4.000 5.550 6.010

Question 01-c

Using R’s functions, compute standard deviation and interquartile range (IQR) (2 Marks)

sd(state$Population)
## [1] 6848235
sd(state$Murder.Rate)
## [1] 1.915736
IQR(state$Population)
## [1] 4847308
IQR(state$Murder.Rate)
## [1] 3.125
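
The IQR is the distance between the 75th and 25th percentiles, which is why the murder-rate IQR of 3.125 equals 5.550 - 2.425 from the percentile output in Question 01-b. A small sketch on made-up values (hypothetical, not from state.csv) that checks this relationship:

```r
# Hypothetical sample (not from state.csv)
x <- c(2, 4, 4, 5, 7, 9, 10, 12)

# 25th and 75th percentiles, using quantile()'s default method
q <- quantile(x, c(0.25, 0.75))

iqr_by_hand <- unname(q[2] - q[1])  # 75th percentile minus 25th percentile
iqr_builtin <- IQR(x)               # same quantity from the built-in function
spread      <- sd(x)                # sample standard deviation, for comparison

print(c(by_hand = iqr_by_hand, builtin = iqr_builtin, sd = spread))
```

Unlike the standard deviation, the IQR ignores the most extreme quarter of the data on each side, so it is less sensitive to outliers.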


Question 02

Whole Life Organic, Inc., produces high-quality organic frozen turkeys for distribution in organic food markets in the upper Midwest. The company has developed a range feeding program with organic grain supplements to produce their product. The mean weight of its frozen turkeys is 18 pounds with a variance of 4. Historical experience indicates that weights can be approximated by the normal probability distribution.

Question 02-a

  1. What percentage of the company’s turkey units will be below 16 pounds?
  2. What percentage of the company’s turkey units will be over 20 pounds?
  3. What percentage of the company’s turkey units will be below 20 pounds?
  4. What percentage of the company’s turkey units will be between 16 and 20 pounds? (3 Marks)
#I.
x <- pnorm(16, mean = 18, sd=sqrt(4), lower.tail = TRUE)
x
## [1] 0.1586553
#II.
pnorm(20, mean = 18, sd=sqrt(4), lower.tail = FALSE)
## [1] 0.1586553
#III.
y <- pnorm(20, mean = 18, sd=sqrt(4), lower.tail = TRUE)
y
## [1] 0.8413447
#IV.
y-x
## [1] 0.6826895
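
The question asks for percentages, while `pnorm` returns probabilities on a 0 to 1 scale. A minimal sketch that converts the four results to percentages and shows that standardizing to a z-score gives the same answer (the variable names are illustrative):

```r
mu    <- 18
sigma <- sqrt(4)  # variance is 4, so the standard deviation is 2

below_16 <- pnorm(16, mean = mu, sd = sigma)
over_20  <- pnorm(20, mean = mu, sd = sigma, lower.tail = FALSE)
below_20 <- pnorm(20, mean = mu, sd = sigma)
between  <- below_20 - below_16

# Same probability from the standard normal: z = (16 - 18) / 2 = -1
below_16_z <- pnorm((16 - mu) / sigma)

# Convert probabilities to percentages, rounded to two decimals
pct <- round(100 * c(below_16 = below_16, over_20 = over_20,
                     below_20 = below_20, between = between), 2)
print(pct)
```

These correspond to roughly 15.87%, 15.87%, 84.13% and 68.27%, consistent with the outputs above and with the familiar rule that about 68% of a normal distribution lies within one standard deviation of the mean.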

Question 02-b

Find the cutoff point for the top 15% of sales? (2 Marks)

# The question refers to sales, but only weights are described, so the cutoff is computed on weight.
qnorm(0.85, mean = 18, sd = sqrt(4))
## [1] 20.07287
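
`qnorm` is the inverse of `pnorm`, so the cutoff can be checked by confirming that 15% of the distribution lies above it. A short sketch using the same mean and variance as the question (the variable names are illustrative):

```r
mu    <- 18
sigma <- sqrt(4)

# Value with 85% of the distribution below it, i.e. 15% above it
cutoff <- qnorm(0.85, mean = mu, sd = sigma)

# Check: the upper-tail probability at the cutoff should be 0.15
upper_share <- pnorm(cutoff, mean = mu, sd = sigma, lower.tail = FALSE)

# Two equivalent ways to get the same cutoff
cutoff_upper <- qnorm(0.15, mean = mu, sd = sigma, lower.tail = FALSE)
cutoff_z     <- mu + sigma * qnorm(0.85)  # rescale the standard normal quantile

print(c(cutoff = cutoff, upper_share = upper_share))
```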


Question 03

A useful way to summarize two categorical variables is a contingency table. Using the provided lc_loans data, show the contingency table between the grade of a personal loan and the outcome of that loan.

Question 03-a

Use the table command and obtain the contingency table (2 Marks)

lc <- read.csv("lc_loans.csv", header = TRUE)
table(lc$status, lc$grade)
##              
##                   A     B     C     D     E     F     G
##   Charged Off  1562  5302  6023  5007  2842  1526   409
##   Current     50051 93852 88928 53281 24639  8444  1990
##   Fully Paid  20408 31160 23147 13681  5949  2328   643
##   Late          469  2056  2777  2308  1374   606   199

Question 03-b

Install the descr package, load it with library(descr), and use the CrossTable command to obtain the contingency table with counts and percentages.
Hint (prop.c=F, prop.chisq=F, prop.t=F) (3 Marks)

library(descr)
## Warning: package 'descr' was built under R version 3.6.3
CrossTable(lc$status, lc$grade, prop.c=F, prop.chisq=F, prop.t=F  )
##    Cell Contents 
## |-------------------------|
## |                       N | 
## |           N / Row Total | 
## |-------------------------|
## 
## ===============================================================================
##                lc$grade
## lc$status          A        B        C       D       E       F       G    Total
## -------------------------------------------------------------------------------
## Charged Off     1562     5302     6023    5007    2842    1526     409    22671
##                0.069    0.234    0.266   0.221   0.125   0.067   0.018    0.050
## -------------------------------------------------------------------------------
## Current        50051    93852    88928   53281   24639    8444    1990   321185
##                0.156    0.292    0.277   0.166   0.077   0.026   0.006    0.712
## -------------------------------------------------------------------------------
## Fully Paid     20408    31160    23147   13681    5949    2328     643    97316
##                0.210    0.320    0.238   0.141   0.061   0.024   0.007    0.216
## -------------------------------------------------------------------------------
## Late             469     2056     2777    2308    1374     606     199     9789
##                0.048    0.210    0.284   0.236   0.140   0.062   0.020    0.022
## -------------------------------------------------------------------------------
## Total          72490   132370   120875   74277   34804   12904    3241   450961
## ===============================================================================
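
The row proportions reported by `CrossTable` can also be reproduced in base R with `prop.table`. The sketch below rebuilds the table from the counts printed above rather than reading lc_loans.csv, so it runs on its own; the object names are illustrative:

```r
# Counts copied from the contingency table above (rows = status, columns = grade)
counts <- matrix(
  c( 1562,  5302,  6023,  5007,  2842, 1526,  409,
    50051, 93852, 88928, 53281, 24639, 8444, 1990,
    20408, 31160, 23147, 13681,  5949, 2328,  643,
      469,  2056,  2777,  2308,  1374,  606,  199),
  nrow = 4, byrow = TRUE,
  dimnames = list(status = c("Charged Off", "Current", "Fully Paid", "Late"),
                  grade  = LETTERS[1:7])
)
tab <- as.table(counts)

# Counts with row and column totals added
with_totals <- addmargins(tab)

# Proportion of each grade within a status (each row sums to 1)
row_props <- round(prop.table(tab, margin = 1), 3)

print(with_totals)
print(row_props)
```

Using `margin = 2` instead would give column proportions, which answer a different question: the share of each outcome within a grade.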


Question 04

Question 04-a

What is a box plot, and why use a box plot with a categorical variable? (1 Marks)

ANSWER: A boxplot summarizes data and shows the shape of its distribution. It makes it easier to compare a numeric variable across the levels of a categorical variable.

Question 04-b

Compare how the percentage of flight delays (pct_carrier_delay) varies across airlines using a box plot. (2 Marks) Hint [fill by airline and use a ylim of (0, 50)]

fly <- read.csv("airline_stats.csv", header = TRUE)
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.6.2
ggplot(data = fly) +
  geom_boxplot(mapping = aes(x = airline, y = pct_carrier_delay, fill = airline)) +
  ylim(0, 50) +
  labs(title = "Delays according to airline", x = "airline", y = "delay")
## Warning: Removed 38 rows containing non-finite values (stat_boxplot).
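
The warning above reports rows dropped from the plot. `ylim(0, 50)` contributes to this because it treats values outside the limits as missing before the box statistics are computed. If the intent is only to zoom in, `coord_cartesian(ylim = ...)` keeps every row. A sketch on simulated data (the `delays` data frame and its values are made up for illustration, not read from airline_stats.csv):

```r
library(ggplot2)

# Simulated delay percentages for two hypothetical airlines
set.seed(1)
delays <- data.frame(
  airline = rep(c("A", "B"), each = 100),
  pct_carrier_delay = c(rexp(100, rate = 1 / 8), rexp(100, rate = 1 / 15))
)

# How many simulated values fall above the upper limit
n_outside <- sum(delays$pct_carrier_delay > 50)

base_plot <- ggplot(delays, aes(x = airline, y = pct_carrier_delay, fill = airline)) +
  geom_boxplot()

# ylim() removes out-of-range values before computing medians and quartiles
p_clipped <- base_plot + ylim(0, 50)

# coord_cartesian() only changes the visible window; statistics use all rows
p_zoomed <- base_plot + coord_cartesian(ylim = c(0, 50))
```

Which version is appropriate depends on whether values above 50 should influence the boxes; the hint in the question asks for `ylim`, so the answer above follows it.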

Question 04-c

Compare how the density of flight delays (pct_carrier_delay) varies across airlines using a violin plot. (2 Marks)
Hint [fill by airline and use a ylim of (0, 50)]

ggplot(data = fly) +
  geom_violin(mapping = aes(x = airline, y = pct_carrier_delay, fill = airline)) +
  ylim(0, 50) +
  labs(title = "Delays according to airline", x = "airline", y = "delay")
## Warning: Removed 38 rows containing non-finite values (stat_ydensity).


Question 05

Using the provided “mpg” data, please answer the following questions:

Question 05-a

Which variables in mpg are categorical? Which variables are continuous? (1 Marks)

ANSWER: All variables can be considered as categorical, including cty and hwy.

Question 05-b

Make a scatterplot of cty versus cyl. (1 Marks)

mpg <- read.csv("mpgta.csv", header = TRUE)
# The question asks for cty versus cyl, so cyl goes on the x-axis and cty on the y-axis
plot(mpg$cyl, mpg$cty, xlab = "cyl", ylab = "cty", main = "cty vs. cyl")

Question 05-c

Find the correlation value between cty and cyl and comment on the value. (1 Marks)

cor(mpg$cty, mpg$cyl)
## [1] -0.8057714
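
A correlation of about -0.81 indicates a strong negative linear association: cars with more cylinders tend to have a lower cty value. The sketch below shows one way to look at the strength of such an estimate with `cor.test`. It uses the `mpg` data set that ships with ggplot2 as a stand-in, on the assumption that mpgta.csv has the same `cty` and `cyl` columns; `cars` is an illustrative name.

```r
# Stand-in data: ggplot2's built-in mpg (assumed to resemble mpgta.csv)
cars <- ggplot2::mpg

# Pearson correlation between city mileage and number of cylinders
r <- cor(cars$cty, cars$cyl)

# Test of zero correlation, with a 95% confidence interval for r
test <- cor.test(cars$cty, cars$cyl)

print(r)
print(test$conf.int)
```

Because cyl takes only a few distinct values, a rank-based measure such as `cor(cars$cty, cars$cyl, method = "spearman")` may be a reasonable alternative.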

Question 05-d

Map the colors of your points to the class variable to reveal the class of each car: (1 Marks)

## write your codes here
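
One possible approach, sketched here rather than taken from the submission, maps `class` to the colour aesthetic in ggplot2. It assumes the points in question are those of the cty versus cyl scatterplot from Question 05-b, and it uses ggplot2's built-in `mpg` data as a stand-in for mpgta.csv; `cars` and `p` are illustrative names.

```r
library(ggplot2)

# Stand-in data: ggplot2's built-in mpg (assumed to resemble mpgta.csv)
cars <- ggplot2::mpg

# Scatterplot of cty against cyl, with point colour mapped to the class of each car
p <- ggplot(cars, aes(x = cyl, y = cty, colour = class)) +
  geom_point() +
  labs(title = "cty vs. cyl by class", x = "cyl", y = "cty")

print(p)
```

Since many cars share the same cyl and cty values, points overlap; replacing `geom_point()` with `geom_jitter()` is one way to make the classes easier to see.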

Question 05-e

Obtain the frequencies for manufacturers (1 Marks)

## write your codes here
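
A minimal sketch of one way to obtain these frequencies with base R's `table`. As an assumption, it uses ggplot2's built-in `mpg` data in place of mpgta.csv, which is expected to have the same `manufacturer` column; `cars` and `freq` are illustrative names.

```r
# Stand-in data: ggplot2's built-in mpg (assumed to resemble mpgta.csv)
cars <- ggplot2::mpg

# Number of cars per manufacturer
freq <- table(cars$manufacturer)

# Same counts, ordered from most to least frequent
freq_sorted <- sort(freq, decreasing = TRUE)

print(freq_sorted)
```

With the exam data loaded as in Question 05-b, the equivalent call would be `table(mpg$manufacturer)`.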