STAT 380 – Statistical Consulting and Data Analysis I

Worksheet 1

Name: Sean Cranston

Directions: Please download a copy of this worksheet from D2L, fill out your answers to the questions below, and then submit your completed document to the D2L Dropbox for this assignment by noon on Thursday 8/27.

1- What’s your career goal? Circle one.
    a. Statistical consultant, statistician, data scientist
    b. Other, specify: Actuary
    c. I am not sure.

2- How can this class help you achieve your goal? What skills or concepts do you expect to learn or improve? (briefly)

Improve how I communicate (in writing) the results obtained from an analysis.

Improve how to present an analysis.

3- Are you familiar with the following software?
a) SAS
b) JMP
c) R/R Studio
d) SPSS
e) Other, specify:

Use your statistical background and appropriate software to help your friend resolve the problem below: (this will count as a bonus point toward your final grade. Be ready for a short presentation of your analysis on Friday 08/28).

4- There are two types of buses in Calandria, NM (Express and Regular). A friend of yours has the impression that the Express bus is usually on time but that the Regular bus is often late. The data below were collected at random times over several months to answer this question. There are two variables: type of bus (E for Express, R for Regular) and promptness (L for Late and O for on time). Write a SAS or R or any other software program that reads in all of the data below (again, you decide how to enter the data). Analyze the data in a way you deem appropriate.

E O   E L   E L   R O   E O   E O   E O   R L   R O   R L   
R O   E O   R L   E O   R L   R O   E O   E O   R L   E L   
E O   R L   E O   R L   E O   R L   E O   R O   E L   E O   
E O   E O   E O   E L   E O   E O   R L   R L   R O   R L   
E L   E O   R L   R O   E O   E O   E O   E L   R O   R L 

Data Entry

Given the format above, we can simply copy and paste the data into R and then reshape it into a standard (tidy) data frame with one row per observation, as shown below.

string <- "E O   E L   E L   R O   E O   E O   E O   R L   R O   R L   
R O   E O   R L   E O   R L   R O   E O   E O   R L   E L   
E O   R L   E O   R L   E O   R L   E O   R O   E L   E O   
E O   E O   E O   E L   E O   E O   R L   R L   R O   R L   
E L   E O   R L   R O   E O   E O   E O   E L   R O   R L"

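library(dplyr)  # provides %>%, group_by(), summarise(), mutate()

# Split the pairs onto separate lines, collapse blank lines, then read two columns
data <- string %>% 
    gsub("   ", "\n", ., perl = TRUE) %>% 
    gsub("\n\n", "\n", ., perl = TRUE) %>% 
    read.table(text = ., col.names = c("BUS_ID", "Late"))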
data
##    BUS_ID Late
## 1       E    O
## 2       E    L
## 3       E    L
## 4       R    O
## 5       E    O
## 6       E    O
## 7       E    O
## 8       R    L
## 9       R    O
## 10      R    L
## 11      R    O
## 12      E    O
## 13      R    L
## 14      E    O
## 15      R    L
## 16      R    O
## 17      E    O
## 18      E    O
## 19      R    L
## 20      E    L
## 21      E    O
## 22      R    L
## 23      E    O
## 24      R    L
## 25      E    O
## 26      R    L
## 27      E    O
## 28      R    O
## 29      E    L
## 30      E    O
## 31      E    O
## 32      E    O
## 33      E    O
## 34      E    L
## 35      E    O
## 36      E    O
## 37      R    L
## 38      R    L
## 39      R    O
## 40      R    L
## 41      E    L
## 42      E    O
## 43      R    L
## 44      R    O
## 45      E    O
## 46      E    O
## 47      E    O
## 48      E    L
## 49      R    O
## 50      R    L

We copied and pasted the data and checked a couple of random elements against the original. Our data also contains 50 observations (5 rows with 10 pairs of values each), so we can be reasonably assured that this is the complete dataset.
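The spot check described above could also be automated. The following is a minimal sketch in base R, assuming the same pasted string; the names `tokens`, `bus`, and `status` are illustrative and not part of the original analysis.

```r
# The pasted data, exactly as entered above
string <- "E O   E L   E L   R O   E O   E O   E O   R L   R O   R L
R O   E O   R L   E O   R L   R O   E O   E O   R L   E L
E O   R L   E O   R L   E O   R L   E O   R O   E L   E O
E O   E O   E O   E L   E O   E O   R L   R L   R O   R L
E L   E O   R L   R O   E O   E O   E O   E L   R O   R L"

# Split on any whitespace: tokens alternate bus type, status, bus type, ...
tokens <- scan(text = string, what = character(), quiet = TRUE)
bus    <- tokens[c(TRUE, FALSE)]   # odd positions: E or R
status <- tokens[c(FALSE, TRUE)]   # even positions: L or O

# Fail loudly if the entry is incomplete or contains an unexpected code
stopifnot(length(tokens) == 100,
          all(bus %in% c("E", "R")),
          all(status %in% c("L", "O")))

table(bus, status)
```

If any pair were dropped or mistyped during copying, `stopifnot()` would stop with an error instead of silently producing a shorter or malformed dataset.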

Data Exploration

The table below shows that there are more Express bus observations (29) than Regular bus observations (21). A bar chart of the same counts follows the table.

bus_data <- data %>% 
    group_by(BUS_ID) %>% 
    summarise(n = n())
bus_data
## # A tibble: 2 x 2
##   BUS_ID     n
##   <chr>  <int>
## 1 E         29
## 2 R         21
library(ggplot2)  # provides ggplot(), geom_col(), geom_bar()
ggplot(bus_data, aes(x = BUS_ID, y = n, fill = BUS_ID)) +
    geom_col()

From the table and bar chart below we can see that there are more on-time observations (30) than late observations (20).

late_data <- data %>% 
    group_by(Late) %>% 
    summarise(n = n())
late_data
## # A tibble: 2 x 2
##   Late      n
##   <chr> <int>
## 1 L        20
## 2 O        30
ggplot(late_data, aes(x = Late, y = n,
                     fill = Late))+
    geom_col()

library(gmodels)  # provides CrossTable()
CrossTable(x = data$BUS_ID, y = data$Late,
           prop.chisq = FALSE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  50 
## 
##  
##              | data$Late 
##  data$BUS_ID |         L |         O | Row Total | 
## -------------|-----------|-----------|-----------|
##            E |         7 |        22 |        29 | 
##              |     0.241 |     0.759 |     0.580 | 
##              |     0.350 |     0.733 |           | 
##              |     0.140 |     0.440 |           | 
## -------------|-----------|-----------|-----------|
##            R |        13 |         8 |        21 | 
##              |     0.619 |     0.381 |     0.420 | 
##              |     0.650 |     0.267 |           | 
##              |     0.260 |     0.160 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |        20 |        30 |        50 | 
##              |     0.400 |     0.600 |           | 
## -------------|-----------|-----------|-----------|
## 
## 

The output above is a two-way table of bus type by status, in the same layout as SAS’s crosstab. Each cell contains four values, as listed in the legend above the table. The 1st value is the frequency. The 2nd value is the frequency divided by the row total (row percent). The 3rd value is the frequency divided by the column total (column percent). The 4th value is the frequency divided by the total count.

Also note from the table above that the row percent of on-time Express buses is almost twice that of on-time Regular buses (75.9% vs. 38.1%). Likewise, the row percent of late Regular buses is more than 2.5 times that of late Express buses (61.9% vs. 24.1%).
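The row percents and the two ratios quoted above can be reproduced directly from the counts. This is a small sketch in base R, with the counts typed in from the cross table; `counts`, `row_prop`, `late_ratio`, and `ontime_ratio` are illustrative names.

```r
# Observed counts from the cross table (rows: bus type, columns: status)
counts <- matrix(c(7, 22,
                   13, 8),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(BUS_ID = c("E", "R"), Late = c("L", "O")))

# Row percents: each cell divided by its row total
row_prop <- prop.table(counts, margin = 1)
round(row_prop, 3)

# How many times more often is the Regular bus late than the Express bus?
late_ratio <- row_prop["R", "L"] / row_prop["E", "L"]

# How many times more often is the Express bus on time than the Regular bus?
ontime_ratio <- row_prop["E", "O"] / row_prop["R", "O"]

round(c(late_ratio = late_ratio, ontime_ratio = ontime_ratio), 2)
```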

Below is a simpler version of the cross table above.

table(data)
##       Late
## BUS_ID  L  O
##      E  7 22
##      R 13  8

From the table above, note that the majority of the Express buses are on time. However, the Regular bus tends to run late more often than it runs on time in this data.

We can visualize this in the graphs below.

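# Get the data into a format that is easy to graph.
# Note: the Percent column holds raw counts; position = "fill" later converts them to proportions.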
a <- data %>%
    group_by(BUS_ID,Late) %>%
    summarise(Percent = n(),
              .groups = "keep")
a
## # A tibble: 4 x 3
## # Groups:   BUS_ID, Late [4]
##   BUS_ID Late  Percent
##   <chr>  <chr>   <int>
## 1 E      L           7
## 2 E      O          22
## 3 R      L          13
## 4 R      O           8
ggplot(a, aes(x = BUS_ID, y = Percent, fill = Late))+
    geom_bar(stat = "identity",position = "dodge")+
    facet_grid(.~Late)

ggplot(a, aes(x = Late, y = Percent, fill = BUS_ID))+
    geom_bar(stat = "identity",position = "dodge")+
    facet_grid(.~BUS_ID)

As we can see, these graphs match what the table told us: the Express bus is on time more often than it is late, while the Regular bus runs late more often than it is on time.

Another way to visualize it is:

ggplot(data, aes(x = BUS_ID, fill = Late))+
    geom_bar()

From this stacked graph we can see that there were more observations of Express buses than of Regular buses.

The next graph is a percent stacked bar chart.

ggplot(a, aes(x = BUS_ID,y = Percent, fill = Late))+
    geom_bar(position = "fill",stat="identity")

In this visual we can see that there are more on-time Express buses than late ones, and more late Regular buses than on-time ones. Note that the bar segments correspond to the row percentages discussed earlier.

Explore the question

Now we can ask the question: is the proportion of on-time buses (represented by the blue bars in the graph above) the same for each bus type?

One test we could perform is the Chi-Square test of independence. The assumptions of this test are:

  1. Categorical variables (yes)
  2. Independent observations (assumed)
  3. Cells in the contingency table are mutually exclusive (yes: a bus can’t be both on time and late)
  4. Expected counts should be 5 or greater in at least 80% of cells (see below)

Checking assumption 4

table(data)
##       Late
## BUS_ID  L  O
##      E  7 22
##      R 13  8
# Expected count = (row total * column total) / grand total
EL <- (29*20)/50;EL
## [1] 11.6
EO <- (29*30)/50;EO
## [1] 17.4
RL <- (21*20)/50;RL
## [1] 8.4
RO <- (21*30)/50;RO
## [1] 12.6

As we can see, all the expected values are greater than 5. Thus we can use the Chi-Square test to see whether the proportion of late buses differs between Regular and Express buses.
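As a cross-check on the expected counts, each one is (row total × column total) / grand total, which can be computed for all four cells at once. The sketch below uses base R with the counts typed in from the two-way table; `counts` and `expected` are illustrative names.

```r
# Observed counts from the two-way table (rows: bus type, columns: status)
counts <- matrix(c(7, 22,
                   13, 8),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(BUS_ID = c("E", "R"), Late = c("L", "O")))

# Expected count under independence: row total * column total / grand total
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)
expected

# chisq.test() stores the same matrix in its $expected component
expected_from_test <- chisq.test(counts)$expected

# Share of cells with an expected count of at least 5
share_at_least_5 <- mean(expected >= 5)
share_at_least_5
```

The row totals are 29 (Express) and 21 (Regular) and the column totals are 20 (late) and 30 (on time), so the smallest expected count is 21 × 20 / 50 = 8.4.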

The Chi-Square test is given as:

table_data <- data %>% 
  table()
table_data
##       Late
## BUS_ID  L  O
##      E  7 22
##      R 13  8
chisq.test(table_data)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table_data
## X-squared = 5.7505, df = 1, p-value = 0.01648

From the Chi-Square test above, since the p-value (0.01648) is less than .05, the data suggest that the proportion of on-time Express buses is different from that of on-time Regular buses.
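To see where the reported statistic comes from, the Yates-corrected Chi-Square for a 2x2 table can be worked out by hand. This is a sketch in base R using the standard 2x2 shortcut formula, n(|ad - bc| - n/2)^2 divided by the product of the four marginal totals; `counts`, `x2`, and `p_value` are illustrative names.

```r
# Observed counts from the two-way table (rows: bus type, columns: status)
counts <- matrix(c(7, 22,
                   13, 8),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(BUS_ID = c("E", "R"), Late = c("L", "O")))

n     <- sum(counts)
ad_bc <- counts[1, 1] * counts[2, 2] - counts[1, 2] * counts[2, 1]

# Yates' continuity correction subtracts n/2 from |ad - bc| before squaring
x2 <- n * (abs(ad_bc) - n / 2)^2 /
  (prod(rowSums(counts)) * prod(colSums(counts)))

# Upper-tail probability of a Chi-Square distribution with 1 degree of freedom
p_value <- pchisq(x2, df = 1, lower.tail = FALSE)

c(x2 = x2, p_value = p_value)

# Should agree with the built-in test
builtin <- chisq.test(counts)
```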

We could also use Student’s two-sample t-test, applied to the late/on-time outcome coded as 1/0, to see whether the proportions are truly different.

T-Test Assumptions

  1. Data is continuous or ordinal (the binary late/on-time outcome is coded 1/0 and treated as numeric)
  2. Sample is a good representation of the population (assumed)
  3. Data is approximately normally distributed (see below)
  4. Reasonably large sample size (assumed)
  5. Homogeneity of variance (see below)

Checking assumption 3

# Assuming the two samples are independent
P1 <- .619; N1 <- 21; S1 <- P1*N1*(1-P1); S1
## [1] 4.952619
P2 <- .241; N2 <- 29; S2 <- P2*N2*(1-P2); S2
## [1] 5.304651
# N*P*(1-P) is the binomial variance (not the standard deviation). Since neither variance is close to 0, we can approximate the binomial data with a normal distribution (or a t distribution for a more conservative estimate).

Note that because the binomial variances for the Regular and Express buses are not close to 0, we can approximate this binomial data with a normal distribution.
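A related check is sketched below, under the assumption that a rule of thumb such as "at least 5 expected successes and 5 expected failures per group" is acceptable (the exact threshold varies by textbook and is not taken from this worksheet). It uses the unrounded proportions, so the variances differ slightly from the values computed with .619 and .241. All names are illustrative.

```r
# Observed late counts and sample sizes per bus type (from the two-way table)
late  <- c(Regular = 13, Express = 7)
n     <- c(Regular = 21, Express = 29)
p_hat <- late / n                      # unrounded sample proportions

# Binomial variance n * p * (1 - p) for each bus type
binom_var <- n * p_hat * (1 - p_hat)
round(binom_var, 4)

# Rule of thumb (an assumption): n*p >= 5 and n*(1 - p) >= 5 in each group
rule_ok <- (n * p_hat >= 5) & (n * (1 - p_hat) >= 5)
rule_ok
```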

Checking assumption 5

Regular <- subset(data,BUS_ID == "R")%>% 
  mutate(Late = if_else(Late == "O",0,1))# 0 means the bus was on-time

Express <- subset(data,BUS_ID == "E")%>% 
  mutate(Late = if_else(Late == "O",0,1))# 0 means the bus was on-time

var.test(Regular$Late,Express$Late)
## 
##  F test to compare two variances
## 
## data:  Regular$Late and Express$Late
## F = 1.3056, num df = 20, denom df = 28, p-value = 0.5065
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.584851 3.088686
## sample estimates:
## ratio of variances 
##           1.305628
# Since the p-value > .05, we fail to reject equal variances and treat the two variances as homogeneous.
# Thus we can run Student's two-sample t-test with the parameter var.equal set to TRUE.

The F test (var.test) above gives no evidence against homogeneous variances (p-value = 0.5065), so we proceed with the pooled t-test.

test <- t.test(Regular$Late, Express$Late, var.equal = TRUE); test
## 
##  Two Sample t-test
## 
## data:  Regular$Late and Express$Late
## t = 2.8505, df = 48, p-value = 0.006415
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.1112769 0.6440598
## sample estimates:
## mean of x mean of y 
## 0.6190476 0.2413793

The t-test above assesses whether the proportion of late Express buses is the same as the proportion of late Regular buses. Since the p-value (0.006415) is small, we conclude from this data that there is a difference in proportion between late Express buses and late Regular buses.
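For readers who want to see how the pooled t statistic above arises from the 0/1 coding, here is a minimal sketch in base R. The vectors are rebuilt from the counts in the two-way table, and names such as `late_regular`, `pooled_var`, and `t_stat` are illustrative.

```r
# Rebuild the 0/1 vectors from the table: 1 = late, 0 = on time
late_regular <- rep(c(1, 0), times = c(13, 8))
late_express <- rep(c(1, 0), times = c(7, 22))

n1 <- length(late_regular)
n2 <- length(late_express)

# With 0/1 data, each mean is the sample proportion of late buses
diff_prop <- mean(late_regular) - mean(late_express)

# Pooled variance, weighting each sample variance by its degrees of freedom
pooled_var <- ((n1 - 1) * var(late_regular) + (n2 - 1) * var(late_express)) /
  (n1 + n2 - 2)

# Standard error of the difference, t statistic, and two-sided p-value
se      <- sqrt(pooled_var * (1 / n1 + 1 / n2))
t_stat  <- diff_prop / se
p_value <- 2 * pt(abs(t_stat), df = n1 + n2 - 2, lower.tail = FALSE)

c(t = t_stat, df = n1 + n2 - 2, p_value = p_value)

# Should agree with the built-in pooled t-test
builtin <- t.test(late_regular, late_express, var.equal = TRUE)
```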

Now we have two tests telling us that the proportions of late buses differ between the bus types. Looking at the graphs and tables above, we can see that it is the Express bus that tends to be on time more often. These results are best summarised in the cross table above.