Week 3

Computer Workshop

Load the library MASS. Explore the data set quine and write a short paragraph describing the dataset. Use the functions in dplyr to produce tables to showcase

  1. The average number of days that aboriginal and non-aboriginal children absent from school in the year.

    library(dplyr)
    library(tidyr)
    library(MASS)
    
    z <- quine
    
    # tapply(z$Days, list(z$Eth), mean, na.rm=T)
    
    z  %>% group_by(Eth) %>% summarise(avg=mean(Days, na.rm=T)) %>% spread(Eth, avg)
    ## # A tibble: 1 × 2
    ##       A     N
    ##   <dbl> <dbl>
    ## 1  21.2  12.2
  2. The median number of days that aboriginal and non-aboriginal children absent from school in the year by age group and gender.

    # tapply(z$Days, list(z$Eth, z$Sex, z$Age), median, na.rm=T)
    
    z %>% group_by(Eth, Sex, Age) %>% summarise(med=median(Days, na.rm=T)) %>% spread(Eth, med)
    ## # A tibble: 8 × 4
    ## # Groups:   Sex [2]
    ##   Sex   Age       A     N
    ##   <fct> <fct> <dbl> <dbl>
    ## 1 F     F0       11    20
    ## 2 F     F1       13     6
    ## 3 F     F2       20     4
    ## 4 F     F3       10    12
    ## 5 M     F0       12     7
    ## 6 M     F1        7     5
    ## 7 M     F2       32    12
    ## 8 M     F3       28    27
  3. The difference in the average number of absentee days between boys and girls of Aboriginal background with slow learner status.

    z %>% filter(Eth=="A", Lrn=="SL") %>% group_by(Sex) %>% summarise(avg=mean(Days, na.rm=T)) %>% spread(Sex, avg)
    ## # A tibble: 1 × 2
    ##       F     M
    ##   <dbl> <dbl>
    ## 1  27.4  20.2

Produce data visualisations (figures) to show

  1. The difference in the absentee days between aboriginal and non-aboriginal children.

    library(ggplot2)
    
    
    z1 <- z  %>% group_by(Eth)
    
    g1 <- ggplot(z1, aes(x=Eth, y=Days, fill=Eth)) +
      geom_boxplot() +
      stat_summary(fun = mean, geom = "point", shape = 18, size = 4, color = "black") +
      labs(title = "Absentee days between aboriginal and non-aboriginal children",
           x="Ethnicity",
           y="Absentee days") +
      theme(legend.position="None")
    g1

  2. The difference in the absentee days between aboriginal and non-aboriginal children by age groups.

    z2 <- z %>% group_by(Eth, Age)
    # z2
    
    g2 <- ggplot(z2, aes(x=Age, y=Days, fill=Eth)) +
      geom_boxplot() +
      labs(title = "Absentee days between aboriginal and non-aboriginal children by age group",
           x="Age Group",
           y="Absentee days",
           fill="Ethnicity")
    g2

  3. The difference in the absentee days between boys and girls of aboriginal background and write a half-page report to summarise your findings.

    z3 <- z %>% filter(Eth=="A") %>% group_by(Sex)
    # z3
    
    g3 <- ggplot(z3, aes(x=Sex, y=Days, fill=Sex)) +
      geom_boxplot() +
      stat_summary(fun = mean, geom = "point", shape = 18, size = 4, color = "black") +
      labs(title = "Absentee days between boys and girls of aboriginal background",
           x="Sex",
           y="Absentee days",
           fill="Sex") +
      theme(legend.position="None")
    g3

Aboriginal children have more absentee days than non-aboriginal children. When examining age groups we see that this trend properly starts to occur in F2, as F1 has pretty equal distributions. Amongst aboriginal children, males tend to have more absent days than females, although they are quite close in numbers.

Week 4

Computer workshop

Part 1 - 10 points

In the workshop 3 folder on learnJCU, you will find three most recent data on Covid-19

  1. tested done for Covid-19 (“covid-19-tests-country.csv”)

  2. confirmed cases (“total-cases-covid-19-who.csv”)

  3. deaths due to Covid 19 (“total-deaths-covid-19.csv”)

The “Year” is the number of day since 21st Jan 2020.

Use ggplot to generate publishable figures that show

  • (1 point) changes in the number of confirmed cases since 21st Jan 2020 (pooled across all countries). (Note: You need to replace the label in your figure so it reflects the actual day)

    cases <- read.csv("total-cases-covid-19-who.csv")
    
    
    pooled_cases <- cases %>% group_by(Year) %>% summarise(total_cases=sum(Total.confirmed.cases.of.COVID.19, na.rm=T))
    
    start_date <- as.Date("2020-01-21")
    
    pooled_cases <- pooled_cases %>% mutate(Year=start_date + Year)
    
    x_breaks <- c("2020-01-25","2020-02-01","2020-02-08","2020-02-15","2020-02-22","2020-02-29","2020-03-07","2020-03-14")
    x_breaks <- as.Date(x_breaks)
    y_breaks <- c(0,25000,50000,75000,100000,125000)
    
    
    ggplot(pooled_cases, aes(x=Year,y=total_cases)) +
      geom_line(col="steelblue") +
      labs(title="Confirmed global COVID-19 cases between January 21 to March 14, 2020", x="Date", y="Total Cases") +
      scale_y_continuous(labels = scales::number_format(), breaks = y_breaks) +
      scale_x_continuous(breaks=x_breaks,labels=c("Jan 25", "Feb 1", "Feb 8", "Feb 15", "Feb 22", "Feb 29", "Mar 7", "Mar 14")) +
      theme_classic()

  • (1 point) changes in the number of confirmed cases since day 21st Jan 2020 by country. (Note: Pick five countries of your choice, including China. You need to replace the label in your figure so it reflects the actual day)

    countries <- c("China", "Canada", "Argentina", "Australia", "Germany")
    
    z %>% filter(Eth=="A", Lrn=="SL") %>% group_by(Sex) %>% summarise(avg=mean(Days, na.rm=T)) %>% spread(Sex, avg)
    ## # A tibble: 1 × 2
    ##       F     M
    ##   <dbl> <dbl>
    ## 1  27.4  20.2
    country_cases <- cases %>% filter(Entity %in% countries)
    
    country_cases <- country_cases %>% mutate(Year=start_date + Year)
    
    ggplot(country_cases, aes(x=Year,y=Total.confirmed.cases.of.COVID.19, col=Entity)) +
      geom_line() +
      labs(title="Confirmed COVID-19 cases between January 21 to March 14, 2020", x="Date", y="Total Cases", color="Country") +
      scale_x_continuous(breaks=x_breaks,labels=c("Jan 25", "Feb 1", "Feb 8", "Feb 15", "Feb 22", "Feb 29", "Mar 7", "Mar 14")) +
      theme_classic()

  • (1 point) Changes in the number of deaths due to Covid-19 since 21st Jan 2020 by country. (Note: Pick five countries of your choice, including China. You need to replace the label in your figure so it reflects the actual day)

    deaths <- read.csv("total-deaths-covid-19-who.csv")
    
    
    country_deaths <- deaths %>% filter(Entity %in% countries)
    country_deaths <- country_deaths %>% mutate(Year=start_date + Year)
    
    
    ggplot(country_deaths, aes(x=Year,y=Total.confirmed.deaths.due.to.COVID.19, col=Entity)) +
      geom_line() +
      labs(title="COVID-19 deaths between January 21 to March 14, 2020", x="Date", y="Total deaths", color="Country") +
      scale_x_continuous(breaks=x_breaks,labels=c("Jan 25", "Feb 1", "Feb 8", "Feb 15", "Feb 22", "Feb 29", "Mar 7", "Mar 14")) +
      theme_classic()

  • (4 point) Case fatality rate (CFR) is the proportion of people who die from a specified disease among all individuals diagnosed with the disease over a certain period of time. Case fatality rate typically is used as a measure of disease severity and is often used for prognosis (predicting disease course or outcome), where comparatively high rates are indicative of relatively poor outcomes.

    Here we want to know the CFR for the period of Day 40 to Day 53 of China, Australia, South Korea, Italy and USA. Generate a figure to compare the CFR of these countries.

    \[ CFR(\%)=\frac{\text{number of deaths from disease}}{\text{number of diagnosed cases of disease}}\cdot 100 \]

    CFR_countries <-c("China", "Australia", "South Korea", "Italy", "United States")
    deaths40to53 <- deaths %>% filter(Entity %in% CFR_countries, Year %in% 40:53)
    
    cases40to53 <- cases %>% filter(Entity %in% CFR_countries, Year %in% 40:53)
    
    dc40to53 <- inner_join(deaths40to53, cases40to53, by=c("Year","Entity"))
    
    CFR <- dc40to53 %>% mutate(CFR=100*Total.confirmed.deaths.due.to.COVID.19/Total.confirmed.cases.of.COVID.19)
    
    CFR1 <- CFR %>% group_by(Entity) %>% summarise(total_cases=sum(Total.confirmed.cases.of.COVID.19, na.rm=T), total_deaths=sum(Total.confirmed.deaths.due.to.COVID.19, na.rm=T))
    
    CFR1 <- CFR1 %>% mutate(CFR=100*total_deaths/total_cases)
    CFR1
    ## # A tibble: 5 × 4
    ##   Entity        total_cases total_deaths   CFR
    ##   <chr>               <int>        <int> <dbl>
    ## 1 Australia            1093           29 2.65 
    ## 2 China             1129592        42923 3.80 
    ## 3 Italy               96752         5453 5.64 
    ## 4 South Korea         90623          640 0.706
    ## 5 United States        6185          210 3.40
    ggplot(CFR1, aes(x=Entity, y=CFR, fill=Entity)) +
      geom_bar(stat = "identity", position = "dodge") +
      theme_minimal() +
      labs(title="CFR of Australia, China, Italy, South Korea and USA from February 29 to March 13", x="Country") +
      theme(legend.position="None")

  • (1 point) Select five countries and generate a bar graph showing the number of Covid-19 tests conducted to date by country.

    tests <- read.csv("covid-19-tests-country.csv")
    
    new_countries <- c("Finland", "Vietnam", "South Korea", "China - Guangdong", "Italy")
    
    tests1 <- tests %>% filter(Entity %in% new_countries) 
    tests1
    ##              Entity Code Year Total.COVID.19.tests.performed..Tests.
    ## 1 China - Guangdong         0                                 320000
    ## 2           Finland  FIN   16                                    900
    ## 3             Italy  ITA   17                                  86011
    ## 4       South Korea  KOR   18                                 248647
    ## 5           Vietnam  VNM   17                                   4588
    ggplot(tests1, aes(x=Entity, y=Total.COVID.19.tests.performed..Tests., fill=Entity)) +
      geom_bar(stat = "identity", position = "dodge") +
      theme_minimal() +
      labs(title="Number of Covid-19 tests conducted to date by country", x="Country", y="Total Tests") +
      theme(legend.position="None") +
      scale_y_continuous(labels = scales::number_format())

(2 point) Write a short paragraph on the Covid-19 trend based on the figures you generated.

The figures show a rapid increase in total cases across all countries during the dates given. When analysing cases in specific countries, we see that China accounts for a significant proportion of the total cases, as most other countries only get cases later then China, and are much lower as well. The deaths show a similar trend, however, the CFR figure shows that Italy has a higher case fatality rate than China, which is second highest.

Part 2 - 3 points

  • Let X be a Poisson random variable with a mean of 2. Use R’s built-in functions for the Poisson distribution to find [ 1 point each ]

    • (1 point) \(P(X=2)\)

      dpois(2,2)
      ## [1] 0.2706706
    • (1 point) \(P(X\geq 4)\)

      1 - ppois(3,2)
      ## [1] 0.1428765
    • (1 point) the smallest value of k, such that \(P(x\leq k)=\geq0.7\)

      qpois(0.7,2)
      ## [1] 3

Theory

  1. (2 points) Electromagnetic fields generated by power transmission lines are suspected by some researcher to be a causes of cancer. Especially at risk would be telephone linemen because of their frequent proximity to high-voltage wires. According to one study, two cases of a rare form of cancer were detected among a group of 9500 linemen. In the general population, the incidence of that particular condition is on the order of one in a million. What would you conclude?

    2/9500 is a much higher rate than 1 in a million. This suggests the incidence of the condition is indeed significantly higher for telephone linemen. However, this is only a correlation, and further studies would have to be conducted to determine whether the relationship is causal.

  2. (2 points) During the latter part of the nineteenth century. Prussian officials gathered information relating to the hazards that horses posed to cavalry soldiers. A total of 10 cavalry corps were monitored over a period of 20 years. Recorded for each year and each corps was X, the annual number of fatalities due to kicks. Summarised in the following table are the 200 values recorded for X. Show that these data can be modeled by a Poisson PDF

    Number of Deaths, k Observed number of Corps-Years in which k fatalities occured

    0

    1

    2

    3

    4

    109

    65

    22

    3

    1

  3. (2 points) In a new fiber optic communication system, transmission errors occur at the rate of 1.5 per 10 seconds. What is the probability that at least two errors will occur during the next half-minute?

    \[ \begin{align} E(10)= 1.5 \\ E(3\cdot10)=3\cdot1.5=4.5 \\ X\sim \text{Poisson}(4.5) \\ P(X\geq2) &= 1-P(X\leq1)\\ &=1-P(X=0)-P(X=1) \\ &=0.9389 \end{align} \]

  4. (2 points) A newly formed life insurance company has a underwritten term policies on 120 women between the ages of 40 and 44. Suppose each woman has 1/150 probability of dying during the next calendar year, and each death requires the company to pay out $50,000 in benefits. Approximate the probability that the company will have to pay at least $150,000 in benefits next year.

    \[ \begin{align} X\sim B(120,\frac{1}{150})\\ k=\frac{150000}{50000}=3 \\ P(X\geq 3)&=1-P(X<3) \\ &=1-P(X=0)-P(X=1)-P(X=2) \\ &= 0.04684 \end{align} \]

  5. (2 points) An army is soliciting proposals for the development of a truck-launched antitank missile. Pentagon officials are requiring that the automatic sighting mechanism be sufficiently reliable to guarantee that 95% of the missiles will fall no more than 50ft short of their target or no more than 50ft beyond. What is the largest σ compatible with the degree of precision? Assume the horizontal distance a missile travels is normally distributed with its mean equal to the length of the separation between the truck and the target.

    \[ P(\mu-50\leq X\leq \mu+50)=0.95 \\ P(X\leq \mu+50)-P(X\leq \mu-50) =0.95\\ \frac{\mu+50-\mu}{\sigma}-\frac{\mu-50-\mu}{\sigma}=0.95 \\ \frac{100}{\sigma}=0.95\\ \sigma=105.26...\\ \sigma=105 \]

  6. (2 points) An underground military installation is fortified to the extent that it can withstand up to three direct hits from air-to-surface missiles and still function. Suppose an enemy aircraft is armed with missiles, each having 30% chance of scoring a direct hit. What is the probability that the installation will be destroyed with the seventh missile fired?

    \[ X\sim B(6,0.30)\\ P(X=2) = {6\choose2}0.3^20.7^4\\ =0.324...\\ 0.324...\cdot30\%=0.0972\\ \]

Week 5&6

Theory

  1. Suppose that a particular candidate for public office is in fact favoured by 48% of all registered voters in a district. A polling organisation takes a random sample of 500 voters, what is the approximate probability that more than 50% of sampled voters favour the candidate?

  2. Suppose a random sample of size n is drawn from normal PDF,

    \[ f_y(y;\mu,\sigma)=\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{1}{2}(\frac{y-\mu}{\sigma})^2} \]

    1. Find MLEs for \(\mu\) and \(\sigma^2\)

    2. Find MM estimators for \(\mu\) and \(\sigma^2\)