library(tidyverse)
library(readxl)
library(portfolio)

Midterm Part 1

Read in data

df <- readRDS("gni2014.Rda")

Making the tree map

This data represents the Gross National Income (per capita) in dollars and population totals per country in 2014.

map.market(id=df$country, 
           group=df$continent, 
           area=df$population, 
           color=df$GNI / 1000,
           scale=107,
           main="GNI in $1000s per Country/Population"
           )

The size of each box represents the total population, and the color density represents the Gross National Income in increments of $1000, with the lightest green being the highest income and darker squares being lower income. GNI is scaled to $1000s because the resulting values are much easier to read than raw figures ranging from $250 to $106,140. The boxes are grouped by continent so that regions can be read as a whole, rather than as a large clutter of individual country boxes.
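
As a small sanity check on the scaling, the sketch below reproduces the arithmetic using only the two endpoints quoted above. It assumes (this is not stated in the original) that scale=107 was chosen by rounding the largest scaled value up to the next whole number.

```r
# Minimum and maximum GNI per capita quoted in the text, in dollars
gni_range <- c(250, 106140)

# Same rescaling as the color argument of map.market: dollars -> $1000s
scaled <- gni_range / 1000

# Assumption: the color scale limit is the largest scaled value, rounded up
scale_limit <- ceiling(max(scaled))

print(scaled)
print(scale_limit)
```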

Midterm Part 2

(OPTIONAL)

Download golf data to make it consistent with the rest of the data

Output of my copy and paste into excel: https://1drv.ms/x/s!AmtFQLCvRkiegsNLDYpLWAsoHsQblw

The code below doesn't actually work. I spent a good while trying to get it running, but it would probably be easier with access to an actual web server that could host the file under a plain link with a normal extension, rather than the jumbled share links that OneDrive and other providers give me.

#download.file(url="https://1drv.ms/x/s!AmtFQLCvRkiegsNLDYpLWAsoHsQblw", destfile="golf.xlsx", mode="wb")

Read in golf data

golf <- read_excel("golf.xlsx", 
    col_types = c("numeric", "text", "text", 
        "text", "text", "numeric", "numeric"))

Setting fill colors for US winners

# Red for US champions, grey for everyone else (vectorized, no loop needed)
fill_colors_us <- ifelse(golf$Country == "United States", "red", "#cccccc")

United States winners

barplot(golf$`Total score`, names.arg=golf$Year,col=fill_colors_us, space=.7,
        xlab="Year", ylab="Total Score",
        main="Winners of the US Open golf
tournament (US Winners highlighted red)")

List of names and years won by golfers from the United States

golf %>% 
  select(Champion,Year,Country) %>%
  filter(Country == "United States")

Detect duplicates/set colors for repeat winners

golf$repeatwinner <- duplicated(golf$Champion)

# repeatwinner is already logical, so it can be used directly as the test
fill_colors_winners <- ifelse(golf$repeatwinner, "cyan", "#cccccc")

Repeat Winners

barplot(golf$`Total score`, names.arg=golf$Year,col=fill_colors_winners, space=.7,
        xlab="Year", ylab="Total Score",
        main="Winners of the US Open golf tournament (Repeat Winners highlighted cyan)")

List of names and years of repeat winners

golf %>% 
  filter(repeatwinner) %>%
  select(Champion,Year,Country)
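
One thing worth keeping in mind when reading the repeat-winner results: duplicated() flags only the second and later appearances of a value, so a champion's first win is not marked. The sketch below shows the difference on a toy vector (the names are hypothetical, not taken from the golf data), along with one possible way to flag every win by a repeat champion, if that were the goal instead.

```r
# Hypothetical champion names, in order of year
champions <- c("A", "B", "A", "C", "B", "A")

# What duplicated() marks: only the repeat appearances
later_only <- duplicated(champions)

# Alternative: mark every win by anyone who won more than once
all_wins <- champions %in% champions[duplicated(champions)]

print(data.frame(champions, later_only, all_wins))
```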

Midterm Part 3

Time series. Consider the Air Passengers data; in R type data("AirPassengers"). Find an appropriate decomposition for the data. Create ACF and PACF plots and assess whether it is white noise. Transform and difference the data as necessary to try and get a result that is close to white noise (this might not be perfect). Give a visualization that justifies this result, and comment.

plot(decompose(AirPassengers,"multiplicative"))
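
For reference, a multiplicative decomposition treats the series as trend x seasonal x random. The sketch below is only a check of that identity on the built-in AirPassengers data: multiplying the three components from decompose() back together should recover the observed series wherever the moving-average trend is defined.

```r
dec <- decompose(AirPassengers, type = "multiplicative")

# Recombine the components; the trend is NA at both ends of the series
recon <- dec$trend * dec$seasonal * dec$random
ok <- !is.na(recon)

# Compare against the observed values where the trend exists
matches <- all.equal(as.numeric(recon[ok]), as.numeric(AirPassengers[ok]))
print(matches)
```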

plot(AirPassengers)

ap.d <- diff(log(AirPassengers))

ap.dx <- log(AirPassengers)
plot(ap.dx)

plot(ap.d)

We take the log of the observed data so that the seasonal swings have a much more uniform size across the years, which gives a more palatable trend, and then we take the lag-1 differences to remove the trend.
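
A rough way to see what the log does is to compare the spread within the first and last years of the series, before and after the transform. This is only an illustrative sketch; using the within-year standard deviation as the measure of spread is our choice here, not part of the original analysis.

```r
ap <- AirPassengers

# First and last full years of the series
first_year <- window(ap, start = c(1949, 1), end = c(1949, 12))
last_year  <- window(ap, start = c(1960, 1), end = c(1960, 12))

# How much larger the within-year spread is at the end than at the start
ratio_raw <- sd(last_year) / sd(first_year)
ratio_log <- sd(log(last_year)) / sd(log(first_year))

print(c(raw = ratio_raw, log = ratio_log))
```

A ratio closer to 1 on the log scale means the seasonal swings are closer to constant size.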

The trend is gone in the second plot (ap.d): a horizontal line drawn through the data would fit it.

acf(ap.d)

We see there is still a seasonal component in the data. The series is measured monthly, so each line in the ACF plot corresponds to a lag of one month.

To eliminate the seasonal component, we difference at a lag of 12, the seasonal period for monthly data.

ap.ds <- diff(log(AirPassengers), 12)
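
To make it concrete what diff() with a lag of 12 computes, the sketch below does the same subtraction by hand: each month's log value minus the log value from the same month one year earlier.

```r
x <- log(AirPassengers)
n <- length(x)

# Seasonal difference by hand: x[t] - x[t - 12]
manual <- as.numeric(x[13:n]) - as.numeric(x[1:(n - 12)])

# The same thing with diff()
auto <- diff(x, lag = 12)

same <- all.equal(as.numeric(auto), manual)
print(same)
```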

plot(ap.ds)

acf(ap.ds)

We are then left with something very close to white noise. Compared with the automatically decomposed graph, our results look very similar.

We do see a downward trend remaining in the ACF over part of the first year, meaning there are still some violations: the data still appear correlated at the first few lags, so this does not seem to be pure white noise. The correlation dies out before the end of the first year, however.
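
The visual read of the ACF can be backed up with a couple of numbers. The sketch below is an addition, not part of the original analysis: it counts how many ACF and PACF lags fall outside the approximate 95% white-noise bounds and runs a Ljung-Box test. The choices of 24 lags for the counts and 12 lags for the test are assumptions made for illustration.

```r
ap.ds <- diff(log(AirPassengers), 12)
n <- length(ap.ds)

# Approximate 95% bounds for the sample autocorrelation of white noise
bound <- qnorm(0.975) / sqrt(n)

# ACF (dropping lag 0) and PACF values, without drawing the plots
acf_vals  <- acf(ap.ds, lag.max = 24, plot = FALSE)$acf[-1]
pacf_vals <- pacf(ap.ds, lag.max = 24, plot = FALSE)$acf

n_acf_outside  <- sum(abs(acf_vals) > bound)
n_pacf_outside <- sum(abs(pacf_vals) > bound)

# Ljung-Box test: a small p-value is evidence against white noise
lb <- Box.test(ap.ds, lag = 12, type = "Ljung-Box")

print(c(acf = n_acf_outside, pacf = n_pacf_outside))
print(lb)
```

Calling pacf(ap.ds) without plot = FALSE draws the PACF plot that the question asks for.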
