Choose any of the three wide datasets identified in the week 6 discussion items. (You may choose your own.) Read the information from your CSV into R and use tidyr and dplyr as needed to transform the data. Perform the analysis requested in the discussion item.

Population Dataset

Steven Tipton shared this data set in the discussion items. The following steps are requested:

1) Melt the data into a tidy format with just country, year, and population
2) Explore the NAs and "--" entries and research notable events that could have caused changes in population
3) Remove entries for Eurasia, North America, and World so that only countries remain
4) See how population changes around the world after political shifts
5) Add a column that classifies countries into continents

This data set stood out to me because I find it really interesting to see how populations have changed over time for certain nations. There are known reasons behind many of these population shifts, and I want to highlight some of them in my analysis with easy-to-read visualizations.

Load Libraries

Any additional libraries will be loaded as needed.

library(tidyr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(readr)
library(RCurl)
## Loading required package: bitops
## 
## Attaching package: 'RCurl'
## The following object is masked from 'package:tidyr':
## 
##     complete

Read in the data which can be found using the link below: https://raw.githubusercontent.com/vindication09/DATA-607-Project-2/master/populationbycountry19802010millions.csv

I will read in the data from a location on my own machine

library(readr)
populationbycountry19802010millions <- read_csv("~/Desktop/DATA Science SPS/DATA 607/Week 5:6/Project2Data/populationbycountry19802010millions.csv")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
##   .default = col_character()
## )
## See spec(...) for full column specifications.
head(populationbycountry19802010millions)
## # A tibble: 6 x 32
##   X1      `1980`  `1981`  `1982` `1983` `1984` `1985` `1986` `1987` `1988`
##   <chr>   <chr>   <chr>   <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr> 
## 1 North … 320.27… 324.44… 328.6… 332.7… 336.7… 340.7… 344.8… 349.0… 353.2…
## 2 Bermuda 0.05473 0.05491 0.055… 0.055… 0.055… 0.056… 0.056… 0.056… 0.057…
## 3 Canada  24.5933 24.9    25.20… 25.45… 25.70… 25.94… 26.20… 26.54… 26.89…
## 4 Greenl… 0.05021 0.05103 0.051… 0.052… 0.052… 0.053… 0.053… 0.0541 0.054…
## 5 Mexico  68.347… 69.969… 71.64… 73.36… 75.08… 76.76… 78.44… 80.12… 81.78…
## 6 Saint … 0.00599 0.00601 0.006… 0.006… 0.006… 0.006… 0.006… 0.006… 0.006…
## # ... with 22 more variables: `1989` <chr>, `1990` <chr>, `1991` <chr>,
## #   `1992` <chr>, `1993` <chr>, `1994` <chr>, `1995` <chr>, `1996` <chr>,
## #   `1997` <chr>, `1998` <chr>, `1999` <chr>, `2000` <chr>, `2001` <chr>,
## #   `2002` <chr>, `2003` <chr>, `2004` <chr>, `2005` <chr>, `2006` <chr>,
## #   `2007` <chr>, `2008` <chr>, `2009` <chr>, `2010` <chr>

I want to rename column 1. It is currently called X1, but I want it to be my Country column.

colnames(populationbycountry19802010millions)[which(names(populationbycountry19802010millions) == "X1")] <- "Country"
head(populationbycountry19802010millions, 2)
## # A tibble: 2 x 32
##   Country  `1980`  `1981` `1982` `1983` `1984` `1985` `1986` `1987` `1988`
##   <chr>    <chr>   <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr> 
## 1 North A… 320.27… 324.4… 328.6… 332.7… 336.7… 340.7… 344.8… 349.0… 353.2…
## 2 Bermuda  0.05473 0.054… 0.055… 0.055… 0.055… 0.056… 0.056… 0.056… 0.057…
## # ... with 22 more variables: `1989` <chr>, `1990` <chr>, `1991` <chr>,
## #   `1992` <chr>, `1993` <chr>, `1994` <chr>, `1995` <chr>, `1996` <chr>,
## #   `1997` <chr>, `1998` <chr>, `1999` <chr>, `2000` <chr>, `2001` <chr>,
## #   `2002` <chr>, `2003` <chr>, `2004` <chr>, `2005` <chr>, `2006` <chr>,
## #   `2007` <chr>, `2008` <chr>, `2009` <chr>, `2010` <chr>
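For reference, dplyr offers a one-line equivalent that could have been used instead of the base-R assignment; a sketch (not run here, since the column is already renamed):

#dplyr equivalent (new name = old name)
#populationbycountry19802010millions <- rename(populationbycountry19802010millions, Country = X1)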

I want to create a column that classifies countries into continents. There is a package that does this called "countrycode." Before we use it, we need to do some data prep by removing entries that can't be classified, since the classification only maps countries to continents.

Let's remove items that are not countries, such as North America, Eurasia, and the other aggregate or historical entries:

- North America
- Central & South America
- Antarctica
- Eurasia
- Middle East
- Asia & Oceania
- World
- Africa
- Europe
- Former Czechoslovakia
- Former Serbia and Montenegro
- Former Yugoslavia
- East
- Hawaiian Trade Zone
- U.S. Pacific Islands
- Wake Island
- Former U.S.S.R.

It was easy to identify which entries to remove: the first time I ran the classification, I got a warning message listing the entries that could not be classified. I simply copied and pasted this list from the warning and used it as my driving vector to remove them from my data.

#Make a vector that lists all the countries that could not be classified by country code
remove<- c('North America', 'Central & South America', 'Antarctica', 
                                'Eurasia', 'Middle East', 
                                'Asia & Oceania', 
                                'World', 'Africa', 'Europe', 
           'Former Czechoslovakia', 'Former Serbia and Montenegro', 'Former Yugoslavia', 
           'East', 'Hawaiian Trade Zone', 'U.S. Pacific Islands', 'Wake Island', 'Former U.S.S.R.')

dfpop <- populationbycountry19802010millions[ !grepl(paste(remove, collapse="|"), populationbycountry19802010millions$Country),]

head(dfpop, 2)
## # A tibble: 2 x 32
##   Country `1980`  `1981`  `1982` `1983` `1984` `1985` `1986` `1987` `1988`
##   <chr>   <chr>   <chr>   <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr> 
## 1 Bermuda 0.05473 0.05491 0.055… 0.055… 0.055… 0.056… 0.056… 0.056… 0.057…
## 2 Canada  24.5933 24.9    25.20… 25.45… 25.70… 25.94… 26.20… 26.54… 26.89…
## # ... with 22 more variables: `1989` <chr>, `1990` <chr>, `1991` <chr>,
## #   `1992` <chr>, `1993` <chr>, `1994` <chr>, `1995` <chr>, `1996` <chr>,
## #   `1997` <chr>, `1998` <chr>, `1999` <chr>, `2000` <chr>, `2001` <chr>,
## #   `2002` <chr>, `2003` <chr>, `2004` <chr>, `2005` <chr>, `2006` <chr>,
## #   `2007` <chr>, `2008` <chr>, `2009` <chr>, `2010` <chr>
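One caveat about the grepl() approach: paste(remove, collapse="|") does substring matching, so a pattern like "East" would also drop a country such as East Timor, and the periods in "U.S." are regex metacharacters. A sketch of an exact-match alternative, assuming the same remove vector:

#exact matching on Country avoids accidental substring removals
dfpop.exact <- populationbycountry19802010millions[
  !(populationbycountry19802010millions$Country %in% remove), ]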

We also need to convert the tibble into a base data frame; with a tibble, df.pop2[, "Country"] returns a one-column tibble rather than a character vector, which countrycode() cannot use.

df.pop2<-data.frame(dfpop)
nrow(df.pop2)
## [1] 212
head(df.pop2)
##                     Country     X1980     X1981     X1982     X1983
## 1                   Bermuda   0.05473   0.05491   0.05517   0.05551
## 2                    Canada   24.5933      24.9   25.2019   25.4563
## 3                 Greenland   0.05021   0.05103   0.05166   0.05211
## 4                    Mexico  68.34748  69.96926   71.6409  73.36288
## 5 Saint Pierre and Miquelon   0.00599   0.00601   0.00605   0.00607
## 6             United States 227.22468 229.46571 231.66446 233.79199
##      X1984    X1985     X1986     X1987     X1988     X1989     X1990
## 1  0.05585  0.05618   0.05651   0.05683   0.05717   0.05749   0.05778
## 2  25.7018  25.9416   26.2038   26.5497   26.8948   27.3793   27.7906
## 3  0.05263  0.05315   0.05364    0.0541   0.05485   0.05541   0.05563
## 4 75.08014 76.76723  78.44243  80.12249  81.78182  83.36684  84.91365
## 5  0.00611  0.00616   0.00621   0.00625   0.00628   0.00631   0.00632
## 6 235.8249 237.9238 240.13289 242.28892 244.49898 246.81923 249.62281
##       X1991     X1992     X1993     X1994     X1995     X1996     X1997
## 1    0.0581    0.0587   0.05924   0.05975   0.06029   0.06087   0.06145
## 2   28.1179  28.54489  28.95334  29.33081  29.69053  30.02632   30.3056
## 3   0.05554   0.05549   0.05564   0.05592   0.05619   0.05634   0.05651
## 4  86.48803  88.11103  89.74914   91.3379  92.88035  94.39858  95.89515
## 5   0.00633   0.00636   0.00638    0.0064    0.0064   0.00641   0.00642
## 6 252.98094 256.51422 259.91859 263.12582 266.27839 269.39428 272.64693
##      X1998     X1999     X2000     X2001     X2002     X2003     X2004
## 1  0.06198   0.06251   0.06306   0.06361   0.06418   0.06476   0.06534
## 2 30.55166  30.82026  31.09956  31.37674  31.64096  31.88931  32.13476
## 3  0.05661    0.0567   0.05689   0.05713   0.05736   0.05754    0.0577
## 4 97.32506  98.61691  99.92662 101.24696 102.47993 103.71806 104.95959
## 5  0.00642   0.00643   0.00641   0.00637   0.00633   0.00629   0.00625
## 6 275.8541 279.04017 282.17196 285.08156 287.80391 290.32642 293.04574
##       X2005     X2006     X2007     X2008     X2009     X2010
## 1   0.06591   0.06644   0.06692   0.06739   0.06784   0.06827
## 2  32.38638  32.65668  32.93596   33.2127  33.48721  33.75974
## 3   0.05778   0.05764   0.05753   0.05756    0.0576   0.05764
## 4  106.2029 107.44953 108.70089  109.9554 111.21179 112.46886
## 5    0.0062   0.00615    0.0061   0.00605     0.006   0.00594
## 6 295.75315 298.59321  301.5799 304.37485 307.00655 310.23286

Converting to a data frame put an X in front of every year column name. This is data.frame() at work: it runs column names through make.names(), which prefixes names that are not syntactically valid (such as names starting with a digit) with an X. We can use a regular expression to clean the column names.

#remove the leading X (anchored so only a leading X is stripped)
names(df.pop2) <- gsub(x = names(df.pop2), pattern = "^X", replacement = "")  
names(df.pop2)
##  [1] "Country" "1980"    "1981"    "1982"    "1983"    "1984"    "1985"   
##  [8] "1986"    "1987"    "1988"    "1989"    "1990"    "1991"    "1992"   
## [15] "1993"    "1994"    "1995"    "1996"    "1997"    "1998"    "1999"   
## [22] "2000"    "2001"    "2002"    "2003"    "2004"    "2005"    "2006"   
## [29] "2007"    "2008"    "2009"    "2010"
head(df.pop2, 2)
##   Country    1980    1981    1982    1983    1984    1985    1986    1987
## 1 Bermuda 0.05473 0.05491 0.05517 0.05551 0.05585 0.05618 0.05651 0.05683
## 2  Canada 24.5933    24.9 25.2019 25.4563 25.7018 25.9416 26.2038 26.5497
##      1988    1989    1990    1991     1992     1993     1994     1995
## 1 0.05717 0.05749 0.05778  0.0581   0.0587  0.05924  0.05975  0.06029
## 2 26.8948 27.3793 27.7906 28.1179 28.54489 28.95334 29.33081 29.69053
##       1996    1997     1998     1999     2000     2001     2002     2003
## 1  0.06087 0.06145  0.06198  0.06251  0.06306  0.06361  0.06418  0.06476
## 2 30.02632 30.3056 30.55166 30.82026 31.09956 31.37674 31.64096 31.88931
##       2004     2005     2006     2007    2008     2009     2010
## 1  0.06534  0.06591  0.06644  0.06692 0.06739  0.06784  0.06827
## 2 32.13476 32.38638 32.65668 32.93596 33.2127 33.48721 33.75974

The data is now prepped for countrycode. This will scan the Country column and produce a vector of associated continents, which goes in its own data frame. We can append that data frame to df.pop2 to complete the task of adding a continent column for each country.

#Lets classify country names into their Continents using the countrycode library 
library(countrycode)
df.continent <- as.factor(countrycode(sourcevar = df.pop2[, "Country"], origin = "country.name", destination = "continent"))
df.continent<-data.frame(df.continent)
nrow(df.continent)
## [1] 212
head(df.continent)
##   df.continent
## 1     Americas
## 2     Americas
## 3     Americas
## 4     Americas
## 5     Americas
## 6     Americas
#now append continent as its own column 
df.countries<-data.frame(df.continent, df.pop2)
head(df.countries)
##   df.continent                   Country     X1980     X1981     X1982
## 1     Americas                   Bermuda   0.05473   0.05491   0.05517
## 2     Americas                    Canada   24.5933      24.9   25.2019
## 3     Americas                 Greenland   0.05021   0.05103   0.05166
## 4     Americas                    Mexico  68.34748  69.96926   71.6409
## 5     Americas Saint Pierre and Miquelon   0.00599   0.00601   0.00605
## 6     Americas             United States 227.22468 229.46571 231.66446
##       X1983    X1984    X1985     X1986     X1987     X1988     X1989
## 1   0.05551  0.05585  0.05618   0.05651   0.05683   0.05717   0.05749
## 2   25.4563  25.7018  25.9416   26.2038   26.5497   26.8948   27.3793
## 3   0.05211  0.05263  0.05315   0.05364    0.0541   0.05485   0.05541
## 4  73.36288 75.08014 76.76723  78.44243  80.12249  81.78182  83.36684
## 5   0.00607  0.00611  0.00616   0.00621   0.00625   0.00628   0.00631
## 6 233.79199 235.8249 237.9238 240.13289 242.28892 244.49898 246.81923
##       X1990     X1991     X1992     X1993     X1994     X1995     X1996
## 1   0.05778    0.0581    0.0587   0.05924   0.05975   0.06029   0.06087
## 2   27.7906   28.1179  28.54489  28.95334  29.33081  29.69053  30.02632
## 3   0.05563   0.05554   0.05549   0.05564   0.05592   0.05619   0.05634
## 4  84.91365  86.48803  88.11103  89.74914   91.3379  92.88035  94.39858
## 5   0.00632   0.00633   0.00636   0.00638    0.0064    0.0064   0.00641
## 6 249.62281 252.98094 256.51422 259.91859 263.12582 266.27839 269.39428
##       X1997    X1998     X1999     X2000     X2001     X2002     X2003
## 1   0.06145  0.06198   0.06251   0.06306   0.06361   0.06418   0.06476
## 2   30.3056 30.55166  30.82026  31.09956  31.37674  31.64096  31.88931
## 3   0.05651  0.05661    0.0567   0.05689   0.05713   0.05736   0.05754
## 4  95.89515 97.32506  98.61691  99.92662 101.24696 102.47993 103.71806
## 5   0.00642  0.00642   0.00643   0.00641   0.00637   0.00633   0.00629
## 6 272.64693 275.8541 279.04017 282.17196 285.08156 287.80391 290.32642
##       X2004     X2005     X2006     X2007     X2008     X2009     X2010
## 1   0.06534   0.06591   0.06644   0.06692   0.06739   0.06784   0.06827
## 2  32.13476  32.38638  32.65668  32.93596   33.2127  33.48721  33.75974
## 3    0.0577   0.05778   0.05764   0.05753   0.05756    0.0576   0.05764
## 4 104.95959  106.2029 107.44953 108.70089  109.9554 111.21179 112.46886
## 5   0.00625    0.0062   0.00615    0.0061   0.00605     0.006   0.00594
## 6 293.04574 295.75315 298.59321  301.5799 304.37485 307.00655 310.23286

We rename the df.continent column to Continent to complete the task.

#rename, we now have added a column that classifies countries into continents 
colnames(df.countries)[which(names(df.countries) == "df.continent")] <- "Continent"
names(df.countries)
##  [1] "Continent" "Country"   "X1980"     "X1981"     "X1982"    
##  [6] "X1983"     "X1984"     "X1985"     "X1986"     "X1987"    
## [11] "X1988"     "X1989"     "X1990"     "X1991"     "X1992"    
## [16] "X1993"     "X1994"     "X1995"     "X1996"     "X1997"    
## [21] "X1998"     "X1999"     "X2000"     "X2001"     "X2002"    
## [26] "X2003"     "X2004"     "X2005"     "X2006"     "X2007"    
## [31] "X2008"     "X2009"     "X2010"

The X is back in front of the year columns: data.frame() applied make.names() again when the continent and population frames were combined (passing check.names = FALSE to data.frame() would prevent this). The same fix works:

names(df.countries) <- gsub(x = names(df.countries), pattern = "^X", replacement = "")  
names(df.countries)
##  [1] "Continent" "Country"   "1980"      "1981"      "1982"     
##  [6] "1983"      "1984"      "1985"      "1986"      "1987"     
## [11] "1988"      "1989"      "1990"      "1991"      "1992"     
## [16] "1993"      "1994"      "1995"      "1996"      "1997"     
## [21] "1998"      "1999"      "2000"      "2001"      "2002"     
## [26] "2003"      "2004"      "2005"      "2006"      "2007"     
## [31] "2008"      "2009"      "2010"
head(df.countries)
##   Continent                   Country      1980      1981      1982
## 1  Americas                   Bermuda   0.05473   0.05491   0.05517
## 2  Americas                    Canada   24.5933      24.9   25.2019
## 3  Americas                 Greenland   0.05021   0.05103   0.05166
## 4  Americas                    Mexico  68.34748  69.96926   71.6409
## 5  Americas Saint Pierre and Miquelon   0.00599   0.00601   0.00605
## 6  Americas             United States 227.22468 229.46571 231.66446
##        1983     1984     1985      1986      1987      1988      1989
## 1   0.05551  0.05585  0.05618   0.05651   0.05683   0.05717   0.05749
## 2   25.4563  25.7018  25.9416   26.2038   26.5497   26.8948   27.3793
## 3   0.05211  0.05263  0.05315   0.05364    0.0541   0.05485   0.05541
## 4  73.36288 75.08014 76.76723  78.44243  80.12249  81.78182  83.36684
## 5   0.00607  0.00611  0.00616   0.00621   0.00625   0.00628   0.00631
## 6 233.79199 235.8249 237.9238 240.13289 242.28892 244.49898 246.81923
##        1990      1991      1992      1993      1994      1995      1996
## 1   0.05778    0.0581    0.0587   0.05924   0.05975   0.06029   0.06087
## 2   27.7906   28.1179  28.54489  28.95334  29.33081  29.69053  30.02632
## 3   0.05563   0.05554   0.05549   0.05564   0.05592   0.05619   0.05634
## 4  84.91365  86.48803  88.11103  89.74914   91.3379  92.88035  94.39858
## 5   0.00632   0.00633   0.00636   0.00638    0.0064    0.0064   0.00641
## 6 249.62281 252.98094 256.51422 259.91859 263.12582 266.27839 269.39428
##        1997     1998      1999      2000      2001      2002      2003
## 1   0.06145  0.06198   0.06251   0.06306   0.06361   0.06418   0.06476
## 2   30.3056 30.55166  30.82026  31.09956  31.37674  31.64096  31.88931
## 3   0.05651  0.05661    0.0567   0.05689   0.05713   0.05736   0.05754
## 4  95.89515 97.32506  98.61691  99.92662 101.24696 102.47993 103.71806
## 5   0.00642  0.00642   0.00643   0.00641   0.00637   0.00633   0.00629
## 6 272.64693 275.8541 279.04017 282.17196 285.08156 287.80391 290.32642
##        2004      2005      2006      2007      2008      2009      2010
## 1   0.06534   0.06591   0.06644   0.06692   0.06739   0.06784   0.06827
## 2  32.13476  32.38638  32.65668  32.93596   33.2127  33.48721  33.75974
## 3    0.0577   0.05778   0.05764   0.05753   0.05756    0.0576   0.05764
## 4 104.95959  106.2029 107.44953 108.70089  109.9554 111.21179 112.46886
## 5   0.00625    0.0062   0.00615    0.0061   0.00605     0.006   0.00594
## 6 293.04574 295.75315 298.59321  301.5799 304.37485 307.00655 310.23286

We can use tidyr to transform the data. The data is in wide format, with one column per year. I want to gather the year columns into key-value pairs and create a new long data frame.

#Let's gather the year columns into key-value pairs and create a new long data frame
longdf.pop <- df.countries %>% gather(Year, Population, "1980":"2010")
head(longdf.pop, 10)  
##    Continent                   Country Year Population
## 1   Americas                   Bermuda 1980    0.05473
## 2   Americas                    Canada 1980    24.5933
## 3   Americas                 Greenland 1980    0.05021
## 4   Americas                    Mexico 1980   68.34748
## 5   Americas Saint Pierre and Miquelon 1980    0.00599
## 6   Americas             United States 1980  227.22468
## 7   Americas       Antigua and Barbuda 1980    0.06855
## 8   Americas                 Argentina 1980    28.3698
## 9   Americas                     Aruba 1980         --
## 10  Americas              Bahamas, The 1980    0.20976
tail(longdf.pop, 10)
##      Continent         Country Year Population
## 6563      Asia     Philippines 2010   99.90018
## 6564   Oceania           Samoa 2010      0.192
## 6565      Asia       Singapore 2010    4.70107
## 6566   Oceania Solomon Islands 2010     0.5592
## 6567      Asia       Sri Lanka 2010   21.51399
## 6568      Asia          Taiwan 2010   23.02496
## 6569      Asia        Thailand 2010    67.0895
## 6570   Oceania           Tonga 2010    0.12258
## 6571   Oceania         Vanuatu 2010    0.22155
## 6572      Asia         Vietnam 2010   89.57113

The gather() call named the key and value columns directly, so no renaming is needed. A quick check of the column names:

names(longdf.pop)
## [1] "Continent"  "Country"    "Year"       "Population"

I chose to remove rows that contain NA but keep the rows marked "--". A "--" does not imply that the population was zero, but rather that no population was recorded or the country was not yet established or did not exist at that time.

longdf.pop<-na.omit(longdf.pop)
head(longdf.pop)
##   Continent                   Country Year Population
## 1  Americas                   Bermuda 1980    0.05473
## 2  Americas                    Canada 1980    24.5933
## 3  Americas                 Greenland 1980    0.05021
## 4  Americas                    Mexico 1980   68.34748
## 5  Americas Saint Pierre and Miquelon 1980    0.00599
## 6  Americas             United States 1980  227.22468

To look at the change in population for certain countries, we only need Country, Year, and Population, so I will make a subset with just these three columns. In addition, I will examine a country that has "--" entries for certain years; in this case I selected Aruba.

df.aruba<-subset(longdf.pop, Country=='Aruba', select=c(Country, Year, Population))
head(df.aruba, 30)
##      Country Year Population
## 9      Aruba 1980         --
## 221    Aruba 1981         --
## 433    Aruba 1982         --
## 645    Aruba 1983         --
## 857    Aruba 1984         --
## 1069   Aruba 1985         --
## 1281   Aruba 1986     0.0598
## 1493   Aruba 1987    0.05918
## 1705   Aruba 1988     0.0595
## 1917   Aruba 1989    0.06069
## 2129   Aruba 1990    0.06303
## 2341   Aruba 1991     0.0663
## 2553   Aruba 1992    0.06948
## 2765   Aruba 1993    0.07407
## 2977   Aruba 1994    0.07785
## 3189   Aruba 1995    0.07996
## 3401   Aruba 1996    0.08307
## 3613   Aruba 1997    0.08621
## 3825   Aruba 1998     0.0882
## 4037   Aruba 1999    0.08926
## 4249   Aruba 2000       0.09
## 4461   Aruba 2001    0.09097
## 4673   Aruba 2002    0.09217
## 4885   Aruba 2003    0.09372
## 5097   Aruba 2004    0.09546
## 5309   Aruba 2005    0.09698
## 5521   Aruba 2006     0.0985
## 5733   Aruba 2007    0.10002
## 5945   Aruba 2008    0.10154
## 6157   Aruba 2009    0.10307

After doing some research online, I learned that Aruba became a self-governing country in 1986, when it seceded from the Netherlands Antilles. Was there a trend in the population of Aruba after it became an established nation?

#Lets quickly visualize 
library(ggplot2)
ggplot(data=df.aruba, aes(x=Year, y=Population, group=1)) +
  geom_line(arrow = arrow())+
  geom_point()+
  theme(axis.text.x = element_text(angle = 90, hjust = 1))+
  labs( x="Year", y="Population (In Millions)")

Now I want to examine how population changed over time after a major political event. Let's say I want to see how the population of Cuba changed after the events of the Cold War, which resulted in a U.S. embargo on Cuba. I know that at some point there was a mass exodus of migrants from Cuba to the United States, and Castro's government sent many of its own citizens into exile, including top performers of their time such as Celia Cruz.

More info on the migration of Cubans can be found here: https://www.migrationpolicy.org/article/cuban-immigrants-united-states

df.cuba<-subset(longdf.pop, Country=='Cuba', select=c(Country, Year, Population))
head(df.cuba)
##      Country Year Population
## 19      Cuba 1980    9.65298
## 231     Cuba 1981    9.71198
## 443     Cuba 1982    9.78922
## 655     Cuba 1983    9.88126
## 867     Cuba 1984    9.97373
## 1079    Cuba 1985   10.06563
#Visualize the population 
ggplot(data=df.cuba, aes(x=Year, y=Population, group=1)) +
  geom_line(arrow = arrow())+
  geom_point()+
  theme(axis.text.x = element_text(angle = 90, hjust = 1))+
  labs( x="Year", y="Population (In Millions)")

The plot appears to show a decline in the Cuban population between 1984 and 1985, which would line up with the migration waves described in the linked article. The same caveat applies here, though: Population is still a character column, so the y axis is ordered lexicographically rather than numerically, which can create apparent dips (the table above shows the population actually rising steadily through 1985). Converting Population to numeric, as in the Aruba sketch, shows the true trend: steady growth across the period, as new generations were born under the Castro regime even while emigration continued.

Movie Data Set

This data was shared by Meaghan Burke. The following steps are requested:

1) Deal with missing data
2) Rename columns (there are many, so I will rename only the ones used in my analysis)

I selected this data set because I am a huge fan of movies. I usually watch 4-5 movies a week if time permits.

As for analysis, there were no further requested items, so I will visually examine the budget vs. gross for my favorite franchise, "Star Wars." Any additional analysis will be shown below.

Read in the data

movies <- read_csv(getURL("https://raw.githubusercontent.com/vindication09/DATA-607-Project-2/master/movie_metadata.csv"))
## Warning in rbind(names(probs), probs_f): number of columns of result is not
## a multiple of vector length (arg 1)
## Warning: 4 parsing failures.
## # A tibble: 4 x 5
##     row col    expected   actual      file        
##   <int> <chr>  <chr>      <chr>       <chr>       
## 1  2324 budget an integer 2400000000  literal data
## 2  2989 budget an integer 12215500000 literal data
## 3  3006 budget an integer 2500000000  literal data
## 4  3860 budget an integer 4200000000  literal data
head(movies,3)
## # A tibble: 3 x 28
##   color director_name  num_critic_for_revi… duration director_facebook_li…
##   <chr> <chr>                         <int>    <int>                 <int>
## 1 Color James Cameron                   723      178                     0
## 2 Color Gore Verbinski                  302      169                   563
## 3 Color Sam Mendes                      602      148                     0
## # ... with 23 more variables: actor_3_facebook_likes <int>,
## #   actor_2_name <chr>, actor_1_facebook_likes <int>, gross <int>,
## #   genres <chr>, actor_1_name <chr>, movie_title <chr>,
## #   num_voted_users <int>, cast_total_facebook_likes <int>,
## #   actor_3_name <chr>, facenumber_in_poster <int>, plot_keywords <chr>,
## #   movie_imdb_link <chr>, num_user_for_reviews <int>, language <chr>,
## #   country <chr>, content_rating <chr>, budget <int>, title_year <int>,
## #   actor_2_facebook_likes <int>, imdb_score <dbl>, aspect_ratio <dbl>,
## #   movie_facebook_likes <int>
nrow(movies)
## [1] 5043
names(movies)
##  [1] "color"                     "director_name"            
##  [3] "num_critic_for_reviews"    "duration"                 
##  [5] "director_facebook_likes"   "actor_3_facebook_likes"   
##  [7] "actor_2_name"              "actor_1_facebook_likes"   
##  [9] "gross"                     "genres"                   
## [11] "actor_1_name"              "movie_title"              
## [13] "num_voted_users"           "cast_total_facebook_likes"
## [15] "actor_3_name"              "facenumber_in_poster"     
## [17] "plot_keywords"             "movie_imdb_link"          
## [19] "num_user_for_reviews"      "language"                 
## [21] "country"                   "content_rating"           
## [23] "budget"                    "title_year"               
## [25] "actor_2_facebook_likes"    "imdb_score"               
## [27] "aspect_ratio"              "movie_facebook_likes"
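The parsing failures above come from four budget values that are too large for the integer type readr guessed. If the budget column matters downstream, one option is to re-read with budget forced to a double via readr's col_types argument; a sketch (not run here, to keep the column types shown above):

#movies <- read_csv(getURL("https://raw.githubusercontent.com/vindication09/DATA-607-Project-2/master/movie_metadata.csv"),
#                   col_types = cols(budget = col_double()))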

Let's check the number of NAs and blanks by column.

colSums(is.na(movies)|movies == '')
##                     color             director_name 
##                        19                       104 
##    num_critic_for_reviews                  duration 
##                        50                        15 
##   director_facebook_likes    actor_3_facebook_likes 
##                       104                        23 
##              actor_2_name    actor_1_facebook_likes 
##                        13                         7 
##                     gross                    genres 
##                       884                         0 
##              actor_1_name               movie_title 
##                         7                         0 
##           num_voted_users cast_total_facebook_likes 
##                         0                         0 
##              actor_3_name      facenumber_in_poster 
##                        23                        13 
##             plot_keywords           movie_imdb_link 
##                       153                         0 
##      num_user_for_reviews                  language 
##                        21                        12 
##                   country            content_rating 
##                         5                       303 
##                    budget                title_year 
##                       496                       108 
##    actor_2_facebook_likes                imdb_score 
##                        13                         0 
##              aspect_ratio      movie_facebook_likes 
##                       329                         0
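To put these counts in context against the 5,043 rows, the same check can be expressed as percentages; a small sketch:

#percent of rows that are NA or blank, per column
round(colMeans(is.na(movies) | movies == '') * 100, 1)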

It looks like gross has the most NA or blank entries, with a total of 884. This does not imply that there were no gross earnings, but rather that the information was not collected. Let's compare movie budget and gross; to do that I need to remove the NAs and blanks. Hopefully this will not affect the Star Wars movies.

movies2<-na.omit(movies)
nrow(movies2) #This brings us down from 5043 rows to 3752
## [1] 3752

I want a data frame that collects movie title, budget, and gross for the Star Wars films:

- Star Wars: Episode VII - The Force Awakens
- Star Wars: Episode II - Attack of the Clones
- Star Wars: Episode III - Revenge of the Sith
- Star Wars: Episode I - The Phantom Menace
- Star Wars: Episode VI - Return of the Jedi
- Star Wars: Episode V - The Empire Strikes Back
- Star Wars: Episode IV - A New Hope

If I am lucky, the Star Wars movies are still there because they had no missing values. I find it time-consuming to manually type the full names of the Star Wars films, so I will instead subset them with a regular expression anchored at the start of the title.

#use a regular expression anchored at the start of the title to select all the star wars films
moviecost.df<-movies2[grep("^Star Wars:", movies2$movie_title), ]
head(moviecost.df)
## # A tibble: 6 x 28
##   color director_name   num_critic_for_revi… duration director_facebook_l…
##   <chr> <chr>                          <int>    <int>                <int>
## 1 Color George Lucas                     359      140                    0
## 2 Color George Lucas                     284      142                    0
## 3 Color George Lucas                     320      136                    0
## 4 Color Richard Marqua…                  197      134                   37
## 5 Color Irvin Kershner                   223      127                  883
## 6 Color George Lucas                     282      125                    0
## # ... with 23 more variables: actor_3_facebook_likes <int>,
## #   actor_2_name <chr>, actor_1_facebook_likes <int>, gross <int>,
## #   genres <chr>, actor_1_name <chr>, movie_title <chr>,
## #   num_voted_users <int>, cast_total_facebook_likes <int>,
## #   actor_3_name <chr>, facenumber_in_poster <int>, plot_keywords <chr>,
## #   movie_imdb_link <chr>, num_user_for_reviews <int>, language <chr>,
## #   country <chr>, content_rating <chr>, budget <int>, title_year <int>,
## #   actor_2_facebook_likes <int>, imdb_score <dbl>, aspect_ratio <dbl>,
## #   movie_facebook_likes <int>

I need movie title, budget, and gross. Note that only six films remain: Star Wars: Episode VII - The Force Awakens did not survive na.omit(), so it must have contained missing values.

#i only need movie_title, budget, and gross
starwars.df<-subset(moviecost.df, select=c(movie_title, budget, gross))
head(starwars.df)
## # A tibble: 6 x 3
##   movie_title                                        budget     gross
##   <chr>                                               <int>     <int>
## 1 Star Wars: Episode III - Revenge of the Sith    113000000 380262555
## 2 Star Wars: Episode II - Attack of the Clones    115000000 310675583
## 3 Star Wars: Episode I - The Phantom Menace       115000000 474544677
## 4 Star Wars: Episode VI - Return of the Jedi       32500000 309125409
## 5 Star Wars: Episode V - The Empire Strikes Back   18000000 290158751
## 6 Star Wars: Episode IV - A New Hope               11000000 460935665
starwars.df<-data.frame(starwars.df)

I want to melt this subset using the reshape2 library to make plotting with ggplot easier.

#melt into long dataframe 
library(reshape2)
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
swmelt <- melt(starwars.df, id = 'movie_title')
head(swmelt, 15)
##                                        movie_title variable     value
## 1    Star Wars: Episode III - Revenge of the Sith    budget 113000000
## 2    Star Wars: Episode II - Attack of the Clones    budget 115000000
## 3       Star Wars: Episode I - The Phantom Menace    budget 115000000
## 4      Star Wars: Episode VI - Return of the Jedi    budget  32500000
## 5  Star Wars: Episode V - The Empire Strikes Back    budget  18000000
## 6              Star Wars: Episode IV - A New Hope    budget  11000000
## 7    Star Wars: Episode III - Revenge of the Sith     gross 380262555
## 8    Star Wars: Episode II - Attack of the Clones     gross 310675583
## 9       Star Wars: Episode I - The Phantom Menace     gross 474544677
## 10     Star Wars: Episode VI - Return of the Jedi     gross 309125409
## 11 Star Wars: Episode V - The Empire Strikes Back     gross 290158751
## 12             Star Wars: Episode IV - A New Hope     gross 460935665
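For what it's worth, tidyr (already loaded) can produce the same long frame; a sketch of the equivalent gather() call:

#tidyr equivalent of the melt() call above
swmelt2 <- gather(starwars.df, variable, value, budget:gross)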

Let's create a stacked bar plot that compares gross vs. budget for each of the Star Wars films.

#I want to visually compare the performance of star wars movies by comparing budget to gross 
library("ggplot2")

ggplot() + geom_bar(aes(y = value, x = movie_title, fill =variable ), data = swmelt,stat="identity")+
  theme(axis.text.x = element_text(angle = 90, hjust = 1))+
  labs( x="Star Wars Film", y="Dollar Amount")

From the table above, each Star Wars film's gross exceeded its budget; in the stacked bars, the gross segment is the larger of the two for every film. I would be interested in knowing whether gross here covers just the theatrical release or also takes DVD/Blu-ray sales into account.

I want to examine the correlation between movie rating and number of social media likes. I will need movie title, IMDB score, and movie Facebook likes. These are also fields that can be renamed.

Let's make a subset of the three columns we want and see if any transformations are necessary.

movielikesdf<-subset(movies, select=c(movie_title, imdb_score, movie_facebook_likes))
head(movielikesdf)
## # A tibble: 6 x 3
##   movie_title                                 imdb_score movie_facebook_l…
##   <chr>                                            <dbl>             <int>
## 1 Avatar                                            7.90             33000
## 2 Pirates of the Caribbean: At World's End          7.10                 0
## 3 Spectre                                           6.80             85000
## 4 The Dark Knight Rises                             8.50            164000
## 5 Star Wars: Episode VII - The Force Awakens        7.10                 0
## 6 John Carter                                       6.60             24000

I notice that there are movies that do not have any Facebook likes. It should be noted that this does not imply there were no likes; for some reason, the information was not collected.

There are 2,181 entries that have a zero value for Facebook likes. I will remove them because they would not add any insight into the relationship between social media likes and movie score; further data collection would be required.

colSums(movielikesdf == '0')
##          movie_title           imdb_score movie_facebook_likes 
##                    0                    0                 2181

We should also check if there are NA values

colSums(is.na(movielikesdf))
##          movie_title           imdb_score movie_facebook_likes 
##                    0                    0                    0

Let's remove the zero rows and perform a simple correlation between movie rating and number of Facebook likes. The filtered rows are assigned to a new object so they can be reused:

movielikes.nozero<-movielikesdf[-row(movielikesdf)[movielikesdf == 0],]
movielikes.nozero
## # A tibble: 2,862 x 3
##    movie_title                                 imdb_score movie_facebook_…
##    <chr>                                            <dbl>            <int>
##  1 Avatar                                            7.90            33000
##  2 Spectre                                           6.80            85000
##  3 The Dark Knight Rises                             8.50           164000
##  4 John Carter                                       6.60            24000
##  5 Tangled                                           7.80            29000
##  6 Avengers: Age of Ultron                           7.50           118000
##  7 Harry Potter and the Half-Blood Prince            7.50            10000
##  8 Batman v Superman: Dawn of Justice                6.90           197000
##  9 Pirates of the Caribbean: Dead Man's Chest        7.30             5000
## 10 The Lone Ranger                                   6.50            48000
## # ... with 2,852 more rows
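The same filter expressed with dplyr, as a sketch; the zeros all sit in the likes column, so filtering on that column alone is equivalent here:

#dplyr equivalent of the zero-row removal above
movielikes.nozero <- movielikesdf %>% filter(movie_facebook_likes != 0)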

Let's run a simple correlation test, starting visually.

library("ggpubr")
## Loading required package: magrittr
## 
## Attaching package: 'magrittr'
## The following object is masked from 'package:tidyr':
## 
##     extract
ggscatter(movielikesdf, x="imdb_score", y="movie_facebook_likes" , 
          add = "reg.line", conf.int = TRUE, 
          cor.coef = TRUE, cor.method = "pearson",
          xlab = "IMDB Score", ylab = "Facebook Likes")

I can get a better story with a numerical test such as a Pearson Correlation Test

x<-movielikesdf$imdb_score
y<-movielikesdf$movie_facebook_likes
res <- cor.test(x, y, 
                    method = "pearson")
res
## 
##  Pearson's product-moment correlation
## 
## data:  x and y
## t = 18.102, df = 5041, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2209543 0.2727893
## sample estimates:
##       cor 
## 0.2470485

Our t statistic is 18.102 with 5041 degrees of freedom, and the p-value is very small. If I pick my alpha to be .05, I can conclude that IMDB rating and Facebook likes are correlated, with a correlation coefficient of 0.247, a weak positive relationship. (Note that 5041 degrees of freedom corresponds to all 5,043 movies, so this test ran on the full movielikesdf; rerunning it on the zero-filtered movielikes.nozero would be the consistent follow-up.) Because of the weak correlation coefficient, I would want more information, such as whether the Facebook likes came before or after a movie's release. Some movies are hyped up and collect lots of Facebook likes before they are actually released, and I wonder if that is the case here. It would be great to be able to partition the likes into pre-release and post-release likes.

Sample Web Traffic Data

This final dataset is my own. It is similar to the type of data I work with in media, specifically web traffic. I am unable to supply actual data for legal reasons, but I can simulate data that looks very similar.

This type of data has user-level granularity. When a visitor comes to our site and engages with content such as watching a video, reading an article, or checking stock quotes, they are identified with a unique id and their behaviors on the site are collected. We usually collect about 1 TB of data a day, and our engineering team aggregates the data into a monthly format.

Let's simulate some data. For the sake of the problem, we will generate just 10 unique ids and randomly assign attributes to them.

#This will simulate unique strings that look like user ids 
library(stringi)
x<-stri_rand_strings(10, 6)
df.x<-data.frame(as.character(x))
df.x
##    as.character.x.
## 1           ua9IWm
## 2           9l3Grm
## 3           jzjWLg
## 4           f8JwJH
## 5           0f8VGC
## 6           zRwIyt
## 7           KkvdrI
## 8           3LLLCQ
## 9           ZXVIKx
## 10          ihsK6n
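One note: stri_rand_strings() draws from R's random number generator, so these ids change on every knit. A sketch that pins them down, using an arbitrary seed value and an illustrative variable name:

#set an arbitrary seed so the simulated ids are reproducible across runs
set.seed(607)
x.repro <- stri_rand_strings(10, 6)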

We then assign each user a professional group they are associated with. Some values will include junk.

progroups<-c("Finance", "HR", "IT", "Tech", "NULL", "Finance", "IT", "HR", "Finance", "Finance")
df.progroups<-data.frame(as.character(progroups))
df.progroups
##    as.character.progroups.
## 1                  Finance
## 2                       HR
## 3                       IT
## 4                     Tech
## 5                     NULL
## 6                  Finance
## 7                       IT
## 8                       HR
## 9                  Finance
## 10                 Finance

We randomly assign users a Country. Country is usually identified using the user agent attached to a visitor coming to our site. The user agent also contains information on a visitor's browser type or operating system; these are known as technographics.

country<-c("US","Canada","South Korea","US", "US", "Mexico", "France","Colombia","US", "Private")
df.country<-data.frame(as.character(country))
df.country
##    as.character.country.
## 1                     US
## 2                 Canada
## 3            South Korea
## 4                     US
## 5                     US
## 6                 Mexico
## 7                 France
## 8               Colombia
## 9                     US
## 10               Private

Let's randomly assign users a referral domain. This is simply where a visitor came from before arriving at our site. For example, a user sees a link on Facebook shared by my company and clicks on it; this user's referral domain would simply be Facebook. This attribute has been known to show lots of junk values, possibly because of private browser settings or the use of VPNs.

refdom<-c("yahoo","NULL","fb", "Twitter", "NULL", "#g45y6", "Yahoo","ggggggggg", "Website","Facebook")
df.refdom<-data.frame(as.character(refdom))
df.refdom
##    as.character.refdom.
## 1                 yahoo
## 2                  NULL
## 3                    fb
## 4               Twitter
## 5                  NULL
## 6                #g45y6
## 7                 Yahoo
## 8             ggggggggg
## 9               Website
## 10             Facebook

Now let's simulate the number of hits on a specific page per user. This is the number of times a user engaged with the content on our finance page or our politics page; "engage" is a blanket term that covers clicks, mouse hovers, and video watching. It is not unusual to see a single user coming to a page more than 100 times. We actually have a way to monitor page refreshes: someone might land on a page and squat there, and since each page refreshes every minute, each refresh counts as a hit.

For now, let's say we have only 4 pages we are interested in collecting user information on.

FinanceHits<-c("12","45", "89", "0", 'NULL',"20","21", "NULL", "NULL", "67")
PoliticsHits<-c("34","89","398","473", "42", '933',"57", "NULL","0", "45")
NewsHits<-c("44","4982", "879","490","624", "903", "8933", "NULL", "902","270")
BusinessHits<-c("12", "532", "90", "493", "909", "47", "3", "0", "2", "1")
df.hits<-data.frame(FinanceHits, PoliticsHits, NewsHits, BusinessHits)
df.hits
##    FinanceHits PoliticsHits NewsHits BusinessHits
## 1           12           34       44           12
## 2           45           89     4982          532
## 3           89          398      879           90
## 4            0          473      490          493
## 5         NULL           42      624          909
## 6           20          933      903           47
## 7           21           57     8933            3
## 8         NULL         NULL     NULL            0
## 9         NULL            0      902            2
## 10          67           45      270            1
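A design note on this simulation: storing hits as character vectors containing the string "NULL" (which data.frame() then turns into factors) forces the type conversions needed later. A sketch of an alternative encoding that sidesteps this, using numeric NA for missing values and stringsAsFactors = FALSE; the .num names are just illustrative:

#numeric vectors with NA for missing values; no factor/character conversion needed later
FinanceHits.num <- c(12, 45, 89, 0, NA, 20, 21, NA, NA, 67)
PoliticsHits.num <- c(34, 89, 398, 473, 42, 933, 57, NA, 0, 45)
df.hits.num <- data.frame(FinanceHits.num, PoliticsHits.num, stringsAsFactors = FALSE)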

Let's put together our sample data frame.

sample.df<-data.frame(df.x,df.progroups,df.country, df.refdom, df.hits)
sample.df
##    as.character.x. as.character.progroups. as.character.country.
## 1           ua9IWm                 Finance                    US
## 2           9l3Grm                      HR                Canada
## 3           jzjWLg                      IT           South Korea
## 4           f8JwJH                    Tech                    US
## 5           0f8VGC                    NULL                    US
## 6           zRwIyt                 Finance                Mexico
## 7           KkvdrI                      IT                France
## 8           3LLLCQ                      HR              Colombia
## 9           ZXVIKx                 Finance                    US
## 10          ihsK6n                 Finance               Private
##    as.character.refdom. FinanceHits PoliticsHits NewsHits BusinessHits
## 1                 yahoo          12           34       44           12
## 2                  NULL          45           89     4982          532
## 3                    fb          89          398      879           90
## 4               Twitter           0          473      490          493
## 5                  NULL        NULL           42      624          909
## 6                #g45y6          20          933      903           47
## 7                 Yahoo          21           57     8933            3
## 8             ggggggggg        NULL         NULL     NULL            0
## 9               Website        NULL            0      902            2
## 10             Facebook          67           45      270            1
colnames(sample.df)[colnames(sample.df)=="as.character.x."] <- "visitorID"
colnames(sample.df)[colnames(sample.df)=="as.character.progroups."] <- "ProGroup"
colnames(sample.df)[colnames(sample.df)=="as.character.country."] <- "Country"
colnames(sample.df)[colnames(sample.df)=="as.character.refdom."] <- "ReferralDomain"
head(sample.df) 
##   visitorID ProGroup     Country ReferralDomain FinanceHits PoliticsHits
## 1    ua9IWm  Finance          US          yahoo          12           34
## 2    9l3Grm       HR      Canada           NULL          45           89
## 3    jzjWLg       IT South Korea             fb          89          398
## 4    f8JwJH     Tech          US        Twitter           0          473
## 5    0f8VGC     NULL          US           NULL        NULL           42
## 6    zRwIyt  Finance      Mexico         #g45y6          20          933
##   NewsHits BusinessHits
## 1       44           12
## 2     4982          532
## 3      879           90
## 4      490          493
## 5      624          909
## 6      903           47

We now have a sample wide format that simulates web traffic data at the user level. We can proceed to perform transformations on the data set as needed.

Since this sample data is small, removing a single user (for example 0f8VGC) means losing all of that user's hit values. If I were modeling on realistic data, removing users would not be a problem because the data contains millions of users.

I want to look at the number of hits to a page grouped by professional group.

longsample <- sample.df %>% gather(Page, Hits, FinanceHits:BusinessHits)
## Warning: attributes are not identical across measure variables;
## they will be dropped
head(longsample, 20)  
##    visitorID ProGroup     Country ReferralDomain         Page  Hits
## 1     ua9IWm  Finance          US          yahoo  FinanceHits    12
## 2     9l3Grm       HR      Canada           NULL  FinanceHits    45
## 3     jzjWLg       IT South Korea             fb  FinanceHits    89
## 4     f8JwJH     Tech          US        Twitter  FinanceHits     0
## 5     0f8VGC     NULL          US           NULL  FinanceHits  NULL
## 6     zRwIyt  Finance      Mexico         #g45y6  FinanceHits    20
## 7     KkvdrI       IT      France          Yahoo  FinanceHits    21
## 8     3LLLCQ       HR    Colombia      ggggggggg  FinanceHits  NULL
## 9     ZXVIKx  Finance          US        Website  FinanceHits  NULL
## 10    ihsK6n  Finance     Private       Facebook  FinanceHits    67
## 11    ua9IWm  Finance          US          yahoo PoliticsHits    34
## 12    9l3Grm       HR      Canada           NULL PoliticsHits    89
## 13    jzjWLg       IT South Korea             fb PoliticsHits   398
## 14    f8JwJH     Tech          US        Twitter PoliticsHits   473
## 15    0f8VGC     NULL          US           NULL PoliticsHits    42
## 16    zRwIyt  Finance      Mexico         #g45y6 PoliticsHits   933
## 17    KkvdrI       IT      France          Yahoo PoliticsHits    57
## 18    3LLLCQ       HR    Colombia      ggggggggg PoliticsHits  NULL
## 19    ZXVIKx  Finance          US        Website PoliticsHits     0
## 20    ihsK6n  Finance     Private       Facebook PoliticsHits    45

We now have a long data frame that can be subset to answer the first question: compare the number of hits per page type by professional group. I notice that there are NULL values; since they won't add anything to the analysis, I can remove those rows.

longsample2<-longsample[!grepl("NULL", longsample$Hits),]
head(longsample2, 20)
##    visitorID ProGroup     Country ReferralDomain         Page  Hits
## 1     ua9IWm  Finance          US          yahoo  FinanceHits    12
## 2     9l3Grm       HR      Canada           NULL  FinanceHits    45
## 3     jzjWLg       IT South Korea             fb  FinanceHits    89
## 4     f8JwJH     Tech          US        Twitter  FinanceHits     0
## 6     zRwIyt  Finance      Mexico         #g45y6  FinanceHits    20
## 7     KkvdrI       IT      France          Yahoo  FinanceHits    21
## 10    ihsK6n  Finance     Private       Facebook  FinanceHits    67
## 11    ua9IWm  Finance          US          yahoo PoliticsHits    34
## 12    9l3Grm       HR      Canada           NULL PoliticsHits    89
## 13    jzjWLg       IT South Korea             fb PoliticsHits   398
## 14    f8JwJH     Tech          US        Twitter PoliticsHits   473
## 15    0f8VGC     NULL          US           NULL PoliticsHits    42
## 16    zRwIyt  Finance      Mexico         #g45y6 PoliticsHits   933
## 17    KkvdrI       IT      France          Yahoo PoliticsHits    57
## 19    ZXVIKx  Finance          US        Website PoliticsHits     0
## 20    ihsK6n  Finance     Private       Facebook PoliticsHits    45
## 21    ua9IWm  Finance          US          yahoo     NewsHits    44
## 22    9l3Grm       HR      Canada           NULL     NewsHits  4982
## 23    jzjWLg       IT South Korea             fb     NewsHits   879
## 24    f8JwJH     Tech          US        Twitter     NewsHits   490

The gather() call already gave the key and value columns proper names (Page and Hits), so no renaming is needed. A quick check:

head(longsample2)
##   visitorID ProGroup     Country ReferralDomain        Page Hits
## 1    ua9IWm  Finance          US          yahoo FinanceHits   12
## 2    9l3Grm       HR      Canada           NULL FinanceHits   45
## 3    jzjWLg       IT South Korea             fb FinanceHits   89
## 4    f8JwJH     Tech          US        Twitter FinanceHits    0
## 6    zRwIyt  Finance      Mexico         #g45y6 FinanceHits   20
## 7    KkvdrI       IT      France          Yahoo FinanceHits   21

We now have a long data frame. If we want to use dplyr to compare the number of hits by pro group, it is better to use the wide version of the data frame. We can go ahead and remove rows that contain null values.

sample.df2<-sample.df[!grepl("NULL", sample.df$FinanceHits),]
sample.df2<-data.frame(sample.df2)
head(sample.df2, 8)
##    visitorID ProGroup     Country ReferralDomain FinanceHits PoliticsHits
## 1     ua9IWm  Finance          US          yahoo          12           34
## 2     9l3Grm       HR      Canada           NULL          45           89
## 3     jzjWLg       IT South Korea             fb          89          398
## 4     f8JwJH     Tech          US        Twitter           0          473
## 6     zRwIyt  Finance      Mexico         #g45y6          20          933
## 7     KkvdrI       IT      France          Yahoo          21           57
## 10    ihsK6n  Finance     Private       Facebook          67           45
##    NewsHits BusinessHits
## 1        44           12
## 2      4982          532
## 3       879           90
## 4       490          493
## 6       903           47
## 7      8933            3
## 10      270            1

Before I do any comparisons, I need to check the data types: mean() will not work correctly if the hits columns are not numeric.

str(sample.df2)
## 'data.frame':    7 obs. of  8 variables:
##  $ visitorID     : Factor w/ 10 levels "0f8VGC","3LLLCQ",..: 8 3 6 4 9 7 5
##  $ ProGroup      : Factor w/ 5 levels "Finance","HR",..: 1 2 3 5 1 3 1
##  $ Country       : Factor w/ 7 levels "Canada","Colombia",..: 7 1 6 7 4 3 5
##  $ ReferralDomain: Factor w/ 9 levels "#g45y6","Facebook",..: 8 5 3 6 1 9 2
##  $ FinanceHits   : Factor w/ 8 levels "0","12","20",..: 2 5 7 1 3 4 6
##  $ PoliticsHits  : Factor w/ 10 levels "0","34","398",..: 2 8 3 6 9 7 5
##  $ NewsHits      : Factor w/ 10 levels "270","44","490",..: 2 4 6 3 9 7 1
##  $ BusinessHits  : Factor w/ 10 levels "0","1","12","2",..: 3 8 9 7 6 5 2
#convert through as.character() first: as.numeric() on a factor returns the level codes, not the values
sample.df2$FinanceHits<-as.numeric(as.character(sample.df2$FinanceHits))
sample.df2$PoliticsHits<-as.numeric(as.character(sample.df2$PoliticsHits))
sample.df2$NewsHits<-as.numeric(as.character(sample.df2$NewsHits))
sample.df2$BusinessHits<-as.numeric(as.character(sample.df2$BusinessHits))

We use dplyr to compare the number of hits to a page grouped by progroup.

sample.df2 %>%
  group_by(ProGroup)%>% 
  summarise(Mean_Financehits = mean(FinanceHits, na.rm=TRUE), 
            Mean_Politicshits=mean(PoliticsHits, na.rm=TRUE), 
            Mean_NewsHits=mean(NewsHits, na.rm=TRUE), 
            Mean_BusinessHits=mean(BusinessHits, na.rm=TRUE))
## # A tibble: 4 x 5
##   ProGroup Mean_Financehits Mean_Politicshits Mean_NewsHits
##   <fct>               <dbl>             <dbl>         <dbl>
## 1 Finance              33.0             337.           406.
## 2 HR                   45.0              89.0         4982.
## 3 IT                   55.0             228.          4906.
## 4 Tech                  0                473.           490.
## # ... with 1 more variable: Mean_BusinessHits <dbl>

On average, people who identify as finance professionals engage with the News pages more than the other pages. With this insight, teams across editorial can figure out where finance pros are getting the latest finance news.

We can also group by attributes. In the case of our sample data, we have one attribute which is the referral domain. In actual web traffic, we have hundreds of columns for attributes such as brand, topic, adcode, transport, etc.

sample.df2 %>%
  group_by(ReferralDomain)%>% 
  summarise(Mean_Financehits = mean(FinanceHits, na.rm=TRUE), 
            Mean_Politicshits=mean(PoliticsHits, na.rm=TRUE), 
            Mean_NewsHits=mean(NewsHits, na.rm=TRUE), 
            Mean_BusinessHits=mean(BusinessHits, na.rm=TRUE))
## # A tibble: 7 x 5
##   ReferralDomain Mean_Financehits Mean_Politicshits Mean_NewsHits
##   <fct>                     <dbl>             <dbl>         <dbl>
## 1 #g45y6                     20.0              933.          903.
## 2 Facebook                   67.0               45.0         270.
## 3 fb                         89.0              398.          879.
## 4 NULL                       45.0               89.0        4982.
## 5 Twitter                     0                473.          490.
## 6 yahoo                      12.0               34.0          44.0
## 7 Yahoo                      21.0               57.0        8933.
## # ... with 1 more variable: Mean_BusinessHits <dbl>

I notice that there are junk referral domains. Suppose I want to look at the engagement of just the users who came from social media: the users referred from Facebook and fb engaged most with News content, while the Twitter user's hits were spread almost evenly across Business, News, and Politics content. With only one user per referral domain in this small simulation, each mean is really a single observation, but on real data the same grouping would show which platforms drive engagement with each page.