Choose any three of the wide datasets identified in the Week 6 discussion items. (You may choose your own.) Read the information from your CSV into R, and use tidyr and dplyr as needed to transform the data. Perform the analysis requested in the discussion item.
Population Dataset. Steven Tipton shared this data set in the discussion items. The following steps are requested: 1) melt the data into a tidy format with just country, year, and population; 2) explore the NAs and "--" entries, and research notable events that could have caused changes in population; 3) remove entries such as Eurasia, North America, and World so that only countries remain; 4) see how population changed around the world after political shifts; 5) add a column that classifies countries into continents.
This data set stood out to me because I find it really interesting to see how the populations of certain nations changed over time. There are also historical reasons behind these population shifts, and I want to highlight some of them in my analysis with easy-to-read visualizations.
Load Libraries. Any additional libraries will be loaded as needed.
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(readr)
library(RCurl)
## Loading required package: bitops
##
## Attaching package: 'RCurl'
## The following object is masked from 'package:tidyr':
##
## complete
Read in the data, which can be found at the link below: https://raw.githubusercontent.com/vindication09/DATA-607-Project-2/master/populationbycountry19802010millions.csv
I will read in the data from a location on my own machine
library(readr)
populationbycountry19802010millions <- read_csv("~/Desktop/DATA Science SPS/DATA 607/Week 5:6/Project2Data/populationbycountry19802010millions.csv")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
## .default = col_character()
## )
## See spec(...) for full column specifications.
head(populationbycountry19802010millions)
## # A tibble: 6 x 32
## X1 `1980` `1981` `1982` `1983` `1984` `1985` `1986` `1987` `1988`
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 North … 320.27… 324.44… 328.6… 332.7… 336.7… 340.7… 344.8… 349.0… 353.2…
## 2 Bermuda 0.05473 0.05491 0.055… 0.055… 0.055… 0.056… 0.056… 0.056… 0.057…
## 3 Canada 24.5933 24.9 25.20… 25.45… 25.70… 25.94… 26.20… 26.54… 26.89…
## 4 Greenl… 0.05021 0.05103 0.051… 0.052… 0.052… 0.053… 0.053… 0.0541 0.054…
## 5 Mexico 68.347… 69.969… 71.64… 73.36… 75.08… 76.76… 78.44… 80.12… 81.78…
## 6 Saint … 0.00599 0.00601 0.006… 0.006… 0.006… 0.006… 0.006… 0.006… 0.006…
## # ... with 22 more variables: `1989` <chr>, `1990` <chr>, `1991` <chr>,
## # `1992` <chr>, `1993` <chr>, `1994` <chr>, `1995` <chr>, `1996` <chr>,
## # `1997` <chr>, `1998` <chr>, `1999` <chr>, `2000` <chr>, `2001` <chr>,
## # `2002` <chr>, `2003` <chr>, `2004` <chr>, `2005` <chr>, `2006` <chr>,
## # `2007` <chr>, `2008` <chr>, `2009` <chr>, `2010` <chr>
I want to rename column 1. It is currently called X1, but I want it to be my country column.
colnames(populationbycountry19802010millions)[which(names(populationbycountry19802010millions) == "X1")] <- "Country"
head(populationbycountry19802010millions, 2)
## # A tibble: 2 x 32
## Country `1980` `1981` `1982` `1983` `1984` `1985` `1986` `1987` `1988`
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 North A… 320.27… 324.4… 328.6… 332.7… 336.7… 340.7… 344.8… 349.0… 353.2…
## 2 Bermuda 0.05473 0.054… 0.055… 0.055… 0.055… 0.056… 0.056… 0.056… 0.057…
## # ... with 22 more variables: `1989` <chr>, `1990` <chr>, `1991` <chr>,
## # `1992` <chr>, `1993` <chr>, `1994` <chr>, `1995` <chr>, `1996` <chr>,
## # `1997` <chr>, `1998` <chr>, `1999` <chr>, `2000` <chr>, `2001` <chr>,
## # `2002` <chr>, `2003` <chr>, `2004` <chr>, `2005` <chr>, `2006` <chr>,
## # `2007` <chr>, `2008` <chr>, `2009` <chr>, `2010` <chr>
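As an aside, dplyr (already loaded) offers a more readable one-liner for the same rename; a minimal sketch of the alternative to the base-R subscripting above:

#dplyr alternative: rename X1 to Country in one step
populationbycountry19802010millions <- populationbycountry19802010millions %>%
  rename(Country = X1)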
I want to create a column that classifies countries into continents. There is a package that does this called "countrycode." Before we use it, we need to do some data prep by removing entries that can't be classified, since the classification only maps countries to continents.
Let's remove entries that are not countries: North America, Central & South America, Antarctica, Eurasia, Middle East, Asia & Oceania, World, Africa, Europe, Former Czechoslovakia, Former Serbia and Montenegro, Former Yugoslavia, East, Hawaiian Trade Zone, U.S. Pacific Islands, Wake Island, and Former U.S.S.R.
Identifying which entries to remove was easy: the first time I ran countrycode, the warning message listed every entry that could not be classified. I copied that list from the warning and used it as the driving vector to remove them from my data.
#Make a vector that lists all the countries that could not be classified by country code
remove<- c('North America', 'Central & South America', 'Antarctica',
'Eurasia', 'Middle East',
'Asia & Oceania',
'World', 'Africa', 'Europe',
'Former Czechoslovakia', 'Former Serbia and Montenegro', 'Former Yugoslavia',
'East', 'Hawaiian Trade Zone', 'U.S. Pacific Islands', 'Wake Island', 'Former U.S.S.R.')
dfpop <- populationbycountry19802010millions[ !grepl(paste(remove, collapse="|"), populationbycountry19802010millions$Country),]
head(dfpop, 2)
## # A tibble: 2 x 32
## Country `1980` `1981` `1982` `1983` `1984` `1985` `1986` `1987` `1988`
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Bermuda 0.05473 0.05491 0.055… 0.055… 0.055… 0.056… 0.056… 0.056… 0.057…
## 2 Canada 24.5933 24.9 25.20… 25.45… 25.70… 25.94… 26.20… 26.54… 26.89…
## # ... with 22 more variables: `1989` <chr>, `1990` <chr>, `1991` <chr>,
## # `1992` <chr>, `1993` <chr>, `1994` <chr>, `1995` <chr>, `1996` <chr>,
## # `1997` <chr>, `1998` <chr>, `1999` <chr>, `2000` <chr>, `2001` <chr>,
## # `2002` <chr>, `2003` <chr>, `2004` <chr>, `2005` <chr>, `2006` <chr>,
## # `2007` <chr>, `2008` <chr>, `2009` <chr>, `2010` <chr>
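One caveat with grepl(): it matches substrings, so a removal pattern like "East" would also drop any country whose name happens to contain it. A sketch of an exact-match alternative using dplyr::filter():

#exact matching avoids accidentally dropping countries whose names
#contain one of the removal strings
dfpop <- populationbycountry19802010millions %>%
  filter(!Country %in% remove)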
We also need to convert the tibble into a plain data frame, or countrycode will not work: with a tibble, subsetting a single column with [, "Country"] returns a one-column tibble rather than the character vector countrycode expects.
df.pop2<-data.frame(dfpop)
nrow(df.pop2)
## [1] 212
head(df.pop2)
## Country X1980 X1981 X1982 X1983
## 1 Bermuda 0.05473 0.05491 0.05517 0.05551
## 2 Canada 24.5933 24.9 25.2019 25.4563
## 3 Greenland 0.05021 0.05103 0.05166 0.05211
## 4 Mexico 68.34748 69.96926 71.6409 73.36288
## 5 Saint Pierre and Miquelon 0.00599 0.00601 0.00605 0.00607
## 6 United States 227.22468 229.46571 231.66446 233.79199
## X1984 X1985 X1986 X1987 X1988 X1989 X1990
## 1 0.05585 0.05618 0.05651 0.05683 0.05717 0.05749 0.05778
## 2 25.7018 25.9416 26.2038 26.5497 26.8948 27.3793 27.7906
## 3 0.05263 0.05315 0.05364 0.0541 0.05485 0.05541 0.05563
## 4 75.08014 76.76723 78.44243 80.12249 81.78182 83.36684 84.91365
## 5 0.00611 0.00616 0.00621 0.00625 0.00628 0.00631 0.00632
## 6 235.8249 237.9238 240.13289 242.28892 244.49898 246.81923 249.62281
## X1991 X1992 X1993 X1994 X1995 X1996 X1997
## 1 0.0581 0.0587 0.05924 0.05975 0.06029 0.06087 0.06145
## 2 28.1179 28.54489 28.95334 29.33081 29.69053 30.02632 30.3056
## 3 0.05554 0.05549 0.05564 0.05592 0.05619 0.05634 0.05651
## 4 86.48803 88.11103 89.74914 91.3379 92.88035 94.39858 95.89515
## 5 0.00633 0.00636 0.00638 0.0064 0.0064 0.00641 0.00642
## 6 252.98094 256.51422 259.91859 263.12582 266.27839 269.39428 272.64693
## X1998 X1999 X2000 X2001 X2002 X2003 X2004
## 1 0.06198 0.06251 0.06306 0.06361 0.06418 0.06476 0.06534
## 2 30.55166 30.82026 31.09956 31.37674 31.64096 31.88931 32.13476
## 3 0.05661 0.0567 0.05689 0.05713 0.05736 0.05754 0.0577
## 4 97.32506 98.61691 99.92662 101.24696 102.47993 103.71806 104.95959
## 5 0.00642 0.00643 0.00641 0.00637 0.00633 0.00629 0.00625
## 6 275.8541 279.04017 282.17196 285.08156 287.80391 290.32642 293.04574
## X2005 X2006 X2007 X2008 X2009 X2010
## 1 0.06591 0.06644 0.06692 0.06739 0.06784 0.06827
## 2 32.38638 32.65668 32.93596 33.2127 33.48721 33.75974
## 3 0.05778 0.05764 0.05753 0.05756 0.0576 0.05764
## 4 106.2029 107.44953 108.70089 109.9554 111.21179 112.46886
## 5 0.0062 0.00615 0.0061 0.00605 0.006 0.00594
## 6 295.75315 298.59321 301.5799 304.37485 307.00655 310.23286
Converting to a data frame put an X in front of every year column name: data.frame() checks names by default, and make.names() prefixes syntactically invalid names such as "1980" with an X. We can use a regular expression to clean the column names.
#remove the X prefix (the ^ anchor ensures only a leading X is stripped)
names(df.pop2) <- gsub(x = names(df.pop2), pattern = "^X", replacement = "")
names(df.pop2)
## [1] "Country" "1980" "1981" "1982" "1983" "1984" "1985"
## [8] "1986" "1987" "1988" "1989" "1990" "1991" "1992"
## [15] "1993" "1994" "1995" "1996" "1997" "1998" "1999"
## [22] "2000" "2001" "2002" "2003" "2004" "2005" "2006"
## [29] "2007" "2008" "2009" "2010"
head(df.pop2, 2)
## Country 1980 1981 1982 1983 1984 1985 1986 1987
## 1 Bermuda 0.05473 0.05491 0.05517 0.05551 0.05585 0.05618 0.05651 0.05683
## 2 Canada 24.5933 24.9 25.2019 25.4563 25.7018 25.9416 26.2038 26.5497
## 1988 1989 1990 1991 1992 1993 1994 1995
## 1 0.05717 0.05749 0.05778 0.0581 0.0587 0.05924 0.05975 0.06029
## 2 26.8948 27.3793 27.7906 28.1179 28.54489 28.95334 29.33081 29.69053
## 1996 1997 1998 1999 2000 2001 2002 2003
## 1 0.06087 0.06145 0.06198 0.06251 0.06306 0.06361 0.06418 0.06476
## 2 30.02632 30.3056 30.55166 30.82026 31.09956 31.37674 31.64096 31.88931
## 2004 2005 2006 2007 2008 2009 2010
## 1 0.06534 0.06591 0.06644 0.06692 0.06739 0.06784 0.06827
## 2 32.13476 32.38638 32.65668 32.93596 33.2127 33.48721 33.75974
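Alternatively, the X prefix never appears if name checking is disabled at conversion time; a minimal sketch:

#check.names = FALSE keeps the original "1980", "1981", ... headers,
#making the gsub() cleanup above unnecessary
df.pop2 <- data.frame(dfpop, check.names = FALSE)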
The data is now prepped for countrycode. The function scans the Country column and produces the associated continent for each entry, which we store in its own data frame. We can then append that data frame to df.pop2 to complete the task of adding a continent column for each country.
#Let's classify country names into their continents using the countrycode library
library(countrycode)
df.continent <- as.factor(countrycode(sourcevar = df.pop2[, "Country"], origin = "country.name", destination = "continent"))
df.continent<-data.frame(df.continent)
nrow(df.continent)
## [1] 212
head(df.continent)
## df.continent
## 1 Americas
## 2 Americas
## 3 Americas
## 4 Americas
## 5 Americas
## 6 Americas
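For reference, the append-and-rename steps below could also be collapsed into a single mutate() call; a sketch assuming the cleaned df.pop2:

#one-step alternative: attach the continent classification directly
df.countries <- df.pop2 %>%
  mutate(Continent = countrycode(Country, origin = "country.name", destination = "continent"))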
#now append continent as its own column
df.countries<-data.frame(df.continent, df.pop2)
head(df.countries)
## df.continent Country X1980 X1981 X1982
## 1 Americas Bermuda 0.05473 0.05491 0.05517
## 2 Americas Canada 24.5933 24.9 25.2019
## 3 Americas Greenland 0.05021 0.05103 0.05166
## 4 Americas Mexico 68.34748 69.96926 71.6409
## 5 Americas Saint Pierre and Miquelon 0.00599 0.00601 0.00605
## 6 Americas United States 227.22468 229.46571 231.66446
## X1983 X1984 X1985 X1986 X1987 X1988 X1989
## 1 0.05551 0.05585 0.05618 0.05651 0.05683 0.05717 0.05749
## 2 25.4563 25.7018 25.9416 26.2038 26.5497 26.8948 27.3793
## 3 0.05211 0.05263 0.05315 0.05364 0.0541 0.05485 0.05541
## 4 73.36288 75.08014 76.76723 78.44243 80.12249 81.78182 83.36684
## 5 0.00607 0.00611 0.00616 0.00621 0.00625 0.00628 0.00631
## 6 233.79199 235.8249 237.9238 240.13289 242.28892 244.49898 246.81923
## X1990 X1991 X1992 X1993 X1994 X1995 X1996
## 1 0.05778 0.0581 0.0587 0.05924 0.05975 0.06029 0.06087
## 2 27.7906 28.1179 28.54489 28.95334 29.33081 29.69053 30.02632
## 3 0.05563 0.05554 0.05549 0.05564 0.05592 0.05619 0.05634
## 4 84.91365 86.48803 88.11103 89.74914 91.3379 92.88035 94.39858
## 5 0.00632 0.00633 0.00636 0.00638 0.0064 0.0064 0.00641
## 6 249.62281 252.98094 256.51422 259.91859 263.12582 266.27839 269.39428
## X1997 X1998 X1999 X2000 X2001 X2002 X2003
## 1 0.06145 0.06198 0.06251 0.06306 0.06361 0.06418 0.06476
## 2 30.3056 30.55166 30.82026 31.09956 31.37674 31.64096 31.88931
## 3 0.05651 0.05661 0.0567 0.05689 0.05713 0.05736 0.05754
## 4 95.89515 97.32506 98.61691 99.92662 101.24696 102.47993 103.71806
## 5 0.00642 0.00642 0.00643 0.00641 0.00637 0.00633 0.00629
## 6 272.64693 275.8541 279.04017 282.17196 285.08156 287.80391 290.32642
## X2004 X2005 X2006 X2007 X2008 X2009 X2010
## 1 0.06534 0.06591 0.06644 0.06692 0.06739 0.06784 0.06827
## 2 32.13476 32.38638 32.65668 32.93596 33.2127 33.48721 33.75974
## 3 0.0577 0.05778 0.05764 0.05753 0.05756 0.0576 0.05764
## 4 104.95959 106.2029 107.44953 108.70089 109.9554 111.21179 112.46886
## 5 0.00625 0.0062 0.00615 0.0061 0.00605 0.006 0.00594
## 6 293.04574 295.75315 298.59321 301.5799 304.37485 307.00655 310.23286
We rename the continent column to complete the task.
#rename, we now have added a column that classifies countries into continents
colnames(df.countries)[which(names(df.countries) == "df.continent")] <- "Continent"
names(df.countries)
## [1] "Continent" "Country" "X1980" "X1981" "X1982"
## [6] "X1983" "X1984" "X1985" "X1986" "X1987"
## [11] "X1988" "X1989" "X1990" "X1991" "X1992"
## [16] "X1993" "X1994" "X1995" "X1996" "X1997"
## [21] "X1998" "X1999" "X2000" "X2001" "X2002"
## [26] "X2003" "X2004" "X2005" "X2006" "X2007"
## [31] "X2008" "X2009" "X2010"
The X prefix is back in front of the year columns: the data.frame() call above re-applied R's name checking. The same fix works again.
names(df.countries) <- gsub(x = names(df.countries), pattern = "^X", replacement = "")
names(df.countries)
## [1] "Continent" "Country" "1980" "1981" "1982"
## [6] "1983" "1984" "1985" "1986" "1987"
## [11] "1988" "1989" "1990" "1991" "1992"
## [16] "1993" "1994" "1995" "1996" "1997"
## [21] "1998" "1999" "2000" "2001" "2002"
## [26] "2003" "2004" "2005" "2006" "2007"
## [31] "2008" "2009" "2010"
head(df.countries)
## Continent Country 1980 1981 1982
## 1 Americas Bermuda 0.05473 0.05491 0.05517
## 2 Americas Canada 24.5933 24.9 25.2019
## 3 Americas Greenland 0.05021 0.05103 0.05166
## 4 Americas Mexico 68.34748 69.96926 71.6409
## 5 Americas Saint Pierre and Miquelon 0.00599 0.00601 0.00605
## 6 Americas United States 227.22468 229.46571 231.66446
## 1983 1984 1985 1986 1987 1988 1989
## 1 0.05551 0.05585 0.05618 0.05651 0.05683 0.05717 0.05749
## 2 25.4563 25.7018 25.9416 26.2038 26.5497 26.8948 27.3793
## 3 0.05211 0.05263 0.05315 0.05364 0.0541 0.05485 0.05541
## 4 73.36288 75.08014 76.76723 78.44243 80.12249 81.78182 83.36684
## 5 0.00607 0.00611 0.00616 0.00621 0.00625 0.00628 0.00631
## 6 233.79199 235.8249 237.9238 240.13289 242.28892 244.49898 246.81923
## 1990 1991 1992 1993 1994 1995 1996
## 1 0.05778 0.0581 0.0587 0.05924 0.05975 0.06029 0.06087
## 2 27.7906 28.1179 28.54489 28.95334 29.33081 29.69053 30.02632
## 3 0.05563 0.05554 0.05549 0.05564 0.05592 0.05619 0.05634
## 4 84.91365 86.48803 88.11103 89.74914 91.3379 92.88035 94.39858
## 5 0.00632 0.00633 0.00636 0.00638 0.0064 0.0064 0.00641
## 6 249.62281 252.98094 256.51422 259.91859 263.12582 266.27839 269.39428
## 1997 1998 1999 2000 2001 2002 2003
## 1 0.06145 0.06198 0.06251 0.06306 0.06361 0.06418 0.06476
## 2 30.3056 30.55166 30.82026 31.09956 31.37674 31.64096 31.88931
## 3 0.05651 0.05661 0.0567 0.05689 0.05713 0.05736 0.05754
## 4 95.89515 97.32506 98.61691 99.92662 101.24696 102.47993 103.71806
## 5 0.00642 0.00642 0.00643 0.00641 0.00637 0.00633 0.00629
## 6 272.64693 275.8541 279.04017 282.17196 285.08156 287.80391 290.32642
## 2004 2005 2006 2007 2008 2009 2010
## 1 0.06534 0.06591 0.06644 0.06692 0.06739 0.06784 0.06827
## 2 32.13476 32.38638 32.65668 32.93596 33.2127 33.48721 33.75974
## 3 0.0577 0.05778 0.05764 0.05753 0.05756 0.0576 0.05764
## 4 104.95959 106.2029 107.44953 108.70089 109.9554 111.21179 112.46886
## 5 0.00625 0.0062 0.00615 0.0061 0.00605 0.006 0.00594
## 6 293.04574 295.75315 298.59321 301.5799 304.37485 307.00655 310.23286
We can use tidyr to transform the data. The data is in wide format, with one column per year. I want to gather those year columns into key-value pairs and create a new long data frame.
#Let's gather the year columns into key-value pairs and create a new data frame
longdf.pop <- df.countries %>% gather(Year, Population, "1980":"2010")
head(longdf.pop, 10)
## Continent Country Year Population
## 1 Americas Bermuda 1980 0.05473
## 2 Americas Canada 1980 24.5933
## 3 Americas Greenland 1980 0.05021
## 4 Americas Mexico 1980 68.34748
## 5 Americas Saint Pierre and Miquelon 1980 0.00599
## 6 Americas United States 1980 227.22468
## 7 Americas Antigua and Barbuda 1980 0.06855
## 8 Americas Argentina 1980 28.3698
## 9 Americas Aruba 1980 --
## 10 Americas Bahamas, The 1980 0.20976
tail(longdf.pop, 10)
## Continent Country Year Population
## 6563 Asia Philippines 2010 99.90018
## 6564 Oceania Samoa 2010 0.192
## 6565 Asia Singapore 2010 4.70107
## 6566 Oceania Solomon Islands 2010 0.5592
## 6567 Asia Sri Lanka 2010 21.51399
## 6568 Asia Taiwan 2010 23.02496
## 6569 Asia Thailand 2010 67.0895
## 6570 Oceania Tonga 2010 0.12258
## 6571 Oceania Vanuatu 2010 0.22155
## 6572 Asia Vietnam 2010 89.57113
The gather() call named the key and value columns directly, so no renaming is needed. Let's confirm the column names:
names(longdf.pop)
## [1] "Continent" "Country" "Year" "Population"
I chose to remove rows containing NAs but keep the rows with "--" entries. A "--" does not imply that the population was zero, but rather that no population was recorded or the country was not established/did not exist at that time.
longdf.pop<-na.omit(longdf.pop)
head(longdf.pop)
## Continent Country Year Population
## 1 Americas Bermuda 1980 0.05473
## 2 Americas Canada 1980 24.5933
## 3 Americas Greenland 1980 0.05021
## 4 Americas Mexico 1980 68.34748
## 5 Americas Saint Pierre and Miquelon 1980 0.00599
## 6 Americas United States 1980 227.22468
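Note that Population is still a character column because of the "--" placeholders. I keep those markers visible for the Aruba walkthrough below, but for numeric work such as plotting or arithmetic they could be converted to proper NAs; a sketch using a hypothetical longdf.pop.num:

#turn the "--" placeholders into NA, then make Population numeric
longdf.pop.num <- longdf.pop %>%
  mutate(Population = as.numeric(na_if(Population, "--")))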
To look at the change in population for specific countries, we only need Country, Year, and Population, so I will make a subset with just these three columns. In addition, I will examine a country that has "--" entries for certain years; I selected Aruba.
df.aruba<-subset(longdf.pop, Country=='Aruba', select=c(Country, Year, Population))
head(df.aruba, 30)
## Country Year Population
## 9 Aruba 1980 --
## 221 Aruba 1981 --
## 433 Aruba 1982 --
## 645 Aruba 1983 --
## 857 Aruba 1984 --
## 1069 Aruba 1985 --
## 1281 Aruba 1986 0.0598
## 1493 Aruba 1987 0.05918
## 1705 Aruba 1988 0.0595
## 1917 Aruba 1989 0.06069
## 2129 Aruba 1990 0.06303
## 2341 Aruba 1991 0.0663
## 2553 Aruba 1992 0.06948
## 2765 Aruba 1993 0.07407
## 2977 Aruba 1994 0.07785
## 3189 Aruba 1995 0.07996
## 3401 Aruba 1996 0.08307
## 3613 Aruba 1997 0.08621
## 3825 Aruba 1998 0.0882
## 4037 Aruba 1999 0.08926
## 4249 Aruba 2000 0.09
## 4461 Aruba 2001 0.09097
## 4673 Aruba 2002 0.09217
## 4885 Aruba 2003 0.09372
## 5097 Aruba 2004 0.09546
## 5309 Aruba 2005 0.09698
## 5521 Aruba 2006 0.0985
## 5733 Aruba 2007 0.10002
## 5945 Aruba 2008 0.10154
## 6157 Aruba 2009 0.10307
After doing some research online, I learned that Aruba became a self-governing country in 1986, when it seceded from the Netherlands Antilles. Was there a trend in the population of Aruba after it became an established nation?
#Let's quickly visualize
library(ggplot2)
ggplot(data=df.aruba, aes(x=Year, y=Population, group=1)) +
geom_line(arrow = arrow())+
geom_point()+
theme(axis.text.x = element_text(angle = 90, hjust = 1))+
labs( x="Year", y="Population (In Millions)")
Now I want to examine how population changed over time after a major political event. Let's say I want to see how the population of Cuba changed after the events of the Cold War, which resulted in an embargo placed on Cuba. I know that at some point there was a mass exodus of migrants from Cuba to the United States. Castro's government also sent many of its own citizens into exile, including top performers of their time such as Celia Cruz.
More info on the migration of Cubans can be found here: https://www.migrationpolicy.org/article/cuban-immigrants-united-states
df.cuba<-subset(longdf.pop, Country=='Cuba', select=c(Country, Year, Population))
head(df.cuba)
## Country Year Population
## 19 Cuba 1980 9.65298
## 231 Cuba 1981 9.71198
## 443 Cuba 1982 9.78922
## 655 Cuba 1983 9.88126
## 867 Cuba 1984 9.97373
## 1079 Cuba 1985 10.06563
#Visualize the population
ggplot(data=df.cuba, aes(x=Year, y=Population, group=1)) +
geom_line(arrow = arrow())+
geom_point()+
theme(axis.text.x = element_text(angle = 90, hjust = 1))+
labs( x="Year", y="Population (In Millions)")
Read against the article linked above, the plot shows how emigration waves repeatedly disrupted Cuba's population growth. After 1985, the Cuban population increased steadily: the exiles loyal to the Batista regime aged out while new generations were born under the Castro regime.
Movie Data Set: This data was shared by Meaghan Burke. The following steps are requested: 1) deal with missing data; 2) rename columns. There are several columns, so I will rename only those that I use in my analysis.
I selected this data set because I am a huge fan of movies. I usually watch 4-5 movies a week if time permits.
As for analysis, there were no further requested items, so I will visually examine budget vs. gross for my favorite franchise, Star Wars. Any additional analysis will be shown below.
Read in the data
movies <- read_csv(getURL("https://raw.githubusercontent.com/vindication09/DATA-607-Project-2/master/movie_metadata.csv"))
## Warning in rbind(names(probs), probs_f): number of columns of result is not
## a multiple of vector length (arg 1)
## Warning: 4 parsing failures.
## # A tibble: 4 x 5
##     row col    expected   actual      file
##   <int> <chr>  <chr>      <chr>       <chr>
## 1  2324 budget an integer 2400000000  literal data
## 2  2989 budget an integer 12215500000 literal data
## 3  3006 budget an integer 2500000000  literal data
## 4  3860 budget an integer 4200000000  literal data
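The four parsing failures are budgets too large for the 32-bit integer type that readr guessed for the budget column; those cells become NA. A hedged re-read that forces budget to a double would avoid the failures (not run here, so the analysis below proceeds on the original read):

#parse budget as double so values above ~2.1 billion do not overflow
movies <- read_csv(getURL("https://raw.githubusercontent.com/vindication09/DATA-607-Project-2/master/movie_metadata.csv"),
                   col_types = cols(budget = col_double()))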
head(movies,3)
## # A tibble: 3 x 28
## color director_name num_critic_for_revi… duration director_facebook_li…
## <chr> <chr> <int> <int> <int>
## 1 Color James Cameron 723 178 0
## 2 Color Gore Verbinski 302 169 563
## 3 Color Sam Mendes 602 148 0
## # ... with 23 more variables: actor_3_facebook_likes <int>,
## # actor_2_name <chr>, actor_1_facebook_likes <int>, gross <int>,
## # genres <chr>, actor_1_name <chr>, movie_title <chr>,
## # num_voted_users <int>, cast_total_facebook_likes <int>,
## # actor_3_name <chr>, facenumber_in_poster <int>, plot_keywords <chr>,
## # movie_imdb_link <chr>, num_user_for_reviews <int>, language <chr>,
## # country <chr>, content_rating <chr>, budget <int>, title_year <int>,
## # actor_2_facebook_likes <int>, imdb_score <dbl>, aspect_ratio <dbl>,
## # movie_facebook_likes <int>
nrow(movies)
## [1] 5043
names(movies)
## [1] "color" "director_name"
## [3] "num_critic_for_reviews" "duration"
## [5] "director_facebook_likes" "actor_3_facebook_likes"
## [7] "actor_2_name" "actor_1_facebook_likes"
## [9] "gross" "genres"
## [11] "actor_1_name" "movie_title"
## [13] "num_voted_users" "cast_total_facebook_likes"
## [15] "actor_3_name" "facenumber_in_poster"
## [17] "plot_keywords" "movie_imdb_link"
## [19] "num_user_for_reviews" "language"
## [21] "country" "content_rating"
## [23] "budget" "title_year"
## [25] "actor_2_facebook_likes" "imdb_score"
## [27] "aspect_ratio" "movie_facebook_likes"
Let's check the number of NAs and blanks by column.
colSums(is.na(movies)|movies == '')
## color director_name
## 19 104
## num_critic_for_reviews duration
## 50 15
## director_facebook_likes actor_3_facebook_likes
## 104 23
## actor_2_name actor_1_facebook_likes
## 13 7
## gross genres
## 884 0
## actor_1_name movie_title
## 7 0
## num_voted_users cast_total_facebook_likes
## 0 0
## actor_3_name facenumber_in_poster
## 23 13
## plot_keywords movie_imdb_link
## 153 0
## num_user_for_reviews language
## 21 12
## country content_rating
## 5 303
## budget title_year
## 496 108
## actor_2_facebook_likes imdb_score
## 13 0
## aspect_ratio movie_facebook_likes
## 329 0
It looks like gross has the most NA or blank entries, with a total of 884. This does not imply that there were no gross earnings, but rather that no such information was collected. Let's compare movie budget and gross; to do that, I need to remove the NAs and blanks. Hopefully this will not affect the Star Wars movies.
movies2<-na.omit(movies)
nrow(movies2) #This brings us down from 5043 rows to 3752
## [1] 3752
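A caveat: na.omit() drops a row if any of the 28 columns has an NA, so films disappear for reasons irrelevant to a budget-vs-gross comparison (a missing aspect_ratio, for example). Since only budget and gross matter here, a gentler sketch would target just those columns:

#keep every film that has both a budget and a gross figure
movies2 <- movies %>% drop_na(budget, gross)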
I want a data frame that collects movie title, budget, and gross for the Star Wars films: Star Wars: Episode VII - The Force Awakens; Star Wars: Episode II - Attack of the Clones; Star Wars: Episode III - Revenge of the Sith; Star Wars: Episode I - The Phantom Menace; Star Wars: Episode VI - Return of the Jedi; Star Wars: Episode V - The Empire Strikes Back; Star Wars: Episode IV - A New Hope.
If I am lucky, the Star Wars movies are still there because they had no missing values. Typing out the full names of the Star Wars films would be time-consuming, so I will instead subset them with a regular expression anchored at the start of the title.
#we can use a wild card to select all the rows associated with star wars films
moviecost.df<-movies2[grep("^Star Wars:", movies2$movie_title), ]
head(moviecost.df)
## # A tibble: 6 x 28
## color director_name num_critic_for_revi… duration director_facebook_l…
## <chr> <chr> <int> <int> <int>
## 1 Color George Lucas 359 140 0
## 2 Color George Lucas 284 142 0
## 3 Color George Lucas 320 136 0
## 4 Color Richard Marqua… 197 134 37
## 5 Color Irvin Kershner 223 127 883
## 6 Color George Lucas 282 125 0
## # ... with 23 more variables: actor_3_facebook_likes <int>,
## # actor_2_name <chr>, actor_1_facebook_likes <int>, gross <int>,
## # genres <chr>, actor_1_name <chr>, movie_title <chr>,
## # num_voted_users <int>, cast_total_facebook_likes <int>,
## # actor_3_name <chr>, facenumber_in_poster <int>, plot_keywords <chr>,
## # movie_imdb_link <chr>, num_user_for_reviews <int>, language <chr>,
## # country <chr>, content_rating <chr>, budget <int>, title_year <int>,
## # actor_2_facebook_likes <int>, imdb_score <dbl>, aspect_ratio <dbl>,
## # movie_facebook_likes <int>
I need movie title, budget, and gross
#i only need movie_title, budget, and gross
starwars.df<-subset(moviecost.df, select=c(movie_title, budget, gross))
head(starwars.df)
## # A tibble: 6 x 3
## movie_title budget gross
## <chr> <int> <int>
## 1 Star Wars: Episode III - Revenge of the Sith 113000000 380262555
## 2 Star Wars: Episode II - Attack of the Clones 115000000 310675583
## 3 Star Wars: Episode I - The Phantom Menace 115000000 474544677
## 4 Star Wars: Episode VI - Return of the Jedi 32500000 309125409
## 5 Star Wars: Episode V - The Empire Strikes Back 18000000 290158751
## 6 Star Wars: Episode IV - A New Hope 11000000 460935665
starwars.df<-data.frame(starwars.df)
I want to melt this subset using the reshape2 library to make it easier to plot with ggplot.
#melt into long dataframe
library(reshape2)
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
swmelt <- melt(starwars.df, id = 'movie_title')
head(swmelt, 15)
## movie_title variable value
## 1 Star Wars: Episode III - Revenge of the Sith budget 113000000
## 2 Star Wars: Episode II - Attack of the Clones budget 115000000
## 3 Star Wars: Episode I - The Phantom Menace budget 115000000
## 4 Star Wars: Episode VI - Return of the Jedi budget 32500000
## 5 Star Wars: Episode V - The Empire Strikes Back budget 18000000
## 6 Star Wars: Episode IV - A New Hope budget 11000000
## 7 Star Wars: Episode III - Revenge of the Sith gross 380262555
## 8 Star Wars: Episode II - Attack of the Clones gross 310675583
## 9 Star Wars: Episode I - The Phantom Menace gross 474544677
## 10 Star Wars: Episode VI - Return of the Jedi gross 309125409
## 11 Star Wars: Episode V - The Empire Strikes Back gross 290158751
## 12 Star Wars: Episode IV - A New Hope gross 460935665
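For consistency with the earlier tidyr work, the same reshape could be written with gather() instead of reshape2::melt(); a sketch:

#tidyr equivalent of the melt() call above
swlong <- starwars.df %>% gather(variable, value, budget, gross)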
Let's create a stacked bar plot that compares gross vs. budget for each of the Star Wars films.
#I want to visually compare the performance of star wars movies by comparing budget to gross
library("ggplot2")
ggplot() + geom_bar(aes(y = value, x = movie_title, fill =variable ), data = swmelt,stat="identity")+
theme(axis.text.x = element_text(angle = 90, hjust = 1))+
labs( x="Star Wars Film", y="Dollar Amount")
It appears that the gross for each of the Star Wars films exceeded its budget. I would be interested in knowing whether the gross figure reflects just the theatrical release or also takes DVD/Blu-ray sales into account.
I want to examine the correlation between movie rating and number of social media likes. I will need movie title, IMDB score, and movie Facebook likes; these are also fields that can be renamed.
Let's make a subset of the three columns we want and see whether any transformations are necessary.
movielikesdf<-subset(movies, select=c(movie_title, imdb_score, movie_facebook_likes))
head(movielikesdf)
## # A tibble: 6 x 3
## movie_title imdb_score movie_facebook_l…
## <chr> <dbl> <int>
## 1 Avatar 7.90 33000
## 2 Pirates of the Caribbean: At World's End 7.10 0
## 3 Spectre 6.80 85000
## 4 The Dark Knight Rises 8.50 164000
## 5 Star Wars: Episode VII - The Force Awakens 7.10 0
## 6 John Carter 6.60 24000
I notice that there are movies that do not have any Facebook likes. It should be noted that this does not necessarily imply there were no likes; for some reason, that information was not collected.
There are 2181 entries with a zero value for Facebook likes. I would remove them because they add no insight into the relationship between social media likes and movie score; further data collection would be required.
colSums(movielikesdf == '0')
## movie_title imdb_score movie_facebook_likes
## 0 0 2181
We should also check whether there are NA values.
colSums(is.na(movielikesdf))
## movie_title imdb_score movie_facebook_likes
## 0 0 0
Let's remove the zero rows and perform a simple correlation between movie rating and number of Facebook likes.
movielikesdf[-row(movielikesdf)[movielikesdf == 0],]
## # A tibble: 2,862 x 3
## movie_title imdb_score movie_facebook_…
## <chr> <dbl> <int>
## 1 Avatar 7.90 33000
## 2 Spectre 6.80 85000
## 3 The Dark Knight Rises 8.50 164000
## 4 John Carter 6.60 24000
## 5 Tangled 7.80 29000
## 6 Avengers: Age of Ultron 7.50 118000
## 7 Harry Potter and the Half-Blood Prince 7.50 10000
## 8 Batman v Superman: Dawn of Justice 6.90 197000
## 9 Pirates of the Caribbean: Dead Man's Chest 7.30 5000
## 10 The Lone Ranger 6.50 48000
## # ... with 2,852 more rows
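One catch: the filtered tibble above is printed but never assigned, so the correlation test below still runs on all 5043 rows (the 5041 degrees of freedom confirm it). To actually test on the nonzero subset, a sketch with a hypothetical movielikes_nz:

#assign the nonzero subset, then correlate on it
movielikes_nz <- movielikesdf %>% filter(movie_facebook_likes > 0)
cor.test(movielikes_nz$imdb_score, movielikes_nz$movie_facebook_likes, method = "pearson")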
Let's run a simple correlation test, first visually:
library("ggpubr")
## Loading required package: magrittr
##
## Attaching package: 'magrittr'
## The following object is masked from 'package:tidyr':
##
## extract
ggscatter(movielikesdf, x="imdb_score", y="movie_facebook_likes" ,
add = "reg.line", conf.int = TRUE,
cor.coef = TRUE, cor.method = "pearson",
xlab = "IMDB Score", ylab = "Facebook Likes")
I can get a better story from a numerical test such as a Pearson correlation test.
x<-movielikesdf$imdb_score
y<-movielikesdf$movie_facebook_likes
res <- cor.test(x, y,
method = "pearson")
res
##
## Pearson's product-moment correlation
##
## data: x and y
## t = 18.102, df = 5041, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2209543 0.2727893
## sample estimates:
## cor
## 0.2470485
Our t statistic is 18.102 with 5041 degrees of freedom, and the p-value is far below an alpha of .05, so I conclude that IMDB rating and Facebook likes are correlated, with a correlation coefficient of 0.247. That coefficient indicates a weak positive relationship. Because the correlation is weak, I would want more information, such as whether the Facebook likes came before or after a movie's release; some movies are hyped up and collect many likes before they are actually released, and I wonder if that is the case here. It would be great to partition the likes into pre-release and post-release likes.
Sample Web Traffic Data. This final dataset is my own. It is similar to the type of data I work with in media, specifically web traffic. I am unable to supply actual data for legal reasons, but I can simulate data that closely resembles it.
This type of data has user-level granularity. When a visitor comes to our site and engages with content such as watching a video, reading an article, or checking stock quotes, they are identified with a unique id and their behaviors on the site are collected. We usually collect about 1 TB of data a day, and our engineering team aggregates the data into a monthly format.
Let's simulate some data. For the sake of the problem, we will generate just 10 unique ids and randomly assign attributes to them.
#This will simulate unique strings that look like user ids
library(stringi)
x<-stri_rand_strings(10, 6)
df.x<-data.frame(as.character(x))
df.x
## as.character.x.
## 1 ua9IWm
## 2 9l3Grm
## 3 jzjWLg
## 4 f8JwJH
## 5 0f8VGC
## 6 zRwIyt
## 7 KkvdrI
## 8 3LLLCQ
## 9 ZXVIKx
## 10 ihsK6n
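Because stri_rand_strings() draws random characters, these ids change on every knit. For a reproducible walkthrough, the seed could be fixed first; a sketch (the seed value 607 is arbitrary):

#fixing the seed makes the simulated ids identical on every run
set.seed(607)
x <- stri_rand_strings(10, 6)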
We then assign each user a professional group they are associated with. Some values will include junk.
progroups<-c("Finance", "HR", "IT", "Tech", "NULL", "Finance", "IT", "HR", "Finance", "Finance")
df.progroups<-data.frame(as.character(progroups))
df.progroups
## as.character.progroups.
## 1 Finance
## 2 HR
## 3 IT
## 4 Tech
## 5 NULL
## 6 Finance
## 7 IT
## 8 HR
## 9 Finance
## 10 Finance
We randomly assign users a country. Country is usually identified using the user agent attached to a visitor coming to our site. The user agent also contains information on a visitor's browser type and operating system; these are known as technographics.
country<-c("US","Canada","South Korea","US", "US", "Mexico", "France","Colombia","US", "Private")
df.country<-data.frame(as.character(country))
df.country
## as.character.country.
## 1 US
## 2 Canada
## 3 South Korea
## 4 US
## 5 US
## 6 Mexico
## 7 France
## 8 Colombia
## 9 US
## 10 Private
Let's randomly assign users a referral domain. This is simply where a visitor came from before arriving at our site. For example, a user sees a link shared by my company on Facebook and clicks on it; that user's referral domain is simply facebook. This attribute is known to show lots of junk values, possibly because of private browser settings or the use of VPNs.
refdom<-c("yahoo","NULL","fb", "Twitter", "NULL", "#g45y6", "Yahoo","ggggggggg", "Website","Facebook")
df.refdom<-data.frame(as.character(refdom))
df.refdom
## as.character.refdom.
## 1 yahoo
## 2 NULL
## 3 fb
## 4 Twitter
## 5 NULL
## 6 #g45y6
## 7 Yahoo
## 8 ggggggggg
## 9 Website
## 10 Facebook
Now let's simulate the number of hits on a specific page per user. This is the number of times a user engaged with the content on our finance page or our politics page, where engage is a blanket term that covers clicks, mouse hovers, and video watching. It is not unusual to see a single user hitting a page more than 100 times; we actually have a way to monitor page refreshes, so someone who lands on a page and squats there racks up a hit each time the page auto-refreshes every minute.
For now, let's say we only have 4 pages we are interested in collecting user information on.
FinanceHits<-c("12","45", "89", "0", 'NULL',"20","21", "NULL", "NULL", "67")
PoliticsHits<-c("34","89","398","473", "42", '933',"57", "NULL","0", "45")
NewsHits<-c("44","4982", "879","490","624", "903", "8933", "NULL", "902","270")
BusinessHits<-c("12", "532", "90", "493", "909", "47", "3", "0", "2", "1")
df.hits<-data.frame(FinanceHits, PoliticsHits, NewsHits, BusinessHits)
df.hits
## FinanceHits PoliticsHits NewsHits BusinessHits
## 1 12 34 44 12
## 2 45 89 4982 532
## 3 89 398 879 90
## 4 0 473 490 493
## 5 NULL 42 624 909
## 6 20 933 903 47
## 7 21 57 8933 3
## 8 NULL NULL NULL 0
## 9 NULL 0 902 2
## 10 67 45 270 1
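A design note: encoding missing hits as the string "NULL" forces every hits column to character (and later to factor), which creates the type-conversion headache further down. Had the junk been encoded as real NAs, the columns could stay numeric from the start; a sketch with the same values:

#numeric vectors with real NAs avoid the character/factor detour entirely
FinanceHits <- c(12, 45, 89, 0, NA, 20, 21, NA, NA, 67)
PoliticsHits <- c(34, 89, 398, 473, 42, 933, 57, NA, 0, 45)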
Let's put together our sample data frame.
sample.df<-data.frame(df.x,df.progroups,df.country, df.refdom, df.hits)
sample.df
## as.character.x. as.character.progroups. as.character.country.
## 1 ua9IWm Finance US
## 2 9l3Grm HR Canada
## 3 jzjWLg IT South Korea
## 4 f8JwJH Tech US
## 5 0f8VGC NULL US
## 6 zRwIyt Finance Mexico
## 7 KkvdrI IT France
## 8 3LLLCQ HR Colombia
## 9 ZXVIKx Finance US
## 10 ihsK6n Finance Private
## as.character.refdom. FinanceHits PoliticsHits NewsHits BusinessHits
## 1 yahoo 12 34 44 12
## 2 NULL 45 89 4982 532
## 3 fb 89 398 879 90
## 4 Twitter 0 473 490 493
## 5 NULL NULL 42 624 909
## 6 #g45y6 20 933 903 47
## 7 Yahoo 21 57 8933 3
## 8 ggggggggg NULL NULL NULL 0
## 9 Website NULL 0 902 2
## 10 Facebook 67 45 270 1
colnames(sample.df)[colnames(sample.df)=="as.character.x."] <- "visitorID"
colnames(sample.df)[colnames(sample.df)=="as.character.progroups."] <- "ProGroup"
colnames(sample.df)[colnames(sample.df)=="as.character.country."] <- "Country"
colnames(sample.df)[colnames(sample.df)=="as.character.refdom."] <- "ReferralDomain"
head(sample.df)
## visitorID ProGroup Country ReferralDomain FinanceHits PoliticsHits
## 1 ua9IWm Finance US yahoo 12 34
## 2 9l3Grm HR Canada NULL 45 89
## 3 jzjWLg IT South Korea fb 89 398
## 4 f8JwJH Tech US Twitter 0 473
## 5 0f8VGC NULL US NULL NULL 42
## 6 zRwIyt Finance Mexico #g45y6 20 933
## NewsHits BusinessHits
## 1 44 12
## 2 4982 532
## 3 879 90
## 4 490 493
## 5 624 909
## 6 903 47
We now have a sample wide format that simulates web traffic data at the user level. We can proceed to performing transformations on the data set as needed.
Since this sample data is small, removing a single user also removes all of that user's hit values. If I were modeling on realistic data, removing users would not be a problem because the data contains millions of users.
I want to look at the number of hits to a page grouped by professional group.
longsample <- sample.df %>% gather(Page, Hits, FinanceHits:BusinessHits)
## Warning: attributes are not identical across measure variables;
## they will be dropped
head(longsample, 20)
## visitorID ProGroup Country ReferralDomain Page Hits
## 1 ua9IWm Finance US yahoo FinanceHits 12
## 2 9l3Grm HR Canada NULL FinanceHits 45
## 3 jzjWLg IT South Korea fb FinanceHits 89
## 4 f8JwJH Tech US Twitter FinanceHits 0
## 5 0f8VGC NULL US NULL FinanceHits NULL
## 6 zRwIyt Finance Mexico #g45y6 FinanceHits 20
## 7 KkvdrI IT France Yahoo FinanceHits 21
## 8 3LLLCQ HR Colombia ggggggggg FinanceHits NULL
## 9 ZXVIKx Finance US Website FinanceHits NULL
## 10 ihsK6n Finance Private Facebook FinanceHits 67
## 11 ua9IWm Finance US yahoo PoliticsHits 34
## 12 9l3Grm HR Canada NULL PoliticsHits 89
## 13 jzjWLg IT South Korea fb PoliticsHits 398
## 14 f8JwJH Tech US Twitter PoliticsHits 473
## 15 0f8VGC NULL US NULL PoliticsHits 42
## 16 zRwIyt Finance Mexico #g45y6 PoliticsHits 933
## 17 KkvdrI IT France Yahoo PoliticsHits 57
## 18 3LLLCQ HR Colombia ggggggggg PoliticsHits NULL
## 19 ZXVIKx Finance US Website PoliticsHits 0
## 20 ihsK6n Finance Private Facebook PoliticsHits 45
We now have a long data frame that can be subset to answer the first question: compare the number of hits per page type across professional groups. I notice that there are NULL values; since they won't add anything to the analysis, I can remove those rows.
longsample2<-longsample[!grepl("NULL", longsample$Hits),]
head(longsample2, 20)
## visitorID ProGroup Country ReferralDomain Page Hits
## 1 ua9IWm Finance US yahoo FinanceHits 12
## 2 9l3Grm HR Canada NULL FinanceHits 45
## 3 jzjWLg IT South Korea fb FinanceHits 89
## 4 f8JwJH Tech US Twitter FinanceHits 0
## 6 zRwIyt Finance Mexico #g45y6 FinanceHits 20
## 7 KkvdrI IT France Yahoo FinanceHits 21
## 10 ihsK6n Finance Private Facebook FinanceHits 67
## 11 ua9IWm Finance US yahoo PoliticsHits 34
## 12 9l3Grm HR Canada NULL PoliticsHits 89
## 13 jzjWLg IT South Korea fb PoliticsHits 398
## 14 f8JwJH Tech US Twitter PoliticsHits 473
## 15 0f8VGC NULL US NULL PoliticsHits 42
## 16 zRwIyt Finance Mexico #g45y6 PoliticsHits 933
## 17 KkvdrI IT France Yahoo PoliticsHits 57
## 19 ZXVIKx Finance US Website PoliticsHits 0
## 20 ihsK6n Finance Private Facebook PoliticsHits 45
## 21 ua9IWm Finance US yahoo NewsHits 44
## 22 9l3Grm HR Canada NULL NewsHits 4982
## 23 jzjWLg IT South Korea fb NewsHits 879
## 24 f8JwJH Tech US Twitter NewsHits 490
Because gather() was given descriptive key and value names up front, there is nothing left to rename; a quick check:
head(longsample2)
## visitorID ProGroup Country ReferralDomain Page Hits
## 1 ua9IWm Finance US yahoo FinanceHits 12
## 2 9l3Grm HR Canada NULL FinanceHits 45
## 3 jzjWLg IT South Korea fb FinanceHits 89
## 4 f8JwJH Tech US Twitter FinanceHits 0
## 6 zRwIyt Finance Mexico #g45y6 FinanceHits 20
## 7 KkvdrI IT France Yahoo FinanceHits 21
We now have a long data frame. If we want to use dplyr to compare the number of hits by pro group, it is better to use the wide version of the data frame. We can go ahead and remove rows that contain null values.
sample.df2<-sample.df[!grepl("NULL", sample.df$FinanceHits),]
sample.df2<-data.frame(sample.df2)
head(sample.df2, 8)
## visitorID ProGroup Country ReferralDomain FinanceHits PoliticsHits
## 1 ua9IWm Finance US yahoo 12 34
## 2 9l3Grm HR Canada NULL 45 89
## 3 jzjWLg IT South Korea fb 89 398
## 4 f8JwJH Tech US Twitter 0 473
## 6 zRwIyt Finance Mexico #g45y6 20 933
## 7 KkvdrI IT France Yahoo 21 57
## 10 ihsK6n Finance Private Facebook 67 45
## NewsHits BusinessHits
## 1 44 12
## 2 4982 532
## 3 879 90
## 4 490 493
## 6 903 47
## 7 8933 3
## 10 270 1
Before I do any comparisons, I need to check the data types. The mean in dplyr won't work if the hits columns are not numeric, and data.frame() has stored them as factors (stringsAsFactors defaults to TRUE).
str(sample.df2)
## 'data.frame': 7 obs. of 8 variables:
## $ visitorID : Factor w/ 10 levels "0f8VGC","3LLLCQ",..: 8 3 6 4 9 7 5
## $ ProGroup : Factor w/ 5 levels "Finance","HR",..: 1 2 3 5 1 3 1
## $ Country : Factor w/ 7 levels "Canada","Colombia",..: 7 1 6 7 4 3 5
## $ ReferralDomain: Factor w/ 9 levels "#g45y6","Facebook",..: 8 5 3 6 1 9 2
## $ FinanceHits : Factor w/ 8 levels "0","12","20",..: 2 5 7 1 3 4 6
## $ PoliticsHits : Factor w/ 10 levels "0","34","398",..: 2 8 3 6 9 7 5
## $ NewsHits : Factor w/ 10 levels "270","44","490",..: 2 4 6 3 9 7 1
## $ BusinessHits : Factor w/ 10 levels "0","1","12","2",..: 3 8 9 7 6 5 2
#convert the factor columns to numeric; going through as.character() is
#essential, since as.numeric() on a factor returns the underlying level
#codes rather than the stored values
sample.df2$FinanceHits<-as.numeric(as.character(sample.df2$FinanceHits))
sample.df2$PoliticsHits<-as.numeric(as.character(sample.df2$PoliticsHits))
sample.df2$NewsHits<-as.numeric(as.character(sample.df2$NewsHits))
sample.df2$BusinessHits<-as.numeric(as.character(sample.df2$BusinessHits))
We use dplyr to compare the number of hits to a page grouped by progroup.
sample.df2 %>%
group_by(ProGroup)%>%
summarise(Mean_Financehits = mean(FinanceHits, na.rm=TRUE),
Mean_Politicshits=mean(PoliticsHits, na.rm=TRUE),
Mean_NewsHits=mean(NewsHits, na.rm=TRUE),
Mean_BusinessHits=mean(BusinessHits, na.rm=TRUE))
## # A tibble: 4 x 5
## ProGroup Mean_Financehits Mean_Politicshits Mean_NewsHits
## <fct> <dbl> <dbl> <dbl>
## 1 Finance             33.0             337.           406.
## 2 HR                  45.0              89.0          4982.
## 3 IT                  55.0             228.           4906.
## 4 Tech                 0               473.            490.
## # ... with 1 more variable: Mean_BusinessHits <dbl>
On average, visitors who identify as finance professionals engage with the News pages more than with the other pages. With this insight, teams across editorial can figure out where finance pros are getting the latest finance news.
We can also group by attributes. In the case of our sample data, we have one attribute which is the referral domain. In actual web traffic, we have hundreds of columns for attributes such as brand, topic, adcode, transport, etc.
sample.df2 %>%
group_by(ReferralDomain)%>%
summarise(Mean_Financehits = mean(FinanceHits, na.rm=TRUE),
Mean_Politicshits=mean(PoliticsHits, na.rm=TRUE),
Mean_NewsHits=mean(NewsHits, na.rm=TRUE),
Mean_BusinessHits=mean(BusinessHits, na.rm=TRUE))
## # A tibble: 7 x 5
## ReferralDomain Mean_Financehits Mean_Politicshits Mean_NewsHits
## <fct> <dbl> <dbl> <dbl>
## 1 #g45y6                    20.0             933.          903.
## 2 Facebook                  67.0              45.0         270.
## 3 fb                        89.0             398.          879.
## 4 NULL                      45.0              89.0        4982.
## 5 Twitter                    0               473.          490.
## 6 yahoo                     12.0              34.0          44.0
## 7 Yahoo                     21.0              57.0        8933.
## # ... with 1 more variable: Mean_BusinessHits <dbl>
I notice that there are junk ref doms. Say I want to see the engagement of just the users who came from social media: the visitor referred from fb engaged most heavily with News content, while the Twitter visitor split most engagement between the Business and Politics pages. With only one visitor per referral domain in this small sample, these are illustrations of the workflow rather than real trends.