Code Glossary for DA 101

This is an R Markdown document. Please fill in all example text with your own words and definitions. For each function in the code glossary, you should show your code. If you need extra code to do some data wrangling or bring in a new library, please hide it from the rendered html. Each term should have: (1) what package does it come from? (2) general definition - what does the function do and for what kind of tasks/situations would you use it? (3) worked example, using your dataset, or one of the example datasets from an R package (4) text below the example that explains specifically, what the code accomplished. In other words, what was the input and the output? What can you do now that you have run the example that you couldn’t before?

Data Wrangling

0. Explain what “tidy data” is and why it is useful for coding.

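Definition: Data is tidy when each column is a variable, each row is an observation, and each cell holds a single value. This layout is useful because the tidyverse functions used in this glossary are designed around it, so tidy data can be filtered, summarized, and interpreted without first being reshaped.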

Code example (show an example dataset preview or header that you can use to explain):

head(fifa_wc_audience)
##   X        country confederation population_share tv_audience_share
## 1 1  United States      CONCACAF              4.5               4.3
## 2 2          Japan           AFC              1.9               4.9
## 3 3          China           AFC             19.5              14.8
## 4 4        Germany          UEFA              1.2               2.9
## 5 5         Brazil      CONMEBOL              2.8               7.1
## 6 6 United Kingdom          UEFA              0.9               2.1
##   gdp_weighted_share
## 1               11.3
## 2                9.1
## 3                7.3
## 4                6.3
## 5                5.4
## 6                4.2

Explanation: The dataset previewed with head() is tidy: each row is one country (one observation), each column is one variable (the country's confederation, population share, TV audience share, and GDP-weighted share), and each cell holds a single value.
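
For contrast, here is a small sketch of a table that is not tidy and one way to tidy it. The table and its numbers are made up for illustration (they are not from fifa_wc_audience), and I am assuming the tidyr package is installed.

```r
library(tidyr)

# Hypothetical untidy table: the year is stored in the column names,
# so each row holds two observations. All numbers are made up.
untidy <- data.frame(
  country = c("A", "B"),
  `2010` = c(10, 20),
  `2014` = c(12, 18),
  check.names = FALSE
)

# pivot_longer() moves the year columns into rows, so that each row
# is one observation (one country in one year).
tidy <- pivot_longer(
  untidy,
  cols = c(`2010`, `2014`),
  names_to = "year",
  values_to = "audience"
)
tidy
```

After reshaping there are four rows and three columns (country, year, audience), so year can be used like any other variable.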

1. Compare and contrast summary, str, and unique

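Package: base R (summary and unique are in the base package; str is in the utils package, which is loaded by default)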

Definition: summary() gives summary statistics for each column: the minimum, quartiles, median, mean, and maximum for numeric columns, and the length and class for character columns. str() (short for “structure”) compactly displays the structure of an object: the number of observations and variables, the type of each column, and its first few values. unique() returns its input with duplicates removed: the distinct values of a vector, or the distinct rows of a data frame.

Code example:

summary(fifa_wc_audience)
##        X           country          confederation      population_share 
##  Min.   :  1.0   Length:191         Length:191         Min.   : 0.0000  
##  1st Qu.: 48.5   Class :character   Class :character   1st Qu.: 0.0000  
##  Median : 96.0   Mode  :character   Mode  :character   Median : 0.1000  
##  Mean   : 96.0                                         Mean   : 0.5225  
##  3rd Qu.:143.5                                         3rd Qu.: 0.3500  
##  Max.   :191.0                                         Max.   :19.5000  
##  tv_audience_share gdp_weighted_share
##  Min.   : 0.000    Min.   : 0.0000   
##  1st Qu.: 0.000    1st Qu.: 0.0000   
##  Median : 0.100    Median : 0.0000   
##  Mean   : 0.523    Mean   : 0.5204   
##  3rd Qu.: 0.300    3rd Qu.: 0.3000   
##  Max.   :14.800    Max.   :11.3000
str(fifa_wc_audience)
## 'data.frame':    191 obs. of  6 variables:
##  $ X                 : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ country           : chr  "United States" "Japan" "China" "Germany" ...
##  $ confederation     : chr  "CONCACAF" "AFC" "AFC" "UEFA" ...
##  $ population_share  : num  4.5 1.9 19.5 1.2 2.8 0.9 0.9 0.9 2.1 0.7 ...
##  $ tv_audience_share : num  4.3 4.9 14.8 2.9 7.1 2.1 2.1 2 3.1 1.8 ...
##  $ gdp_weighted_share: num  11.3 9.1 7.3 6.3 5.4 4.2 4 4 3.5 3.1 ...
unique(fifa_wc_audience)
##       X                  country confederation population_share
## 1     1            United States      CONCACAF              4.5
## 2     2                    Japan           AFC              1.9
## 3     3                    China           AFC             19.5
## 4     4                  Germany          UEFA              1.2
## 5     5                   Brazil      CONMEBOL              2.8
## 6     6           United Kingdom          UEFA              0.9
## 7     7                    Italy          UEFA              0.9
## 8     8                   France          UEFA              0.9
## 9     9                   Russia          UEFA              2.1
## 10   10                    Spain          UEFA              0.7
## 11   11              South Korea           AFC              0.7
## 12   12                Indonesia           AFC              3.5
## 13   13                   Mexico      CONCACAF              1.7
## 14   14                   Turkey          UEFA              1.1
## 15   15                 Thailand           AFC              1.0
## 16   16                Argentina      CONMEBOL              0.6
## 17   17              Netherlands          UEFA              0.2
## 18   18                   Poland          UEFA              0.6
## 19   19             Saudi Arabia           AFC              0.4
## 20   20                   Taiwan           AFC              0.3
## 21   21                   Canada      CONCACAF              0.5
## 22   22                 Colombia      CONMEBOL              0.7
## 23   23                Venezuela      CONMEBOL              0.4
## 24   24             South Africa           CAF              0.7
## 25   25                 Malaysia           AFC              0.4
## 26   26              Switzerland          UEFA              0.1
## 27   27                  Nigeria           CAF              2.3
## 28   28                  Belgium          UEFA              0.2
## 29   29                   Sweden          UEFA              0.1
## 30   30                  Vietnam           AFC              1.3
## 31   31                     Iran           AFC              1.1
## 32   32                    Chile      CONMEBOL              0.3
## 33   33                  Romania          UEFA              0.3
## 34   34                  Austria          UEFA              0.1
## 35   35                Singapore           AFC              0.1
## 36   36                Australia           AFC              0.3
## 37   37                   Greece          UEFA              0.2
## 38   38                 Portugal          UEFA              0.2
## 39   39                    India           AFC             17.6
## 40   40           Czech Republic          UEFA              0.2
## 41   41                    Egypt           CAF              1.1
## 42   42                  Denmark          UEFA              0.1
## 43   43                   Norway          UEFA              0.1
## 44   44                     Peru      CONMEBOL              0.4
## 45   45                  Ukraine          UEFA              0.7
## 46   46                     Iraq           AFC              0.5
## 47   47                  Hungary          UEFA              0.1
## 48   48               Kazakhstan          UEFA              0.2
## 49   49                  Finland          UEFA              0.1
## 50   50                  Ireland          UEFA              0.1
## 51   51                  Algeria           CAF              0.5
## 52   52                     Cuba      CONCACAF              0.2
## 53   53                  Ecuador      CONMEBOL              0.2
## 54   54                 Slovakia          UEFA              0.1
## 55   55                    Qatar           AFC              0.0
## 56   56                   Kuwait           AFC              0.0
## 57   57                 Bulgaria          UEFA              0.1
## 58   58                   Serbia          UEFA              0.1
## 59   59                  Belarus          UEFA              0.1
## 60   60                Hong Kong           AFC              0.1
## 61   61                  Croatia          UEFA              0.1
## 62   62                     Oman           AFC              0.0
## 63   63        Dominica Republic      CONCACAF              0.1
## 64   64               Azerbaijan          UEFA              0.1
## 65   65              New Zealand           OFC              0.1
## 66   66                Lithuania          UEFA              0.0
## 67   67                 Slovenia          UEFA              0.0
## 68   68                  Uruguay      CONMEBOL              0.0
## 69   69               Costa Rica      CONCACAF              0.1
## 70   70               Uzbekistan           AFC              0.4
## 71   71                    Yemen           CAF              0.3
## 72   72                   Israel          UEFA              0.1
## 73   73              El Salvador      CONCACAF              0.1
## 74   74                    Syria           AFC              0.3
## 75   75                 Pakistan           AFC              2.5
## 76   76                Guatemala      CONCACAF              0.2
## 77   77                 Paraguay      CONMEBOL              0.1
## 78   78                   Panama      CONCACAF              0.1
## 79   79       Bosnia-Herzegovina          UEFA              0.1
## 80   80                 Cambodia           AFC              0.2
## 81   81              Ivory Coast           CAF              0.3
## 82   82                    Macau           AFC              0.0
## 83   83                   Latvia          UEFA              0.0
## 84   84                  Lebanon           AFC              0.1
## 85   85                   Jordan           AFC              0.1
## 86   86                 Honduras      CONCACAF              0.1
## 87   87                   Brunei           AFC              0.0
## 88   88                  Albania          UEFA              0.0
## 89   89             Turkmenistan           AFC              0.1
## 90   90                   Angola           CAF              0.3
## 91   91                  Estonia          UEFA              0.0
## 92   92                  Bahrain           AFC              0.0
## 93   93                    Nepal           AFC              0.4
## 94   94                   Cyprus          UEFA              0.0
## 95   95                    Ghana           CAF              0.4
## 96   96                Mauritius           CAF              0.0
## 97   97                Macedonia          UEFA              0.0
## 98   98                    Kenya           CAF              0.6
## 99   99         Trinidad &Tobago      CONCACAF              0.0
## 100 100              Philippines           AFC              1.4
## 101 101                  Bolivia      CONMEBOL              0.1
## 102 102                     Laos           AFC              0.1
## 103 103                  Armenia          UEFA              0.0
## 104 104                Nicaragua      CONCACAF              0.1
## 105 105              Afghanistan           AFC              0.4
## 106 106                   Kosovo          UEFA              0.0
## 107 107                 Cameroon           CAF              0.3
## 108 108                  Senegal           CAF              0.2
## 109 109                  Jamaica      CONCACAF              0.0
## 110 110                Sri Lanka           AFC              0.3
## 111 111                  Myanmar           AFC              0.8
## 112 112                  Moldova          UEFA              0.1
## 113 113                    Malta          UEFA              0.0
## 114 114               Bangladesh           AFC              2.2
## 115 115                   Zambia           CAF              0.2
## 116 116                  Morocco           CAF              0.5
## 117 117              North Korea           AFC              0.4
## 118 118                 Botswana           CAF              0.0
## 119 119               Tajikistan           AFC              0.1
## 120 120                  Iceland          UEFA              0.0
## 121 121                   Uganda           CAF              0.5
## 122 122                    Libya           CAF              0.1
## 123 123                Palestine           AFC              0.1
## 124 124               Kyrgyzstan           AFC              0.1
## 125 125                 Mongolia           AFC              0.0
## 126 126               Montenegro          UEFA              0.0
## 127 127                 Tanzania           CAF              0.7
## 128 128                  Georgia          UEFA              0.1
## 129 129                 Suriname      CONCACAF              0.0
## 130 130                    Sudan           CAF              0.5
## 131 131                    Gabon           CAF              0.0
## 132 132               Madagascar           CAF              0.3
## 133 133                  Tunisia           AFC              0.2
## 134 134                 Ethiopia           CAF              1.3
## 135 135                 Zimbabwe           CAF              0.2
## 136 136                  Namibia           CAF              0.0
## 137 137                  Bahamas      CONCACAF              0.0
## 138 138         Papua New Guinea           OFC              0.1
## 139 139                   Guyana      CONCACAF              0.0
## 140 140           Turks & Caicos      CONCACAF              0.0
## 141 141                 Congo DR           CAF              0.9
## 142 142             Burkina Faso           CAF              0.2
## 143 143                   Guinea           CAF              0.2
## 144 144                    Haiti      CONCACAF              0.1
## 145 145                     Fiji           OFC              0.0
## 146 146                 Barbados      CONCACAF              0.0
## 147 147                     Mali           CAF              0.2
## 148 148                  Bermuda      CONCACAF              0.0
## 149 149              St. Maarten      CONCACAF              0.0
## 150 150        Equatorial Guinea           CAF              0.0
## 151 151               Mozambique           CAF              0.3
## 152 152               Seychelles           CAF              0.0
## 153 153               Cape Verde           CAF              0.0
## 154 154                    Benin           CAF              0.1
## 155 155                Swaziland           CAF              0.0
## 156 156           Cayman Islands      CONCACAF              0.0
## 157 157                    Aruba      CONCACAF              0.0
## 158 158                 Maldives           AFC              0.0
## 159 159                    Niger           CAF              0.2
## 160 160                  Curacao      CONCACAF              0.0
## 161 161                St. Lucia      CONCACAF              0.0
## 162 162                     Togo           CAF              0.1
## 163 163                  Burundi           CAF              0.1
## 164 164              Congo, Rep.           CAF              0.1
## 165 165        Antigua & Barbuda      CONCACAF              0.0
## 166 166                     Chad           CAF              0.2
## 167 167                  Eritrea           CAF              0.1
## 168 168                  Grenada      CONCACAF              0.0
## 169 169                  Lesotho           CAF              0.0
## 170 170              St. Vincent      CONCACAF              0.0
## 171 171                   Malawi           CAF              0.2
## 172 172             Sierra Leone           CAF              0.1
## 173 173               Mauritania           CAF              0.1
## 174 174          Solomon Islands           OFC              0.0
## 175 175                 Dominica      CONCACAF              0.0
## 176 176                    Timor           AFC              0.0
## 177 177                St. Kitts      CONCACAF              0.0
## 178 178                   Rwanda           CAF              0.2
## 179 179                  Somalia           CAF              0.1
## 180 180                   Gambia           CAF              0.0
## 181 181                    Samoa           OFC              0.0
## 182 182            Guinea-Bissau           CAF              0.0
## 183 183 Central African Republic           CAF              0.1
## 184 184                  Vanuatu           OFC              0.0
## 185 185           American Samoa           OFC              0.0
## 186 186             Cook Islands           OFC              0.0
## 187 187                    Tonga           OFC              0.0
## 188 188                  Liberia           CAF              0.1
## 189 189                    Palau           OFC              0.0
## 190 190                    Nauru           OFC              0.0
## 191 191                     Niue           OFC              0.0
##     tv_audience_share gdp_weighted_share
## 1                 4.3               11.3
## 2                 4.9                9.1
## 3                14.8                7.3
## 4                 2.9                6.3
## 5                 7.1                5.4
## 6                 2.1                4.2
## 7                 2.1                4.0
## 8                 2.0                4.0
## 9                 3.1                3.5
## 10                1.8                3.1
## 11                1.8                3.0
## 12                6.7                2.9
## 13                3.2                2.6
## 14                2.3                2.0
## 15                2.4                1.6
## 16                1.5                1.6
## 17                0.6                1.5
## 18                1.2                1.3
## 19                0.5                1.2
## 20                0.5                1.0
## 21                0.5                1.0
## 22                1.6                0.9
## 23                1.0                0.9
## 24                1.3                0.8
## 25                0.7                0.7
## 26                0.3                0.7
## 27                2.6                0.7
## 28                0.3                0.7
## 29                0.3                0.7
## 30                2.6                0.6
## 31                0.7                0.6
## 32                0.6                0.6
## 33                0.7                0.6
## 34                0.3                0.6
## 35                0.2                0.6
## 36                0.3                0.5
## 37                0.3                0.5
## 38                0.4                0.5
## 39                2.0                0.5
## 40                0.3                0.5
## 41                0.8                0.5
## 42                0.2                0.5
## 43                0.1                0.4
## 44                0.8                0.4
## 45                0.9                0.4
## 46                0.5                0.4
## 47                0.3                0.4
## 48                0.3                0.3
## 49                0.2                0.3
## 50                0.1                0.3
## 51                0.4                0.3
## 52                0.3                0.3
## 53                0.5                0.3
## 54                0.2                0.3
## 55                0.0                0.2
## 56                0.1                0.2
## 57                0.2                0.2
## 58                0.3                0.2
## 59                0.2                0.2
## 60                0.1                0.2
## 61                0.1                0.1
## 62                0.0                0.1
## 63                0.2                0.1
## 64                0.1                0.1
## 65                0.1                0.1
## 66                0.1                0.1
## 67                0.1                0.1
## 68                0.1                0.1
## 69                0.2                0.1
## 70                0.5                0.1
## 71                0.4                0.1
## 72                0.1                0.1
## 73                0.2                0.1
## 74                0.3                0.1
## 75                0.4                0.1
## 76                0.2                0.1
## 77                0.2                0.1
## 78                0.1                0.1
## 79                0.1                0.1
## 80                0.5                0.1
## 81                0.4                0.1
## 82                0.0                0.1
## 83                0.1                0.1
## 84                0.1                0.1
## 85                0.1                0.1
## 86                0.3                0.1
## 87                0.0                0.1
## 88                0.1                0.1
## 89                0.1                0.1
## 90                0.1                0.1
## 91                0.0                0.0
## 92                0.0                0.0
## 93                0.4                0.0
## 94                0.0                0.0
## 95                0.2                0.0
## 96                0.0                0.0
## 97                0.1                0.0
## 98                0.3                0.0
## 99                0.0                0.0
## 100               0.1                0.0
## 101               0.1                0.0
## 102               0.2                0.0
## 103               0.1                0.0
## 104               0.1                0.0
## 105               0.3                0.0
## 106               0.1                0.0
## 107               0.2                0.0
## 108               0.2                0.0
## 109               0.1                0.0
## 110               0.1                0.0
## 111               0.1                0.0
## 112               0.1                0.0
## 113               0.0                0.0
## 114               0.1                0.0
## 115               0.1                0.0
## 116               0.1                0.0
## 117               0.2                0.0
## 118               0.0                0.0
## 119               0.1                0.0
## 120               0.0                0.0
## 121               0.2                0.0
## 122               0.0                0.0
## 123               0.1                0.0
## 124               0.1                0.0
## 125               0.0                0.0
## 126               0.0                0.0
## 127               0.1                0.0
## 128               0.0                0.0
## 129               0.0                0.0
## 130               0.1                0.0
## 131               0.0                0.0
## 132               0.1                0.0
## 133               0.0                0.0
## 134               0.2                0.0
## 135               0.1                0.0
## 136               0.0                0.0
## 137               0.0                0.0
## 138               0.1                0.0
## 139               0.0                0.0
## 140               0.0                0.0
## 141               0.2                0.0
## 142               0.1                0.0
## 143               0.1                0.0
## 144               0.1                0.0
## 145               0.0                0.0
## 146               0.0                0.0
## 147               0.0                0.0
## 148               0.0                0.0
## 149               0.0                0.0
## 150               0.0                0.0
## 151               0.1                0.0
## 152               0.0                0.0
## 153               0.0                0.0
## 154               0.0                0.0
## 155               0.0                0.0
## 156               0.0                0.0
## 157               0.0                0.0
## 158               0.0                0.0
## 159               0.1                0.0
## 160               0.0                0.0
## 161               0.0                0.0
## 162               0.0                0.0
## 163               0.1                0.0
## 164               0.0                0.0
## 165               0.0                0.0
## 166               0.0                0.0
## 167               0.0                0.0
## 168               0.0                0.0
## 169               0.0                0.0
## 170               0.0                0.0
## 171               0.0                0.0
## 172               0.0                0.0
## 173               0.0                0.0
## 174               0.0                0.0
## 175               0.0                0.0
## 176               0.0                0.0
## 177               0.0                0.0
## 178               0.0                0.0
## 179               0.0                0.0
## 180               0.0                0.0
## 181               0.0                0.0
## 182               0.0                0.0
## 183               0.0                0.0
## 184               0.0                0.0
## 185               0.0                0.0
## 186               0.0                0.0
## 187               0.0                0.0
## 188               0.0                0.0
## 189               0.0                0.0
## 190               0.0                0.0
## 191               0.0                0.0

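Explanation: Each function gives a different view of the same data frame. summary() shows, for example, that tv_audience_share ranges from 0 to 14.8 with a median of 0.1. str() shows that the data frame has 191 observations of 6 variables, and that country and confederation are character columns while the three share columns are numeric. unique() returned all 191 rows, which tells us that no row is duplicated; it is usually more informative on a single column, where it lists the distinct values.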
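
To make the difference easier to see, here is a small sketch on a toy data frame. It reuses the first six rows of the preview above and adds one deliberately repeated row (my own addition, not in the real data) so that unique() has something to remove.

```r
# Toy data frame: first six rows of the preview, plus a repeated
# Japan row added on purpose (an assumption for illustration).
wc <- data.frame(
  country = c("United States", "Japan", "China", "Germany",
              "Brazil", "United Kingdom", "Japan"),
  confederation = c("CONCACAF", "AFC", "AFC", "UEFA",
                    "CONMEBOL", "UEFA", "AFC"),
  tv_audience_share = c(4.3, 4.9, 14.8, 2.9, 7.1, 2.1, 4.9)
)

summary(wc$tv_audience_share)  # min, quartiles, mean, and max of one column
str(wc)                        # number of rows and columns, plus each column's type

# Distinct values of one column, in order of first appearance.
confeds <- unique(wc$confederation)
confeds

# Distinct rows of the whole data frame: the repeated Japan row is dropped.
wc_distinct <- unique(wc)
nrow(wc_distinct)
```

Here unique() on the confederation column answers “which confederations appear?”, while unique() on the data frame removes the repeated row.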


Demonstrate each of the main dplyr verbs in Questions 2-7. In your examples, these may be used alone or together with other functions, but your definition and explanation must focus specifically on the function in the prompt.

2. Use select to manipulate a dataframe

Package: dplyr

Definition: select() keeps only the columns you name and drops the rest, returning a data frame with those columns in the order given.

Code example:

# Keep only the three columns of interest
country_tv_share <- fifa_wc_audience %>%
  select(country, confederation, tv_audience_share)
head(country_tv_share)
##          country confederation tv_audience_share
## 1  United States      CONCACAF               4.3
## 2          Japan           AFC               4.9
## 3          China           AFC              14.8
## 4        Germany          UEFA               2.9
## 5         Brazil      CONMEBOL               7.1
## 6 United Kingdom          UEFA               2.1

Explanation: The input was the full fifa_wc_audience data frame with six columns. I asked R to select only the country, confederation, and tv_audience_share columns, and the output is a new data frame, country_tv_share, that contains just those three columns.
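
select() can also pick columns by a pattern in their names or drop a column instead of keeping it. The sketch below uses a toy data frame built from the first three rows of the preview (the name wc is my own); ends_with() is one of the selection helpers that come with dplyr.

```r
library(dplyr)

# Toy data frame built from the first three rows of the preview above.
wc <- data.frame(
  X = 1:3,
  country = c("United States", "Japan", "China"),
  confederation = c("CONCACAF", "AFC", "AFC"),
  population_share = c(4.5, 1.9, 19.5),
  tv_audience_share = c(4.3, 4.9, 14.8)
)

# Keep country plus every column whose name ends in "_share".
shares <- wc %>% select(country, ends_with("_share"))

# Drop a single column by putting a minus sign in front of its name.
no_index <- wc %>% select(-X)

names(shares)
names(no_index)
```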


3. Use arrange to manipulate a dataframe

Package: dplyr

Definition: arrange() sorts the rows of a data frame by the values of one or more columns. The order is ascending (low to high) by default; wrapping a column in desc() sorts it from high to low.

Code example:

# Sort rows from the highest to the lowest gdp_weighted_share
fifa_wc_audience1 <- fifa_wc_audience %>%
  arrange(desc(gdp_weighted_share))
head(fifa_wc_audience1)
##   X        country confederation population_share tv_audience_share
## 1 1  United States      CONCACAF              4.5               4.3
## 2 2          Japan           AFC              1.9               4.9
## 3 3          China           AFC             19.5              14.8
## 4 4        Germany          UEFA              1.2               2.9
## 5 5         Brazil      CONMEBOL              2.8               7.1
## 6 6 United Kingdom          UEFA              0.9               2.1
##   gdp_weighted_share
## 1               11.3
## 2                9.1
## 3                7.3
## 4                6.3
## 5                5.4
## 6                4.2

Explanation: The rows are now sorted by gdp_weighted_share from high to low. arrange(desc(column)) sorts in descending order, while arrange(column) on its own sorts in ascending order. The first six rows look the same as in the original preview because this dataset was already ordered by gdp_weighted_share.
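
Because the real dataset was already sorted, the effect of arrange() is easier to see on a different column. This is a small sketch on a toy data frame made from the first six rows of the preview (the name wc is my own).

```r
library(dplyr)

# Toy data frame: first six rows of the preview above.
wc <- data.frame(
  country = c("United States", "Japan", "China", "Germany",
              "Brazil", "United Kingdom"),
  confederation = c("CONCACAF", "AFC", "AFC", "UEFA", "CONMEBOL", "UEFA"),
  tv_audience_share = c(4.3, 4.9, 14.8, 2.9, 7.1, 2.1)
)

# Ascending order is the default.
by_tv <- wc %>% arrange(tv_audience_share)

# Sort by confederation first, then by TV share (high to low) within it.
by_confed <- wc %>% arrange(confederation, desc(tv_audience_share))

by_tv$country
by_confed$country
```

With two sorting columns, the second one only breaks ties in the first, so China comes before Japan inside AFC.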


4. Use filter to manipulate a dataframe

Package: dplyr

Definition: filter() keeps only the rows that satisfy the condition(s) you give it. Several conditions separated by commas are combined with AND, so a row is kept only if it meets all of them.

Code example:

# Keep rows that meet both conditions
fifa_wc_audience2 <- fifa_wc_audience %>%
  filter(population_share > 2.0, gdp_weighted_share > 1.0)
head(fifa_wc_audience2)
##    X       country confederation population_share tv_audience_share
## 1  1 United States      CONCACAF              4.5               4.3
## 2  3         China           AFC             19.5              14.8
## 3  5        Brazil      CONMEBOL              2.8               7.1
## 4  9        Russia          UEFA              2.1               3.1
## 5 12     Indonesia           AFC              3.5               6.7
##   gdp_weighted_share
## 1               11.3
## 2                7.3
## 3                5.4
## 4                3.5
## 5                2.9

Explanation: Here I only wanted the countries whose population share was higher than 2.0 and whose GDP-weighted share was higher than 1.0. Five of the 191 countries meet both conditions: the United States, China, Brazil, Russia, and Indonesia.
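
Conditions do not have to be combined with AND. The sketch below, on a toy data frame made from the first six rows of the preview (the name wc is my own), shows an OR condition with | and a membership test with %in%.

```r
library(dplyr)

# Toy data frame: first six rows of the preview above.
wc <- data.frame(
  country = c("United States", "Japan", "China", "Germany",
              "Brazil", "United Kingdom"),
  confederation = c("CONCACAF", "AFC", "AFC", "UEFA", "CONMEBOL", "UEFA"),
  population_share = c(4.5, 1.9, 19.5, 1.2, 2.8, 0.9),
  tv_audience_share = c(4.3, 4.9, 14.8, 2.9, 7.1, 2.1)
)

# OR: keep a row if at least one of the two conditions is true.
either <- wc %>% filter(population_share > 2.0 | tv_audience_share > 4.5)

# %in%: keep rows whose confederation is one of the listed values.
afc_uefa <- wc %>% filter(confederation %in% c("AFC", "UEFA"))

either$country
afc_uefa$country
```

Japan is kept by the OR filter even though its population share is below 2.0, because its TV audience share is above 4.5.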


5. Use mutate to manipulate a dataframe

Package: dplyr

Definition: mutate() adds new columns computed from existing ones or overwrites existing columns, keeping all rows. It can also remove a column by setting it to NULL.

Code example:

# Flag countries with a TV audience share of at least 1.5
fifa_wc_audience3 <- fifa_wc_audience %>%
  mutate(popular_football_country = ifelse(tv_audience_share >= 1.5, 1, 0))
head(fifa_wc_audience3)
##   X        country confederation population_share tv_audience_share
## 1 1  United States      CONCACAF              4.5               4.3
## 2 2          Japan           AFC              1.9               4.9
## 3 3          China           AFC             19.5              14.8
## 4 4        Germany          UEFA              1.2               2.9
## 5 5         Brazil      CONMEBOL              2.8               7.1
## 6 6 United Kingdom          UEFA              0.9               2.1
##   gdp_weighted_share popular_football_country
## 1               11.3                        1
## 2                9.1                        1
## 3                7.3                        1
## 4                6.3                        1
## 5                5.4                        1
## 6                4.2                        1

Explanation: Here I added a new column, popular_football_country, that tells me whether football is particularly popular in a country: it shows 1 if the country has a tv_audience_share greater than or equal to 1.5 and 0 otherwise. All six countries in the preview show 1 because each has a TV audience share of at least 2.1.
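
The other two uses named in the definition (overwriting a column and removing one) can be sketched the same way. The toy data frame and the column name audience_to_population are my own choices for illustration.

```r
library(dplyr)

# Toy data frame: first three rows of the preview above.
wc <- data.frame(
  X = 1:3,
  country = c("United States", "Japan", "China"),
  population_share = c(4.5, 1.9, 19.5),
  tv_audience_share = c(4.3, 4.9, 14.8)
)

wc3 <- wc %>%
  mutate(
    # New column computed from two existing columns.
    audience_to_population = tv_audience_share / population_share,
    # Overwrite an existing column (rounded to whole numbers).
    tv_audience_share = round(tv_audience_share),
    # Remove a column by setting it to NULL.
    X = NULL
  )
wc3
```

The expressions run in order, so the ratio is computed from the unrounded shares before tv_audience_share is overwritten.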


6. Use summarize to manipulate a dataframe

Package: dplyr

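Definition: summarize() (also spelled summarise()) collapses a data frame into summary values such as a mean or a count. The result has one row for each combination of the grouping variables, or a single row if the data is not grouped.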

Code example:

# One row per confederation, holding the mean GDP-weighted share
fifa_wc_audience4 <- fifa_wc_audience %>%
  group_by(confederation) %>%
  summarise(mean(gdp_weighted_share))
head(fifa_wc_audience4)
## # A tibble: 6 × 2
##   confederation `mean(gdp_weighted_share)`
##   <chr>                              <dbl>
## 1 AFC                              0.735  
## 2 CAF                              0.052  
## 3 CONCACAF                         0.527  
## 4 CONMEBOL                         1.03   
## 5 OFC                              0.00833
## 6 UEFA                             0.848

Explanation: The output is a new data frame with one row per confederation instead of one row per country. For each confederation, summarise() calculated the mean gdp_weighted_share of its countries and stored it in a separate column. For example, CONMEBOL has the highest mean (1.03) and OFC the lowest (0.00833).
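
The summary column above has the awkward name `mean(gdp_weighted_share)`. As a sketch of an alternative, summarise() accepts name = value pairs, and n() counts the rows in each group. The toy data frame and the column names n_countries and mean_gdp are my own choices.

```r
library(dplyr)

# Toy data frame: first six rows of the preview above.
wc <- data.frame(
  country = c("United States", "Japan", "China", "Germany",
              "Brazil", "United Kingdom"),
  confederation = c("CONCACAF", "AFC", "AFC", "UEFA", "CONMEBOL", "UEFA"),
  gdp_weighted_share = c(11.3, 9.1, 7.3, 6.3, 5.4, 4.2)
)

by_confed <- wc %>%
  group_by(confederation) %>%
  summarise(
    n_countries = n(),                    # number of rows in each group
    mean_gdp = mean(gdp_weighted_share)   # named summary column
  )
by_confed

# Without group_by(), summarise() returns a single row for the whole table.
overall <- wc %>% summarise(mean_gdp = mean(gdp_weighted_share))
overall
```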


7. Use group_by to manipulate a dataframe

Package: dplyr

Definition: group_by() groups a data frame by one or more variables. It does not change the data itself; it marks which rows belong to which group, so that later verbs such as summarize() or mutate() work within each group independently of the others.

Code example:

# Group by confederation, then take the mean of each share column
fifa_wc_audience5 <- fifa_wc_audience %>%
  group_by(confederation) %>%
  summarise(mean(gdp_weighted_share),
            mean(tv_audience_share),
            mean(population_share))
head(fifa_wc_audience5)
## # A tibble: 6 × 4
##   confederation `mean(gdp_weighted_share)` `mean(tv_audience_share)`
##   <chr>                              <dbl>                     <dbl>
## 1 AFC                              0.735                      0.991 
## 2 CAF                              0.052                      0.172 
## 3 CONCACAF                         0.527                      0.327 
## 4 CONMEBOL                         1.03                       1.35  
## 5 OFC                              0.00833                    0.0167
## 6 UEFA                             0.848                      0.548 
## # ℹ 1 more variable: `mean(population_share)` <dbl>

Explanation: Running the code creates a new data frame grouped by confederation, which means each observation is now a confederation rather than a country. For every confederation we get the mean of gdp_weighted_share, tv_audience_share, and population_share.
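
group_by() can also be followed by mutate() when we want to transform values within each group but keep every original row. The sketch below uses the built-in mtcars dataset; the column name `mpg_minus_group_mean` is made up for this example.

```r
library(dplyr)

# Compare each car's mpg with the average mpg of cars that
# have the same number of cylinders
mtcars_centered <- mtcars %>%
  group_by(cyl) %>%
  mutate(mpg_minus_group_mean = mpg - mean(mpg)) %>%
  ungroup()  # drop the grouping so later steps use the whole data frame

head(mtcars_centered)
```

Unlike summarise(), this keeps all 32 rows of mtcars and only adds a column.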


8. Handle NAs in a data frame or in a column

Describe and demonstrate at least two different ways of handling NA values that might appear in a data frame or in a column of a data frame that you are trying to work with.

na.rm, is.na

Package: base, dplyr

Definition: NA values are values that are missing from the dataset. NA stands for “Not Available”. NA values can interfere with calculations (for example, the mean of a column that contains an NA is NA), so it is important to decide how to handle them, either by excluding them from the calculation or by removing or replacing them.

Code example 1:

mean(us_voter_turnout$votes)
## [1] NA
mean(us_voter_turnout$votes,na.rm=TRUE)
## [1] 3074280

Code example 2:

us_voter_turnout_by_year<-us_voter_turnout%>%
  group_by(year)%>%
  summarise(
    count=sum(!is.na(votes)),
    mean_voter_ratio=mean(votes/eligible_voters,na.rm=TRUE))

Explanation: The two code examples above show two different ways to exclude NA values. In the first example, the mean of votes is NA when the missing values are left in, but once we add na.rm=TRUE the missing values are dropped and we get an actual mean. In the second example I used dplyr to group by year and count the states whose data was present in each year. States with missing data have NA in votes, and I wanted to count only the states with data, so I used !is.na() to count the non-missing values. Using is.na() by itself would instead count only the NA values.
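
Two further options are dropping incomplete rows and replacing missing values. This is a rough sketch using the built-in airquality dataset, which has NAs in some columns. Replacing with the median is only one possible choice, and whether it is appropriate depends on the analysis.

```r
# Count missing values in each column
na_counts <- colSums(is.na(airquality))
na_counts

# Option A: drop every row that has at least one NA
aq_complete <- na.omit(airquality)

# Option B: keep all rows and replace missing Ozone values
# with the median of the observed Ozone values
aq_filled <- airquality
aq_filled$Ozone[is.na(aq_filled$Ozone)] <- median(airquality$Ozone, na.rm = TRUE)
```

Option A loses rows, while Option B keeps the sample size but changes the spread of the data, so the choice should be stated in the write-up.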


9. Use conditional statements to manipulate a dataframe

ifelse

Package:

Definition:

Code example:

Explanation:


10. Bring multiple datasets together by stacking them

rbind

Package:

Definition:

Code example:

Explanation:


11. Bring multiple datasets together using merge and/or join

Package:

Definition:

Code example:

Explanation:


12. Use Regular Expressions or stringr functions to manipulate text data

function_name

Package:

Definition:

Code example:

Explanation:


13. Transform and work with datetime values

function_name

Package:

Definition:

Code example:

Explanation:


14. Write your own function to automate a task

function

Package:

Definition:

Code example:

Explanation:


Polished Data Visualization using ggplot2

**All visuals should use functions from the ggplot2 and/or ggmap libraries

15. Histogram

Package: ggplot2

Definition: A histogram is a data visualization that divides the x axis into bins, counts the number of observations in each bin, and displays the counts with bars.

Code example:

ggplot(fifa_wc_audience,aes(x=gdp_weighted_share))+
  geom_histogram(color="black",fill="skyblue")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Explanation: After running the code we get a histogram that shows the distribution of gdp_weighted_share across countries. From the histogram we see that the majority of the countries in the data have a gdp_weighted_share of about 0-1%.


16. Frequency Polygon

Package: ggplot2

Definition: A frequency polygon is used like a histogram, but instead of displaying the frequency of each bin with bars, it connects the bin counts with straight lines.

Code example:

ggplot(tdd_results,aes(x=age))+
   geom_freqpoly(binwidth=2, color="blue") +
  labs(title = "Frequency Polygon of Age", x = "Age", y = "Frequency")

Explanation: When we run the code we get a frequency polygon that shows how often each age occurs among the people taking part in a donut race in Ohio. A frequency polygon suits this variable because age is continuous and can be divided into bins (here, 2 years wide), just as in a histogram.


17. Box plot

Package: ggplot2

Definition: A boxplot compactly displays the distribution of a continuous variable. It visualizes five summary statistics (the median, two hinges and two whiskers), and all “outlying” points individually.

Code example:

ggplot(fifa_wc_audience,aes(x=confederation,y=tv_audience_share))+
  geom_boxplot()

Explanation: The box plot shows that UEFA and CONMEBOL have the highest and most varied TV audience shares, with UEFA having the largest number of outliers. CONMEBOL displays a more consistent range of values, though still with a few high outliers. In contrast, AFC, CAF, CONCACAF, and OFC have much lower audience shares, with fewer outliers and less variability. This suggests UEFA and CONMEBOL teams generally attract larger TV audiences compared to other confederations.


18. Scatter plot

Package: ggplot2

Definition: A scatterplot is most useful for displaying the relationship between two continuous variables. It can also be used to compare one continuous and one categorical variable, or two categorical variables, although the points then tend to overlap and other plot types usually work better.

Code example:

ggplot(fifa_wc_audience,aes(x=gdp_weighted_share,y=tv_audience_share))+
  geom_point()+
  geom_smooth(method="lm", linetype=1)

Explanation: With the above graph I wanted to see whether a country with a higher gdp_weighted_share also has a larger TV audience share. To do that I plotted the independent variable on the x axis and the dependent one on the y axis, and added a fitted straight line with geom_smooth(method="lm"). From the graph we can see a weak positive correlation between the two variables.


19. Line plot

Package: ggplot2

Definition: A line plot is a kind of chart that joins data points with lines to show trends in data over time or categories. The x-axis shows the time or categories the measurement occurred in, and the y-axis shows the value. It’s helpful for finding patterns such as rises or falls, and it works effectively to compare many data sets on one graph.

Code example:

us_voter_turnout_by_year<-us_voter_turnout%>%
  group_by(year)%>%
  summarise(
    mean_votes=mean(votes,na.rm=TRUE)
  )



ggplot(us_voter_turnout_by_year,aes(x=year,y=mean_votes))+
  geom_line()+
  labs(title="Distribution of Votes by Year",
  x="Year",
  y="Votes")

Explanation: To see how the mean number of votes has changed over time, we first compute the mean of votes for each year and then draw a line plot of those means, so each point on the line is one congress session. Note: The graph has dips every two years since each session lasts two years.


20. Your favorite other kind of ggplot, not yet demonstrated

geom_bar

Package: ggplot2

Definition: A bar plot is a graph that shows the frequencies of a non-continuous (categorical) variable. It is similar to a histogram in that it uses bars to display counts, but a histogram shows frequencies for a continuous variable split into bins.

Code example:

ggplot(fifa_wc_audience,aes(x=confederation))+
   geom_bar(position = "dodge", alpha = 0.8, color="black",fill="blue") +
  labs(title = "Bar Plot of No.of Countries in Confederations",
       x = "Confederation", y = "Count") 

Explanation: Here I wanted to find the number of countries in each confederation. Confederation, being a non-continuous categorical variable, fits on the x axis of the bar plot. We can easily conclude that CAF has the highest number of countries while CONMEBOL has the lowest number of countries in the data.


21.Geographic Map; function_name

Package:

Definition:

Code example:

Explanation:


22. a polished table using kable, xtable, or pander**

function_name

Package:

Definition:

Code example:

Explanation:


Statistical and Analytical Tools

23. Demonstrate how you could calculate all of these key summary statistics and describe what you can learn from them:

mean, median, max, min, interquartile range, standard deviation. You may choose to use one or more functions or code statements to demonstrate and explain.

Package: dplyr, stats

Definition: These tools allow for the calculation of key summary statistics, providing insights into the distribution and characteristics of a dataset. Here, it demonstrates how to calculate the mean, median, maximum, minimum, interquartile range, and standard deviation using the dplyr package

Code example:

summary_stats <- fifa_wc_audience %>%
  summarise(
    Mean = mean(population_share,na.rm=TRUE),
    Median = median(population_share),
    Maximum = max(population_share),
    Minimum = min(population_share),
    Interquartile_Range = IQR(population_share),
    Standard_Deviation = sd(population_share)
  )

print(summary_stats)
##        Mean Median Maximum Minimum Interquartile_Range Standard_Deviation
## 1 0.5225131    0.1    19.5       0                0.35           1.960335

Explanation: In this code, summarise() calculates key summary statistics for the population_share column of the dataset, computing all of them in a single step.

mean(population_share) calculates the mean (average) value of the population_share column.

median(population_share) computes the median, which represents the middle value when the data is sorted in ascending order.

max(population_share) calculates the maximum value in the ‘population_share’ column.

min(population_share) computes the minimum value in the ‘population_share’ column.

IQR(population_share) calculates the interquartile range, which is the range between the 25th and 75th percentiles of the data.

sd(population_share) calculates the standard deviation, which measures the dispersion of data points around the mean.
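
To make the interquartile range less of a black box, here is a short sketch that computes it by hand from the 25th and 75th percentiles and compares it with IQR(). It uses the built-in mtcars dataset, and the object names are illustrative.

```r
x <- mtcars$mpg

# 25th and 75th percentiles (first and third quartiles)
q <- quantile(x, probs = c(0.25, 0.75))

# The interquartile range is the distance between them
iqr_by_hand <- unname(q[2] - q[1])

# All six summary statistics collected in one named vector
stats <- c(
  mean   = mean(x),
  median = median(x),
  max    = max(x),
  min    = min(x),
  iqr    = IQR(x),
  sd     = sd(x)
)
stats
```

Comparing the mean with the median is a quick check for skew: in the population_share output above the mean (0.52) is much larger than the median (0.1), which points to a few very large countries pulling the mean up.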


24. One sample t-test

Package: stats

Definition: A t-test is a type of hypothesis test about means. A one-sample t-test compares the mean of a single sample to a known or hypothesized population mean to see if there is a significant difference between them.

Code example:

t.test(fifa_wc_audience$population_share,mu=0.5)
## 
##  One Sample t-test
## 
## data:  fifa_wc_audience$population_share
## t = 0.15872, df = 190, p-value = 0.8741
## alternative hypothesis: true mean is not equal to 0.5
## 95 percent confidence interval:
##  0.2427202 0.8023060
## sample estimates:
## mean of x 
## 0.5225131

Explanation: The one-sample t-test was run on the population_share column of the fifa_wc_audience dataset. The null hypothesis is that the true mean population share equals 0.5; the alternative is that it does not. The sample mean is 0.52 and the p-value is 0.87, far above 0.05, so we cannot reject the null hypothesis that mu = 0.5. The 95% confidence interval (0.24 to 0.80) contains 0.5, which is consistent with that conclusion.
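
The t statistic reported by t.test() can be reproduced by hand, which helps show what the test is measuring. This sketch uses the built-in mtcars dataset, and the hypothesized mean `mu0 = 20` is an arbitrary value picked only for illustration.

```r
x   <- mtcars$mpg
mu0 <- 20  # hypothesized mean, chosen only for illustration

# t = (sample mean - hypothesized mean) / standard error of the mean
t_by_hand <- (mean(x) - mu0) / (sd(x) / sqrt(length(x)))

# Two-sided p-value from the t distribution with n - 1 degrees of freedom
p_by_hand <- 2 * pt(-abs(t_by_hand), df = length(x) - 1)

# The built-in test should agree with the manual calculation
result <- t.test(x, mu = mu0)
c(by_hand = t_by_hand, from_t_test = unname(result$statistic))
```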


25. Two sample t-test

Package: stats

Definition: The two sample t-test compares the means of two independent samples to see if there is a significant difference between them.

Code example:

fifa_wc_audience_americas<-fifa_wc_audience%>%
  filter(confederation=="CONCACAF"|confederation=="CONMEBOL")

t.test(fifa_wc_audience_americas$tv_audience_share~fifa_wc_audience_americas$confederation)
## 
##  Welch Two Sample t-test
## 
## data:  fifa_wc_audience_americas$tv_audience_share by fifa_wc_audience_americas$confederation
## t = -1.4978, df = 10.267, p-value = 0.1643
## alternative hypothesis: true difference in means between group CONCACAF and group CONMEBOL is not equal to 0
## 95 percent confidence interval:
##  -2.5402968  0.4936302
## sample estimates:
## mean in group CONCACAF mean in group CONMEBOL 
##              0.3266667              1.3500000

Explanation: Here we test whether the difference in mean tv_audience_share between the CONCACAF and CONMEBOL confederations is 0. Our p-value is 0.16, which is higher than alpha (0.05), so we do not have enough evidence to reject the null hypothesis that the difference in means is 0.
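
The same kind of test can be written with the `data` argument so the data frame name is not repeated, and `var.equal = TRUE` switches from the default Welch test to the pooled-variance version. This is a sketch on the built-in mtcars dataset, comparing 4- and 8-cylinder cars.

```r
# Keep only two groups so the formula has exactly two levels
two_groups <- subset(mtcars, cyl %in% c(4, 8))

# Default: Welch test, which does not assume equal variances
welch <- t.test(mpg ~ cyl, data = two_groups)

# Classic pooled-variance test, which assumes equal variances
pooled <- t.test(mpg ~ cyl, data = two_groups, var.equal = TRUE)

welch$method
pooled$method
```

The Welch version is usually the safer default because it still works when the two groups have different spreads.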


26. Compare and contrast Correlation versus Correlation Test

cor and cor.test

Package: stats

Definition: cor() calculates the correlation coefficient between two variables (Pearson by default; rank-based Spearman or Kendall coefficients are available through the method argument). cor.test() calculates the same coefficient and also tests whether it is statistically significantly different from 0.

Code example:

cor(fifa_wc_audience$population_share, fifa_wc_audience$tv_audience_share)
## [1] 0.7313239
cor.test(fifa_wc_audience$population_share, fifa_wc_audience$tv_audience_share)
## 
##  Pearson's product-moment correlation
## 
## data:  fifa_wc_audience$population_share and fifa_wc_audience$tv_audience_share
## t = 14.741, df = 189, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6576280 0.7911553
## sample estimates:
##       cor 
## 0.7313239

Explanation: Both functions tell us the correlation between the two variables in the dataset. As we can see, cor() by itself only returns the correlation coefficient (0.73). cor.test() gives us the same coefficient together with the t statistic, the p-value, and a 95% confidence interval. cor.test() is the more informative of the two since it lets us see whether the correlation we obtained is statistically significant.
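
A short sketch of how the two functions relate, using the built-in mtcars dataset. It also shows the rank-based Spearman coefficient through the `method` argument.

```r
# Correlation coefficient only (Pearson by default)
r <- cor(mtcars$wt, mtcars$mpg)

# Same coefficient plus a significance test
ct <- cor.test(mtcars$wt, mtcars$mpg)

# Rank-based alternative that is less sensitive to outliers
rho <- cor(mtcars$wt, mtcars$mpg, method = "spearman")

c(pearson = r, from_cor_test = unname(ct$estimate), spearman = rho)
ct$p.value
```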


27. Simple linear regression (bivariate)

lm()

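Package: stats

Definition: lm() fits a straight line that describes how one outcome variable changes with one predictor variable. It is used to check whether two variables are linearly related, for example whether a change in variable x is associated with a change in variable y.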

Code example:

model1<-lm(tv_audience_share~population_share,fifa_wc_audience)
summary(model1)
## 
## Call:
## lm(formula = tv_audience_share ~ population_share, data = fifa_wc_audience)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.7579 -0.2405 -0.1946 -0.0486  5.3454 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       0.24048    0.07424   3.239  0.00142 ** 
## population_share  0.54076    0.03668  14.741  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9912 on 189 degrees of freedom
## Multiple R-squared:  0.5348, Adjusted R-squared:  0.5324 
## F-statistic: 217.3 on 1 and 189 DF,  p-value: < 2.2e-16

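Explanation: The estimated slope is 0.54, so if we drew a line through a scatterplot of these variables it would rise by 0.54 for every one-unit increase on the x axis. In other words, when population_share increases by 1 percentage point, tv_audience_share is predicted to increase by about 0.54 percentage points on average. The slope by itself does not tell us how strong the relationship is; for that we look at R-squared, which is 0.53, meaning population share explains about 53% of the variation in TV audience share. The p-value for the slope is very low (< 2e-16), so we reject the null hypothesis. Note: The null hypothesis for a linear regression is that there is no linear relationship between the two variables, i.e. the slope is 0.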
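
For a regression with one predictor, the slope and R-squared are tied directly to the correlation coefficient. The sketch below checks both relationships on the built-in mtcars dataset; the object names are illustrative.

```r
fit <- lm(mpg ~ wt, data = mtcars)

# Slope estimated by lm()
slope <- unname(coef(fit)["wt"])

# In simple regression: slope = r * sd(y) / sd(x)
r <- cor(mtcars$wt, mtcars$mpg)
slope_from_r <- r * sd(mtcars$mpg) / sd(mtcars$wt)

# In simple regression: R-squared = r^2
r_squared <- summary(fit)$r.squared

c(slope = slope, slope_from_r = slope_from_r)
c(r_squared = r_squared, r_squared_from_r = r^2)
```

This is why the t value and p-value for the slope in #27 match the ones from cor.test() in #26.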


28. Multiple linear regression (multivariate)

lm()

Package: stats

Definition: It estimates the coefficients for a linear model with multiple predictor variables and one outcome variable.

Code example:

model2<-lm(tv_audience_share~population_share+gdp_weighted_share,fifa_wc_audience)
summary(model2)
## 
## Call:
## lm(formula = tv_audience_share ~ population_share + gdp_weighted_share, 
##     data = fifa_wc_audience)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.6141 -0.0850 -0.0493  0.0150  3.8005 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         0.04928    0.05275   0.934    0.351    
## population_share    0.35733    0.02822  12.661   <2e-16 ***
## gdp_weighted_share  0.55157    0.03796  14.532   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6821 on 188 degrees of freedom
## Multiple R-squared:  0.7809, Adjusted R-squared:  0.7786 
## F-statistic: 335.1 on 2 and 188 DF,  p-value: < 2.2e-16

Explanation: Now we look at the effect of gdp_weighted_share on tv_audience_share in addition to population_share, which we tested in #27. The important thing to note is that each coefficient is estimated with the other predictor held constant. The 0.357 for population_share is therefore the predicted change in tv_audience_share for a one-unit increase in population_share when gdp_weighted_share does not change. Adding the second predictor raised the adjusted R-squared from 0.53 to 0.78.
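
The phrase "held constant" can be checked directly with predict(). In this sketch on the built-in mtcars dataset, two hypothetical cars differ by one unit of `wt` and have the same `hp`, so the difference between their predictions should equal the `wt` coefficient.

```r
fit2 <- lm(mpg ~ wt + hp, data = mtcars)

# Two made-up cars: same horsepower, weight differs by exactly 1 unit
base_car    <- data.frame(wt = 3, hp = 110)
heavier_car <- data.frame(wt = 4, hp = 110)

# Difference in predicted mpg between the two cars
diff_pred <- unname(predict(fit2, newdata = heavier_car) -
                    predict(fit2, newdata = base_car))

c(difference_in_predictions = diff_pred,
  wt_coefficient = unname(coef(fit2)["wt"]))
```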


29. Multicollinearity

ggpairs

Package: GGally

Definition: Multicollinearity occurs when two or more predictor variables in a regression model are highly correlated with each other, which can lead to issues with the model’s stability and interpretation.

Code example:

ggpairs(fifa_wc_audience[, c("tv_audience_share", "population_share", "gdp_weighted_share")])

Explanation: This uses the function ggpairs() from the GGally library to create a pair plot that visually explores the associations between tv_audience_share, population_share, and gdp_weighted_share in the fifa_wc_audience data frame. The pair plot shows a scatterplot for each pair of variables, the distribution of each individual variable, and the correlation coefficients, which gives an overview of how the variables relate to one another. This helps to identify multicollinearity, a situation where the predictor variables are highly correlated with each other. For model2 the pair to check is population_share and gdp_weighted_share, since tv_audience_share is the outcome. A very high correlation between predictors suggests that they carry overlapping information, which can make the estimated regression coefficients unstable and hard to interpret.
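
Besides looking at pairwise correlations, a common numeric check is the variance inflation factor (VIF). The sketch below computes it by hand in base R on the built-in mtcars dataset: regress one predictor on the others and use 1 / (1 - R-squared). The choice of predictors is only for illustration.

```r
# Pairwise correlations between three candidate predictors
pairwise <- cor(mtcars[, c("wt", "hp", "disp")])
pairwise

# VIF for wt: how well do the other predictors explain wt?
aux <- lm(wt ~ hp + disp, data = mtcars)
vif_wt <- 1 / (1 - summary(aux)$r.squared)
vif_wt
```

A VIF of 1 means the predictor is not explained by the others at all, and larger values mean more overlap. Which cutoff counts as "too high" is a rule of thumb that varies between textbooks.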


30. Validate your model

gg_diagnose or autoplot

Package: ggfortify

Definition: A group of 4 graphs that validate our regression model and help us check the assumptions of a linear regression

Code example:

model1<-lm(tv_audience_share~population_share,fifa_wc_audience)
summary(model1)
## 
## Call:
## lm(formula = tv_audience_share ~ population_share, data = fifa_wc_audience)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.7579 -0.2405 -0.1946 -0.0486  5.3454 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       0.24048    0.07424   3.239  0.00142 ** 
## population_share  0.54076    0.03668  14.741  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9912 on 189 degrees of freedom
## Multiple R-squared:  0.5348, Adjusted R-squared:  0.5324 
## F-statistic: 217.3 on 1 and 189 DF,  p-value: < 2.2e-16
autoplot(model1)

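Explanation: autoplot() gives us 4 graphs to check our regression. In the first plot (Residuals vs Fitted) we want to see a random scatter around zero without any specific pattern or “fan shape”, i.e. the points should not start fanning out after a certain point. The second plot (Normal Q-Q) checks the same thing as a histogram of residuals, which is whether the residuals are normally distributed. The points should roughly follow the line; deviations, especially at the tails, suggest non-normality. The third plot (Scale-Location) is another check for constant variance, and the fourth (Residuals vs Leverage) helps us find observations that have a large influence on the fitted line.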


31. Make a histogram of your residuals

as.data.frame, geom_histogram

Package: ggplot2

Definition: A residual is the difference between the actual value and the value predicted by the model. One of the assumptions of linear regression is that the residuals are normally distributed, which we can check by seeing whether a histogram of the residuals looks roughly bell-shaped and centered on zero.

Code example:

resid<-as.data.frame(model1$residuals)
colnames(resid)<-"residuals"

ggplot(resid, aes(residuals))+
  geom_histogram(binwidth = 0.6, fill="red",
                 color="black", alpha=0.7)+
  labs(title="Histogram of Residuals",
       x="Residuals",
       y="Frequency")

Explanation: The code first pulls the residuals (the differences between actual and predicted values) out of the model and turns them into a data frame for easier manipulation. It then uses ggplot2 to plot a histogram of the residuals, mapping the residuals column to the x axis. This histogram displays the distribution of the residuals, which is useful for checking whether they are approximately normally distributed, a fundamental assumption of linear regression models.
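
To see where the residuals come from, the sketch below recomputes them as actual minus fitted values and compares them with what the model stores. It uses the built-in mtcars dataset rather than model1.

```r
fit <- lm(mpg ~ wt, data = mtcars)

# Residual = actual value - value predicted by the model
res_by_hand <- mtcars$mpg - fitted(fit)

# These should match the residuals stored in the model object
head(cbind(by_hand = res_by_hand, from_model = residuals(fit)))

# With an intercept in the model, the residuals average out to zero
mean(residuals(fit))
```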


32. Make predictions using predict

Package: stats

Definition: Function for predictions from the results of various model fitting functions.

Code example:

predicted_values <- predict(model1, newdata = fifa_wc_audience) 
predicted_values
##          1          2          3          4          5          6          7 
##  2.6739205  1.2679327 10.7853883  0.8893975  1.7546208  0.7271682  0.7271682 
##          8          9         10         11         12         13         14 
##  0.7271682  1.3760856  0.6190153  0.6190153  2.1331559  1.1597798  0.8353211 
##         15         16         17         18         19         20         21 
##  0.7812446  0.5649388  0.3486330  0.5649388  0.4567859  0.4027095  0.5108624 
##         22         23         24         25         26         27         28 
##  0.6190153  0.4567859  0.6190153  0.4567859  0.2945566  1.4842385  0.3486330 
##         29         30         31         32         33         34         35 
##  0.2945566  0.9434740  0.8353211  0.4027095  0.4027095  0.2945566  0.2945566 
##         36         37         38         39         40         41         42 
##  0.4027095  0.3486330  0.3486330  9.7579357  0.3486330  0.8353211  0.2945566 
##         43         44         45         46         47         48         49 
##  0.2945566  0.4567859  0.6190153  0.5108624  0.2945566  0.3486330  0.2945566 
##         50         51         52         53         54         55         56 
##  0.2945566  0.5108624  0.3486330  0.3486330  0.2945566  0.2404801  0.2404801 
##         57         58         59         60         61         62         63 
##  0.2945566  0.2945566  0.2945566  0.2945566  0.2945566  0.2404801  0.2945566 
##         64         65         66         67         68         69         70 
##  0.2945566  0.2945566  0.2404801  0.2404801  0.2404801  0.2945566  0.4567859 
##         71         72         73         74         75         76         77 
##  0.4027095  0.2945566  0.2945566  0.4027095  1.5923914  0.3486330  0.2945566 
##         78         79         80         81         82         83         84 
##  0.2945566  0.2945566  0.3486330  0.4027095  0.2404801  0.2404801  0.2945566 
##         85         86         87         88         89         90         91 
##  0.2945566  0.2945566  0.2404801  0.2404801  0.2945566  0.4027095  0.2404801 
##         92         93         94         95         96         97         98 
##  0.2404801  0.4567859  0.2404801  0.4567859  0.2404801  0.2404801  0.5649388 
##         99        100        101        102        103        104        105 
##  0.2404801  0.9975504  0.2945566  0.2945566  0.2404801  0.2945566  0.4567859 
##        106        107        108        109        110        111        112 
##  0.2404801  0.4027095  0.3486330  0.2404801  0.4027095  0.6730917  0.2945566 
##        113        114        115        116        117        118        119 
##  0.2404801  1.4301621  0.3486330  0.5108624  0.4567859  0.2404801  0.2945566 
##        120        121        122        123        124        125        126 
##  0.2404801  0.5108624  0.2945566  0.2945566  0.2945566  0.2404801  0.2404801 
##        127        128        129        130        131        132        133 
##  0.6190153  0.2945566  0.2404801  0.5108624  0.2404801  0.4027095  0.3486330 
##        134        135        136        137        138        139        140 
##  0.9434740  0.3486330  0.2404801  0.2404801  0.2945566  0.2404801  0.2404801 
##        141        142        143        144        145        146        147 
##  0.7271682  0.3486330  0.3486330  0.2945566  0.2404801  0.2404801  0.3486330 
##        148        149        150        151        152        153        154 
##  0.2404801  0.2404801  0.2404801  0.4027095  0.2404801  0.2404801  0.2945566 
##        155        156        157        158        159        160        161 
##  0.2404801  0.2404801  0.2404801  0.2404801  0.3486330  0.2404801  0.2404801 
##        162        163        164        165        166        167        168 
##  0.2945566  0.2945566  0.2945566  0.2404801  0.3486330  0.2945566  0.2404801 
##        169        170        171        172        173        174        175 
##  0.2404801  0.2404801  0.3486330  0.2945566  0.2945566  0.2404801  0.2404801 
##        176        177        178        179        180        181        182 
##  0.2404801  0.2404801  0.3486330  0.2945566  0.2404801  0.2404801  0.2404801 
##        183        184        185        186        187        188        189 
##  0.2945566  0.2404801  0.2404801  0.2404801  0.2404801  0.2945566  0.2404801 
##        190        191 
##  0.2404801  0.2404801

Explanation: The whole numbers from 1 to 191 in the output are the row numbers of the observations in the data passed to predict(). The value printed underneath each row number is the model’s predicted tv_audience_share for that observation. The row numbers let us link each prediction back to the country it belongs to in the original dataset, which helps when checking the performance of the model or looking at predictions for individual data points.
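
predict() is most useful for values the model has not seen. The sketch below fits a model on the built-in mtcars dataset and predicts for two made-up weights, asking for 95% prediction intervals as well. The data frame `new_cars` is hypothetical.

```r
fit <- lm(mpg ~ wt, data = mtcars)

# New observations must use the same column name as the predictor
new_cars <- data.frame(wt = c(2.5, 3.5))

# fit = predicted value, lwr/upr = bounds of the 95% prediction interval
pred <- predict(fit, newdata = new_cars,
                interval = "prediction", level = 0.95)
pred
```

The result is a matrix with one row per new observation, so it can be attached to `new_cars` with cbind() for reporting.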


33. Interpret predictions using confint

Package: stats

Definition: Computes confidence intervals for one or more parameters in a fitted model.

Code example:

model1<-lm(tv_audience_share~population_share,fifa_wc_audience)
summary(model1)
## 
## Call:
## lm(formula = tv_audience_share ~ population_share, data = fifa_wc_audience)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.7579 -0.2405 -0.1946 -0.0486  5.3454 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       0.24048    0.07424   3.239  0.00142 ** 
## population_share  0.54076    0.03668  14.741  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9912 on 189 degrees of freedom
## Multiple R-squared:  0.5348, Adjusted R-squared:  0.5324 
## F-statistic: 217.3 on 1 and 189 DF,  p-value: < 2.2e-16
confint(model1, level=0.95)
##                       2.5 %    97.5 %
## (Intercept)      0.09403357 0.3869266
## population_share 0.46840276 0.6131263

Explanation: The confint() function calculates the 95% confidence intervals for the regression coefficients (intercept and slope). This interval gives a range in which we can be 95% confident that the true coefficient values lie.
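
The intervals from confint() can be rebuilt from the coefficient table as estimate plus or minus a critical t value times the standard error. This is a sketch on the built-in mtcars dataset; the object names are illustrative.

```r
fit <- lm(mpg ~ wt, data = mtcars)

# Coefficient table with estimates and standard errors
coefs <- summary(fit)$coefficients

# Critical t value for a 95% interval with the residual degrees of freedom
t_crit <- qt(0.975, df = df.residual(fit))

# Lower and upper bounds computed by hand
ci_by_hand <- cbind(
  lower = coefs[, "Estimate"] - t_crit * coefs[, "Std. Error"],
  upper = coefs[, "Estimate"] + t_crit * coefs[, "Std. Error"]
)

ci_by_hand
confint(fit, level = 0.95)
```

If the interval for a slope does not contain 0, that agrees with a p-value below 0.05 for that coefficient.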


34. Shapiro-wilk test

function_name

Package:

Definition:

Code example:

Explanation:


35. Non-constant variance test

function_name

Package:

Definition:

Code example:

Explanation:


36. Stepwise model selection using stepAIC and modelaic$anova output

Package:

Definition:

Code example:

Explanation: