This is an R Markdown document. Please fill in all example text with your own words and definitions. For each function in the code glossary, you should show your code. If you need extra code to do some data wrangling or bring in a new library, please hide it from the rendered html. Each term should have: (1) what package does it come from? (2) general definition - what does the function do and for what kind of tasks/situations would you use it? (3) worked example, using your dataset, or one of the example datasets from an R package (4) text below the example that explains specifically, what the code accomplished. In other words, what was the input and the output? What can you do now that you have run the example that you couldn’t before?
Definition: Tidy data is data in which each column is a variable, each row is an observation, and each cell holds a single value. This layout makes the data easier to interpret and to manipulate.
Code example (show an example dataset preview or header that you can use to explain):
head(fifa_wc_audience)
## X country confederation population_share tv_audience_share
## 1 1 United States CONCACAF 4.5 4.3
## 2 2 Japan AFC 1.9 4.9
## 3 3 China AFC 19.5 14.8
## 4 4 Germany UEFA 1.2 2.9
## 5 5 Brazil CONMEBOL 2.8 7.1
## 6 6 United Kingdom UEFA 0.9 2.1
## gdp_weighted_share
## 1 11.3
## 2 9.1
## 3 7.3
## 4 6.3
## 5 5.4
## 6 4.2
Explanation: The dataset I preview with the head() function is tidy: each row represents one country, each column holds one variable (confederation, population share, TV audience share, and GDP-weighted share), and each cell holds a single value.
summary, str, and unique
Package: base (summary, unique) and utils (str)
Definition: summary() reports summary measures for each column, such as the minimum, maximum, median, mean, and quartiles for numeric columns. str() stands for structure; it gives a compact overview of an object, showing the number of observations and variables plus the type and first few values of each column. unique() returns the object with duplicated elements or rows removed.
Code example:
summary(fifa_wc_audience)
## X country confederation population_share
## Min. : 1.0 Length:191 Length:191 Min. : 0.0000
## 1st Qu.: 48.5 Class :character Class :character 1st Qu.: 0.0000
## Median : 96.0 Mode :character Mode :character Median : 0.1000
## Mean : 96.0 Mean : 0.5225
## 3rd Qu.:143.5 3rd Qu.: 0.3500
## Max. :191.0 Max. :19.5000
## tv_audience_share gdp_weighted_share
## Min. : 0.000 Min. : 0.0000
## 1st Qu.: 0.000 1st Qu.: 0.0000
## Median : 0.100 Median : 0.0000
## Mean : 0.523 Mean : 0.5204
## 3rd Qu.: 0.300 3rd Qu.: 0.3000
## Max. :14.800 Max. :11.3000
str(fifa_wc_audience)
## 'data.frame': 191 obs. of 6 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ country : chr "United States" "Japan" "China" "Germany" ...
## $ confederation : chr "CONCACAF" "AFC" "AFC" "UEFA" ...
## $ population_share : num 4.5 1.9 19.5 1.2 2.8 0.9 0.9 0.9 2.1 0.7 ...
## $ tv_audience_share : num 4.3 4.9 14.8 2.9 7.1 2.1 2.1 2 3.1 1.8 ...
## $ gdp_weighted_share: num 11.3 9.1 7.3 6.3 5.4 4.2 4 4 3.5 3.1 ...
unique(fifa_wc_audience)
## X country confederation population_share
## 1 1 United States CONCACAF 4.5
## 2 2 Japan AFC 1.9
## 3 3 China AFC 19.5
## 4 4 Germany UEFA 1.2
## 5 5 Brazil CONMEBOL 2.8
## 6 6 United Kingdom UEFA 0.9
## 7 7 Italy UEFA 0.9
## 8 8 France UEFA 0.9
## 9 9 Russia UEFA 2.1
## 10 10 Spain UEFA 0.7
## 11 11 South Korea AFC 0.7
## 12 12 Indonesia AFC 3.5
## 13 13 Mexico CONCACAF 1.7
## 14 14 Turkey UEFA 1.1
## 15 15 Thailand AFC 1.0
## 16 16 Argentina CONMEBOL 0.6
## 17 17 Netherlands UEFA 0.2
## 18 18 Poland UEFA 0.6
## 19 19 Saudi Arabia AFC 0.4
## 20 20 Taiwan AFC 0.3
## 21 21 Canada CONCACAF 0.5
## 22 22 Colombia CONMEBOL 0.7
## 23 23 Venezuela CONMEBOL 0.4
## 24 24 South Africa CAF 0.7
## 25 25 Malaysia AFC 0.4
## 26 26 Switzerland UEFA 0.1
## 27 27 Nigeria CAF 2.3
## 28 28 Belgium UEFA 0.2
## 29 29 Sweden UEFA 0.1
## 30 30 Vietnam AFC 1.3
## 31 31 Iran AFC 1.1
## 32 32 Chile CONMEBOL 0.3
## 33 33 Romania UEFA 0.3
## 34 34 Austria UEFA 0.1
## 35 35 Singapore AFC 0.1
## 36 36 Australia AFC 0.3
## 37 37 Greece UEFA 0.2
## 38 38 Portugal UEFA 0.2
## 39 39 India AFC 17.6
## 40 40 Czech Republic UEFA 0.2
## 41 41 Egypt CAF 1.1
## 42 42 Denmark UEFA 0.1
## 43 43 Norway UEFA 0.1
## 44 44 Peru CONMEBOL 0.4
## 45 45 Ukraine UEFA 0.7
## 46 46 Iraq AFC 0.5
## 47 47 Hungary UEFA 0.1
## 48 48 Kazakhstan UEFA 0.2
## 49 49 Finland UEFA 0.1
## 50 50 Ireland UEFA 0.1
## 51 51 Algeria CAF 0.5
## 52 52 Cuba CONCACAF 0.2
## 53 53 Ecuador CONMEBOL 0.2
## 54 54 Slovakia UEFA 0.1
## 55 55 Qatar AFC 0.0
## 56 56 Kuwait AFC 0.0
## 57 57 Bulgaria UEFA 0.1
## 58 58 Serbia UEFA 0.1
## 59 59 Belarus UEFA 0.1
## 60 60 Hong Kong AFC 0.1
## 61 61 Croatia UEFA 0.1
## 62 62 Oman AFC 0.0
## 63 63 Dominica Republic CONCACAF 0.1
## 64 64 Azerbaijan UEFA 0.1
## 65 65 New Zealand OFC 0.1
## 66 66 Lithuania UEFA 0.0
## 67 67 Slovenia UEFA 0.0
## 68 68 Uruguay CONMEBOL 0.0
## 69 69 Costa Rica CONCACAF 0.1
## 70 70 Uzbekistan AFC 0.4
## 71 71 Yemen CAF 0.3
## 72 72 Israel UEFA 0.1
## 73 73 El Salvador CONCACAF 0.1
## 74 74 Syria AFC 0.3
## 75 75 Pakistan AFC 2.5
## 76 76 Guatemala CONCACAF 0.2
## 77 77 Paraguay CONMEBOL 0.1
## 78 78 Panama CONCACAF 0.1
## 79 79 Bosnia-Herzegovina UEFA 0.1
## 80 80 Cambodia AFC 0.2
## 81 81 Ivory Coast CAF 0.3
## 82 82 Macau AFC 0.0
## 83 83 Latvia UEFA 0.0
## 84 84 Lebanon AFC 0.1
## 85 85 Jordan AFC 0.1
## 86 86 Honduras CONCACAF 0.1
## 87 87 Brunei AFC 0.0
## 88 88 Albania UEFA 0.0
## 89 89 Turkmenistan AFC 0.1
## 90 90 Angola CAF 0.3
## 91 91 Estonia UEFA 0.0
## 92 92 Bahrain AFC 0.0
## 93 93 Nepal AFC 0.4
## 94 94 Cyprus UEFA 0.0
## 95 95 Ghana CAF 0.4
## 96 96 Mauritius CAF 0.0
## 97 97 Macedonia UEFA 0.0
## 98 98 Kenya CAF 0.6
## 99 99 Trinidad &Tobago CONCACAF 0.0
## 100 100 Philippines AFC 1.4
## 101 101 Bolivia CONMEBOL 0.1
## 102 102 Laos AFC 0.1
## 103 103 Armenia UEFA 0.0
## 104 104 Nicaragua CONCACAF 0.1
## 105 105 Afghanistan AFC 0.4
## 106 106 Kosovo UEFA 0.0
## 107 107 Cameroon CAF 0.3
## 108 108 Senegal CAF 0.2
## 109 109 Jamaica CONCACAF 0.0
## 110 110 Sri Lanka AFC 0.3
## 111 111 Myanmar AFC 0.8
## 112 112 Moldova UEFA 0.1
## 113 113 Malta UEFA 0.0
## 114 114 Bangladesh AFC 2.2
## 115 115 Zambia CAF 0.2
## 116 116 Morocco CAF 0.5
## 117 117 North Korea AFC 0.4
## 118 118 Botswana CAF 0.0
## 119 119 Tajikistan AFC 0.1
## 120 120 Iceland UEFA 0.0
## 121 121 Uganda CAF 0.5
## 122 122 Libya CAF 0.1
## 123 123 Palestine AFC 0.1
## 124 124 Kyrgyzstan AFC 0.1
## 125 125 Mongolia AFC 0.0
## 126 126 Montenegro UEFA 0.0
## 127 127 Tanzania CAF 0.7
## 128 128 Georgia UEFA 0.1
## 129 129 Suriname CONCACAF 0.0
## 130 130 Sudan CAF 0.5
## 131 131 Gabon CAF 0.0
## 132 132 Madagascar CAF 0.3
## 133 133 Tunisia AFC 0.2
## 134 134 Ethiopia CAF 1.3
## 135 135 Zimbabwe CAF 0.2
## 136 136 Namibia CAF 0.0
## 137 137 Bahamas CONCACAF 0.0
## 138 138 Papua New Guinea OFC 0.1
## 139 139 Guyana CONCACAF 0.0
## 140 140 Turks & Caicos CONCACAF 0.0
## 141 141 Congo DR CAF 0.9
## 142 142 Burkina Faso CAF 0.2
## 143 143 Guinea CAF 0.2
## 144 144 Haiti CONCACAF 0.1
## 145 145 Fiji OFC 0.0
## 146 146 Barbados CONCACAF 0.0
## 147 147 Mali CAF 0.2
## 148 148 Bermuda CONCACAF 0.0
## 149 149 St. Maarten CONCACAF 0.0
## 150 150 Equatorial Guinea CAF 0.0
## 151 151 Mozambique CAF 0.3
## 152 152 Seychelles CAF 0.0
## 153 153 Cape Verde CAF 0.0
## 154 154 Benin CAF 0.1
## 155 155 Swaziland CAF 0.0
## 156 156 Cayman Islands CONCACAF 0.0
## 157 157 Aruba CONCACAF 0.0
## 158 158 Maldives AFC 0.0
## 159 159 Niger CAF 0.2
## 160 160 Curacao CONCACAF 0.0
## 161 161 St. Lucia CONCACAF 0.0
## 162 162 Togo CAF 0.1
## 163 163 Burundi CAF 0.1
## 164 164 Congo, Rep. CAF 0.1
## 165 165 Antigua & Barbuda CONCACAF 0.0
## 166 166 Chad CAF 0.2
## 167 167 Eritrea CAF 0.1
## 168 168 Grenada CONCACAF 0.0
## 169 169 Lesotho CAF 0.0
## 170 170 St. Vincent CONCACAF 0.0
## 171 171 Malawi CAF 0.2
## 172 172 Sierra Leone CAF 0.1
## 173 173 Mauritania CAF 0.1
## 174 174 Solomon Islands OFC 0.0
## 175 175 Dominica CONCACAF 0.0
## 176 176 Timor AFC 0.0
## 177 177 St. Kitts CONCACAF 0.0
## 178 178 Rwanda CAF 0.2
## 179 179 Somalia CAF 0.1
## 180 180 Gambia CAF 0.0
## 181 181 Samoa OFC 0.0
## 182 182 Guinea-Bissau CAF 0.0
## 183 183 Central African Republic CAF 0.1
## 184 184 Vanuatu OFC 0.0
## 185 185 American Samoa OFC 0.0
## 186 186 Cook Islands OFC 0.0
## 187 187 Tonga OFC 0.0
## 188 188 Liberia CAF 0.1
## 189 189 Palau OFC 0.0
## 190 190 Nauru OFC 0.0
## 191 191 Niue OFC 0.0
## tv_audience_share gdp_weighted_share
## 1 4.3 11.3
## 2 4.9 9.1
## 3 14.8 7.3
## 4 2.9 6.3
## 5 7.1 5.4
## 6 2.1 4.2
## 7 2.1 4.0
## 8 2.0 4.0
## 9 3.1 3.5
## 10 1.8 3.1
## 11 1.8 3.0
## 12 6.7 2.9
## 13 3.2 2.6
## 14 2.3 2.0
## 15 2.4 1.6
## 16 1.5 1.6
## 17 0.6 1.5
## 18 1.2 1.3
## 19 0.5 1.2
## 20 0.5 1.0
## 21 0.5 1.0
## 22 1.6 0.9
## 23 1.0 0.9
## 24 1.3 0.8
## 25 0.7 0.7
## 26 0.3 0.7
## 27 2.6 0.7
## 28 0.3 0.7
## 29 0.3 0.7
## 30 2.6 0.6
## 31 0.7 0.6
## 32 0.6 0.6
## 33 0.7 0.6
## 34 0.3 0.6
## 35 0.2 0.6
## 36 0.3 0.5
## 37 0.3 0.5
## 38 0.4 0.5
## 39 2.0 0.5
## 40 0.3 0.5
## 41 0.8 0.5
## 42 0.2 0.5
## 43 0.1 0.4
## 44 0.8 0.4
## 45 0.9 0.4
## 46 0.5 0.4
## 47 0.3 0.4
## 48 0.3 0.3
## 49 0.2 0.3
## 50 0.1 0.3
## 51 0.4 0.3
## 52 0.3 0.3
## 53 0.5 0.3
## 54 0.2 0.3
## 55 0.0 0.2
## 56 0.1 0.2
## 57 0.2 0.2
## 58 0.3 0.2
## 59 0.2 0.2
## 60 0.1 0.2
## 61 0.1 0.1
## 62 0.0 0.1
## 63 0.2 0.1
## 64 0.1 0.1
## 65 0.1 0.1
## 66 0.1 0.1
## 67 0.1 0.1
## 68 0.1 0.1
## 69 0.2 0.1
## 70 0.5 0.1
## 71 0.4 0.1
## 72 0.1 0.1
## 73 0.2 0.1
## 74 0.3 0.1
## 75 0.4 0.1
## 76 0.2 0.1
## 77 0.2 0.1
## 78 0.1 0.1
## 79 0.1 0.1
## 80 0.5 0.1
## 81 0.4 0.1
## 82 0.0 0.1
## 83 0.1 0.1
## 84 0.1 0.1
## 85 0.1 0.1
## 86 0.3 0.1
## 87 0.0 0.1
## 88 0.1 0.1
## 89 0.1 0.1
## 90 0.1 0.1
## 91 0.0 0.0
## 92 0.0 0.0
## 93 0.4 0.0
## 94 0.0 0.0
## 95 0.2 0.0
## 96 0.0 0.0
## 97 0.1 0.0
## 98 0.3 0.0
## 99 0.0 0.0
## 100 0.1 0.0
## 101 0.1 0.0
## 102 0.2 0.0
## 103 0.1 0.0
## 104 0.1 0.0
## 105 0.3 0.0
## 106 0.1 0.0
## 107 0.2 0.0
## 108 0.2 0.0
## 109 0.1 0.0
## 110 0.1 0.0
## 111 0.1 0.0
## 112 0.1 0.0
## 113 0.0 0.0
## 114 0.1 0.0
## 115 0.1 0.0
## 116 0.1 0.0
## 117 0.2 0.0
## 118 0.0 0.0
## 119 0.1 0.0
## 120 0.0 0.0
## 121 0.2 0.0
## 122 0.0 0.0
## 123 0.1 0.0
## 124 0.1 0.0
## 125 0.0 0.0
## 126 0.0 0.0
## 127 0.1 0.0
## 128 0.0 0.0
## 129 0.0 0.0
## 130 0.1 0.0
## 131 0.0 0.0
## 132 0.1 0.0
## 133 0.0 0.0
## 134 0.2 0.0
## 135 0.1 0.0
## 136 0.0 0.0
## 137 0.0 0.0
## 138 0.1 0.0
## 139 0.0 0.0
## 140 0.0 0.0
## 141 0.2 0.0
## 142 0.1 0.0
## 143 0.1 0.0
## 144 0.1 0.0
## 145 0.0 0.0
## 146 0.0 0.0
## 147 0.0 0.0
## 148 0.0 0.0
## 149 0.0 0.0
## 150 0.0 0.0
## 151 0.1 0.0
## 152 0.0 0.0
## 153 0.0 0.0
## 154 0.0 0.0
## 155 0.0 0.0
## 156 0.0 0.0
## 157 0.0 0.0
## 158 0.0 0.0
## 159 0.1 0.0
## 160 0.0 0.0
## 161 0.0 0.0
## 162 0.0 0.0
## 163 0.1 0.0
## 164 0.0 0.0
## 165 0.0 0.0
## 166 0.0 0.0
## 167 0.0 0.0
## 168 0.0 0.0
## 169 0.0 0.0
## 170 0.0 0.0
## 171 0.0 0.0
## 172 0.0 0.0
## 173 0.0 0.0
## 174 0.0 0.0
## 175 0.0 0.0
## 176 0.0 0.0
## 177 0.0 0.0
## 178 0.0 0.0
## 179 0.0 0.0
## 180 0.0 0.0
## 181 0.0 0.0
## 182 0.0 0.0
## 183 0.0 0.0
## 184 0.0 0.0
## 185 0.0 0.0
## 186 0.0 0.0
## 187 0.0 0.0
## 188 0.0 0.0
## 189 0.0 0.0
## 190 0.0 0.0
## 191 0.0 0.0
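Explanation: Each function gives a different view of the dataset, which helps with further analysis. summary() shows the range and center of every numeric column, str() confirms that there are 191 observations of 6 variables and shows the type of each column, and unique() returns all 191 rows, which tells us that no row is duplicated.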
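To make the behaviour of unique() easier to see, here is a minimal sketch on a small made-up data frame. The name toy_audience and its values are hypothetical and are not taken from fifa_wc_audience.

```r
# Hypothetical toy data: the third row repeats the first
toy_audience <- data.frame(
  country = c("Japan", "Brazil", "Japan"),
  confederation = c("AFC", "CONMEBOL", "AFC"),
  stringsAsFactors = FALSE
)

# unique() on a data frame keeps one copy of each distinct row
distinct_rows <- unique(toy_audience)

# unique() on a single column lists each distinct value once
confederations <- unique(toy_audience$confederation)

print(distinct_rows)
print(confederations)
```

Applied to a single column such as confederation, unique() is a quick way to list the categories that appear in the data.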
Demonstrate each of the main dplyr verbs in
Questions 2-7. In your examples, these may be used alone or together
with other functions, but your definition and explanation must focus
specifically on the function in the prompt.
select to manipulate a dataframe
Package: dplyr
Definition: select() keeps only the columns you name and drops the rest of the columns from the data set.
Code example:
country_tv_share<-fifa_wc_audience%>%
select(country,confederation,tv_audience_share)
head(country_tv_share)
## country confederation tv_audience_share
## 1 United States CONCACAF 4.3
## 2 Japan AFC 4.9
## 3 China AFC 14.8
## 4 Germany UEFA 2.9
## 5 Brazil CONMEBOL 7.1
## 6 United Kingdom UEFA 2.1
Explanation: When we run the code, R selects only the country, confederation, and tv_audience_share columns and gives me a new dataset, country_tv_share, with only those columns.
arrange to manipulate a dataframe
Package: dplyr
Definition: The arrange function sorts the rows of your data set by the values of a particular column, in low-to-high order by default or in high-to-low order with desc().
Code example:
fifa_wc_audience1<-fifa_wc_audience%>%
arrange(desc(gdp_weighted_share))
head(fifa_wc_audience1)
## X country confederation population_share tv_audience_share
## 1 1 United States CONCACAF 4.5 4.3
## 2 2 Japan AFC 1.9 4.9
## 3 3 China AFC 19.5 14.8
## 4 4 Germany UEFA 1.2 2.9
## 5 5 Brazil CONMEBOL 2.8 7.1
## 6 6 United Kingdom UEFA 0.9 2.1
## gdp_weighted_share
## 1 11.3
## 2 9.1
## 3 7.3
## 4 6.3
## 5 5.4
## 6 4.2
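Explanation: When we run the code, the rows are arranged from the highest to the lowest gdp_weighted_share. To arrange from high to low we use arrange(desc()), where desc stands for descending; arrange() on its own sorts in ascending order. The preview looks the same as the original head() output because the dataset was already stored in this order.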
filter to manipulate a dataframe
Package: dplyr
Definition: The filter function keeps only the rows that satisfy the conditions you give it. We can add multiple conditions separated by commas, in which case a row is kept only if all of them are true.
Code example:
fifa_wc_audience2<-fifa_wc_audience%>%
filter(population_share>2.0,gdp_weighted_share>1.0)
head(fifa_wc_audience2)
## X country confederation population_share tv_audience_share
## 1 1 United States CONCACAF 4.5 4.3
## 2 3 China AFC 19.5 14.8
## 3 5 Brazil CONMEBOL 2.8 7.1
## 4 9 Russia UEFA 2.1 3.1
## 5 12 Indonesia AFC 3.5 6.7
## gdp_weighted_share
## 1 11.3
## 2 7.3
## 3 5.4
## 4 3.5
## 5 2.9
Explanation: Here I only wanted to know about the countries whose gdp_weighted_share was higher than 1.0 and whose population_share was higher than 2.0. Five countries meet both conditions.
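The difference between combining conditions with a comma (both must hold) and with | (either may hold) can be sketched on a small made-up data frame. The name toy and its three rows are hypothetical; only the column names are borrowed from fifa_wc_audience.

```r
library(dplyr)

# Hypothetical toy data with the same column names as the example above
toy <- data.frame(
  country = c("A", "B", "C"),
  population_share = c(4.5, 0.9, 2.8),
  gdp_weighted_share = c(11.3, 4.2, 0.5),
  stringsAsFactors = FALSE
)

# Comma: a row is kept only if BOTH conditions are true
both <- toy %>%
  filter(population_share > 2.0, gdp_weighted_share > 1.0)

# Vertical bar: a row is kept if EITHER condition is true
either <- toy %>%
  filter(population_share > 2.0 | gdp_weighted_share > 1.0)

print(both)
print(either)
```

In this sketch only country A passes both conditions, while all three countries pass at least one.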
mutate to manipulate a dataframe
Package: dplyr
Definition: mutate() changes the columns of a dataset while keeping its rows. We can add new columns as functions of existing variables, modify existing columns, and delete columns by setting them to NULL.
Code example:
fifa_wc_audience3<-fifa_wc_audience%>%
mutate(popular_football_country=ifelse(tv_audience_share>=1.5,1,0))
head(fifa_wc_audience3)
## X country confederation population_share tv_audience_share
## 1 1 United States CONCACAF 4.5 4.3
## 2 2 Japan AFC 1.9 4.9
## 3 3 China AFC 19.5 14.8
## 4 4 Germany UEFA 1.2 2.9
## 5 5 Brazil CONMEBOL 2.8 7.1
## 6 6 United Kingdom UEFA 0.9 2.1
## gdp_weighted_share popular_football_country
## 1 11.3 1
## 2 9.1 1
## 3 7.3 1
## 4 6.3 1
## 5 5.4 1
## 6 4.2 1
Explanation: Here I chose to add another column that tells me whether football is particularly popular in a country. The new column shows 1 if the country has a TV audience share greater than or equal to 1.5 and 0 otherwise.
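The ifelse() call inside mutate() works element by element. A minimal sketch with three made-up values (the vector tv_share is hypothetical) shows that the cutoff of 1.5 is inclusive:

```r
# Hypothetical TV audience shares: below, exactly at, and above the cutoff
tv_share <- c(0.5, 1.5, 4.3)

# ifelse(test, yes, no) checks each element in turn
popular <- ifelse(tv_share >= 1.5, 1, 0)
popular
```

Because the test uses >=, a share of exactly 1.5 is coded as 1.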
summarize to manipulate a dataframe
Package: dplyr
Definition: The summarize function (also spelled summarise) creates a new data frame that has one row for each combination of grouping variables, or a single row if the data is not grouped.
Code example:
fifa_wc_audience4<-fifa_wc_audience%>%
group_by(confederation)%>%
summarise(mean(gdp_weighted_share))
head(fifa_wc_audience4)
## # A tibble: 6 × 2
## confederation `mean(gdp_weighted_share)`
## <chr> <dbl>
## 1 AFC 0.735
## 2 CAF 0.052
## 3 CONCACAF 0.527
## 4 CONMEBOL 1.03
## 5 OFC 0.00833
## 6 UEFA 0.848
Explanation: When we run the code, a whole new dataset is created in which each observation is a particular confederation. For each confederation we summarized the mean of gdp_weighted_share and showed it as a separate column.
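The summary column above ends up with the awkward name `mean(gdp_weighted_share)`. One way to avoid that, sketched here on made-up data, is to name the summary inside summarise(). The names toy_audience and mean_gdp_share are hypothetical.

```r
library(dplyr)

# Hypothetical toy data standing in for fifa_wc_audience
toy_audience <- data.frame(
  confederation = c("AFC", "AFC", "UEFA", "UEFA"),
  gdp_weighted_share = c(1.0, 3.0, 2.0, 6.0),
  stringsAsFactors = FALSE
)

# Naming the summary column avoids a backtick-quoted header
toy_summary <- toy_audience %>%
  group_by(confederation) %>%
  summarise(mean_gdp_share = mean(gdp_weighted_share))

print(toy_summary)
```

A named column is easier to refer to in later steps, for example in arrange() or ggplot().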
group_by to manipulate a dataframe
Package: dplyr
Definition: It is a function used to group data according to one or multiple variables. Using group_by() each time splits the data into subsets on the basis of the variables in the grouping. After grouping data, you can then summarize or transform within each group independently of another group.
Code example:
fifa_wc_audience5<-fifa_wc_audience%>%
group_by(confederation)%>%
summarise(mean(gdp_weighted_share),
mean(tv_audience_share),
mean(population_share))
head(fifa_wc_audience5)
## # A tibble: 6 × 4
## confederation `mean(gdp_weighted_share)` `mean(tv_audience_share)`
## <chr> <dbl> <dbl>
## 1 AFC 0.735 0.991
## 2 CAF 0.052 0.172
## 3 CONCACAF 0.527 0.327
## 4 CONMEBOL 1.03 1.35
## 5 OFC 0.00833 0.0167
## 6 UEFA 0.848 0.548
## # ℹ 1 more variable: `mean(population_share)` <dbl>
Explanation: When we run the code, a whole new dataset is created that has been grouped by confederation, which means each observation is now a different confederation rather than a different country.
Describe and demonstrate at least two different ways of handling NA values that might appear in a data frame or in a column of a data frame that you are trying to work with.
function_names
Package: base (the na.rm argument and is.na), dplyr (group_by and summarise)
Definition: NA values are values that are missing from the dataset. NA stands for Not Available. NA values can interfere with calculations, since many functions return NA when any input is missing, so it is important to handle them, for example by excluding them.
Code example 1:
mean(us_voter_turnout$votes)
## [1] NA
mean(us_voter_turnout$votes,na.rm=TRUE)
## [1] 3074280
Code example 2:
us_voter_turnout_by_year<-us_voter_turnout%>%
group_by(year)%>%
summarise(
count=sum(!is.na(votes)),
mean_voter_ratio=mean(votes/eligible_voters,na.rm=TRUE))
Explanation: The two code examples above show two different ways to exclude NA values. In the first, the mean of votes is NA when the missing values are left in, but after we add na.rm=TRUE the missing values are excluded and the mean has a value. In the second, I used dplyr to group by year and count the states whose data was present in that year. The states whose data was missing show NA in votes. I wanted to count only the states whose data was present, so I used !is.na to count the non-missing values. Using is.na on its own would count only the NA values.
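Two further options, dropping incomplete rows and replacing NA with a chosen value, can be sketched on a made-up data frame. The name toy_turnout and its values are hypothetical, and filling with 0 is only an illustration, not a recommendation for the voter data.

```r
# Hypothetical toy data with one missing vote count
toy_turnout <- data.frame(
  state = c("A", "B", "C"),
  votes = c(100, NA, 300),
  stringsAsFactors = FALSE
)

# Option 1: drop every row that has an NA in any column
complete_rows <- na.omit(toy_turnout)

# Option 2: keep all rows but replace NA with a chosen value (here 0)
filled <- toy_turnout
filled$votes[is.na(filled$votes)] <- 0

mean(complete_rows$votes)  # mean of the two observed values
mean(filled$votes)         # the filled-in 0 pulls the mean down
```

The two options give different means, so the choice of how to handle NA values should be stated alongside any result.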
ifelse
Package:
Definition:
Code example:
Explanation:
rbind
Package:
Definition:
Code example:
Explanation:
merge and/or
joinPackage:
Definition:
Code example:
Explanation:
stringr functions to
manipulate text datafunction_name
Package:
Definition:
Code example:
Explanation:
function_name
Package:
Definition:
Code example:
Explanation:
function
Package:
Definition:
Code example:
Explanation:
**All visuals should use functions from the ggplot2
and/or ggmap libraries
Package: ggplot2
Definition: A histogram is a data visualization which divides the x axis into bins and counts the number of observations in each bin and displays it with bars
Code example:
ggplot(fifa_wc_audience,aes(x=gdp_weighted_share))+
geom_histogram(color="black",fill="skyblue")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Explanation: After running the code we get a histogram of the GDP-weighted shares of the countries. From the histogram we see that the majority of the countries listed in the data contribute about 0-1% of the entire world's GDP.
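The counting step behind a histogram can be sketched in base R with cut() and table(). The six values and the bin edges below are made up for illustration; geom_histogram() chooses its own bins unless binwidth or bins is set.

```r
# Hypothetical values and bin edges, chosen so the counts are easy to check
x <- c(0.1, 0.2, 0.4, 1.1, 1.3, 2.7)

# cut() assigns each value to a bin; table() counts the values per bin
bins <- cut(x, breaks = c(0, 1, 2, 3))
bin_counts <- table(bins)
bin_counts
```

Each count corresponds to the height of one bar in the histogram.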
Package: ggplot2
Definition: A frequency polygon is used like a histogram, but instead of displaying the frequency with bars it displays the frequency with connected straight lines.
Code example:
ggplot(tdd_results,aes(x=age))+
geom_freqpoly(binwidth=2, color="blue") +
labs(title = "Frequency Polygon of Age", x = "Age", y = "Frequency")
Explanation: When we run the code we get a polygon that shows the frequency of the ages of people taking part in a donut race in Ohio. It suits this type of variable because age is numeric and can be divided into bins, here 2 years wide.
Package: ggplot2
Definition: A boxplot compactly displays the distribution of a continuous variable. It visualizes five summary statistics (the median, two hinges and two whiskers), and all “outlying” points individually.
Code example:
ggplot(fifa_wc_audience,aes(x=confederation,y=tv_audience_share))+
geom_boxplot()
Explanation: The box plot shows that UEFA and CONMEBOL have the highest and most varied TV audience shares, with UEFA having the largest number of outliers. CONMEBOL displays a more consistent range of values, though still with a few high outliers. In contrast, AFC, CAF, CONCACAF, and OFC have much lower audience shares, with fewer outliers and less variability. This suggests UEFA and CONMEBOL teams generally attract larger TV audiences compared to other confederations.
Package: ggplot2
Definition: A scatterplot is most useful for displaying the relationship between two continuous variables. It can be used to compare one continuous and one categorical variable, or two categorical variables, although other plot types are usually better suited to those cases.
Code example:
ggplot(fifa_wc_audience,aes(x=gdp_weighted_share,y=tv_audience_share))+
geom_point()+
geom_smooth(method="lm", linetype=1)
Explanation: With the above graph I wanted to see whether a country with a higher gdp would generate a larger TV audience share. To do that I plotted the independent variable on the X axis and the dependent one on the Y axis. From the graph we can see a weak positive correlation between the two variables.
Package: ggplot2
Definition: A line plot is a kind of chart that joins data points with lines to show trends in data over time or categories. The x-axis shows the time or categories the measurement occurred in, and the y-axis shows the value. It’s helpful for finding patterns such as rises or falls, and it works effectively to compare many data sets on one graph.
Code example:
us_voter_turnout_by_year<-us_voter_turnout%>%
group_by(year)%>%
summarise(
mean_votes=mean(votes,na.rm=TRUE)
)
ggplot(us_voter_turnout_by_year,aes(x=year,y=mean_votes))+
geom_line()+
labs(title="Distribution of Votes by Year",
x="Year",
y="Votes")
Explanation: If we wanted to see how the mean number of votes has increased over time then we can run a line plot wherein we can see how many votes there were for each congress session. Note: The graph has dips every two years since each session lasts two years.
function_name
Package: ggplot2
Definition: A bar plot is a graph used to display the frequencies of a non-continuous (categorical) variable. It is similar to a histogram in that it uses bars to display the counts, but a histogram shows frequencies for a continuous variable.
Code example:
ggplot(fifa_wc_audience,aes(x=confederation))+
geom_bar(position = "dodge", alpha = 0.8, color="black",fill="blue") +
labs(title = "Bar Plot of No.of Countries in Confederations",
x = "Confederation", y = "Count")
Explanation: Here I wanted to find the number of countries in each confederation. Confederation, being a non-continuous categorical variable, fits on the x axis of the bar plot. We can easily conclude that CAF has the highest number of countries while CONMEBOL has the lowest number of countries in the data.
function_name
Package:
Definition:
Code example:
Explanation:
kable, xtable, or pander**
function_name
Package:
Definition:
Code example:
Explanation:
mean, median, max, min, interquartile range, standard deviation
You may choose to use one or more functions or code statements to demonstrate and explain
Package: dplyr (summarise), with base and stats functions (mean, median, max, min, IQR, sd)
Definition: These tools allow for the calculation of key summary statistics, providing insights into the distribution and characteristics of a dataset. Here, it demonstrates how to calculate the mean, median, maximum, minimum, interquartile range, and standard deviation using the dplyr package
Code example:
summary_stats <- fifa_wc_audience %>%
summarise(
Mean = mean(population_share,na.rm=TRUE),
Median = median(population_share),
Maximum = max(population_share),
Minimum = min(population_share),
Interquartile_Range = IQR(population_share),
Standard_Deviation = sd(population_share)
)
print(summary_stats)
## Mean Median Maximum Minimum Interquartile_Range Standard_Deviation
## 1 0.5225131 0.1 19.5 0 0.35 1.960335
Explanation: In this code, summarise calculates key summary statistics for the ‘population_share’ column in the dataset. The summarise function computes multiple summary statistics in a single step.
mean(population_share) calculates the mean (average) value of the ‘population_share’ column.
median(population_share) computes the median, which represents the middle value when the data is sorted in ascending order.
max(population_share) calculates the maximum value in the ‘population_share’ column.
min(population_share) computes the minimum value in the ‘population_share’ column.
IQR(population_share) calculates the interquartile range, which is the range between the 25th and 75th percentiles of the data.
sd(population_share) calculates the standard deviation, which measures the dispersion of data points around the mean.
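The interquartile range can also be checked by hand. In the summary() output earlier, population_share has a 3rd quartile of 0.35 and a 1st quartile of 0.00, and 0.35 - 0.00 = 0.35 matches the Interquartile_Range above. The sketch below repeats the same check on a small made-up vector (x is hypothetical).

```r
# Hypothetical small vector so the arithmetic is easy to follow
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)

q1 <- quantile(x, 0.25)  # 25th percentile
q3 <- quantile(x, 0.75)  # 75th percentile

# IQR() is the 75th percentile minus the 25th percentile
iqr_by_hand <- unname(q3 - q1)
iqr_by_hand
IQR(x)
```

Both lines should print the same number, since IQR() computes the same difference of quantiles.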
Package: stats
Definition: T-test is one category of hypothesis tests, it compares the means of two groups to determine if they are statistically different. A one-sample t-test compares the mean of a single sample to a known or hypothesized population mean to see if there is a significant difference between them
Code example:
t.test(fifa_wc_audience$population_share,mu=0.5)
##
## One Sample t-test
##
## data: fifa_wc_audience$population_share
## t = 0.15872, df = 190, p-value = 0.8741
## alternative hypothesis: true mean is not equal to 0.5
## 95 percent confidence interval:
## 0.2427202 0.8023060
## sample estimates:
## mean of x
## 0.5225131
Explanation: The one-sample t-test was conducted on the ‘population_share’ column from the fifa_wc_audience dataset. The hypothesis being tested is whether the mean population share of the dataset is significantly different from 0.5 (the hypothesized mean). Here, the p-value (0.87) is high, which means we cannot reject our null hypothesis of mu = 0.5.
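The t statistic can be reproduced from the summary statistics reported earlier in this document. This is a sketch of the standard one-sample formula; the variable names are my own, and the inputs are the rounded values printed above, so the results should agree with the t.test() output only up to rounding.

```r
# Values reported earlier in this document for population_share
sample_mean <- 0.5225131  # mean
sample_sd   <- 1.960335   # standard deviation
n           <- 191        # number of countries
mu0         <- 0.5        # hypothesized mean

# One-sample t statistic: (mean - mu0) / (sd / sqrt(n))
t_stat <- (sample_mean - mu0) / (sample_sd / sqrt(n))

# Two-sided p-value from the t distribution with n - 1 degrees of freedom
p_value <- 2 * pt(-abs(t_stat), df = n - 1)

round(t_stat, 5)
round(p_value, 4)
```

The sample mean is only a small fraction of a standard error away from 0.5, which is why the p-value is so large.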
Package: stats
Definition: The two sample t-test compares the means of two independent samples to see if there is a significant difference between them.
Code example:
fifa_wc_audience_americas<-fifa_wc_audience%>%
filter(confederation=="CONCACAF"|confederation=="CONMEBOL")
t.test(fifa_wc_audience_americas$tv_audience_share~fifa_wc_audience_americas$confederation)
##
## Welch Two Sample t-test
##
## data: fifa_wc_audience_americas$tv_audience_share by fifa_wc_audience_americas$confederation
## t = -1.4978, df = 10.267, p-value = 0.1643
## alternative hypothesis: true difference in means between group CONCACAF and group CONMEBOL is not equal to 0
## 95 percent confidence interval:
## -2.5402968 0.4936302
## sample estimates:
## mean in group CONCACAF mean in group CONMEBOL
## 0.3266667 1.3500000
Explanation: Here we are trying to see whether the difference in mean tv_audience_share between the two confederations, CONCACAF and CONMEBOL, is 0 or not. Our p-value is 0.16, which is higher than the usual alpha of 0.05, and thus we do not have evidence to reject the null hypothesis that the mean difference is 0.
cor and cor.test
Package: stats
Definition: cor() calculates the correlation coefficient between two variables. It uses the linear (Pearson) coefficient by default, and the method argument also allows the rank-based Spearman and Kendall coefficients. cor.test() evaluates the statistical significance of the correlation observed in the data.
Code example:
cor(fifa_wc_audience$population_share, fifa_wc_audience$tv_audience_share)
## [1] 0.7313239
cor.test(fifa_wc_audience$population_share, fifa_wc_audience$tv_audience_share)
##
## Pearson's product-moment correlation
##
## data: fifa_wc_audience$population_share and fifa_wc_audience$tv_audience_share
## t = 14.741, df = 189, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6576280 0.7911553
## sample estimates:
## cor
## 0.7313239
Explanation: Both functions tell us the correlation between the two variables in the dataset. cor on its own only returns the correlation coefficient (0.73). cor.test gives us the same coefficient together with the t statistic, the p-value, and a 95 percent confidence interval. cor.test is the more informative one since it allows us to see whether our obtained correlation is statistically significant or not; here the p-value is below 2.2e-16, so it is.
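The t statistic that cor.test() reports for Pearson's correlation can be recovered from the coefficient and the sample size. A minimal sketch, using the rounded values printed above (the variable names are my own):

```r
r <- 0.7313239  # correlation reported by cor()
n <- 191        # number of countries, so df = n - 2 = 189

# Test statistic for Pearson's r: r * sqrt(n - 2) / sqrt(1 - r^2)
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
round(t_stat, 3)
```

The result should agree with the t value in the cor.test() output up to rounding.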
lm()
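Package: stats
Definition: lm() fits a linear model. With one predictor it is used to identify whether two variables are related to each other, for example to check whether a change in variable x is associated with a change in variable y, by estimating the intercept and slope of the best-fitting line.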
Code example:
model1<-lm(tv_audience_share~population_share,fifa_wc_audience)
summary(model1)
##
## Call:
## lm(formula = tv_audience_share ~ population_share, data = fifa_wc_audience)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.7579 -0.2405 -0.1946 -0.0486 5.3454
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.24048 0.07424 3.239 0.00142 **
## population_share 0.54076 0.03668 14.741 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9912 on 189 degrees of freedom
## Multiple R-squared: 0.5348, Adjusted R-squared: 0.5324
## F-statistic: 217.3 on 1 and 189 DF, p-value: < 2.2e-16
Explanation: From the result above we identify that if a graph was to be made with these variables, the slope of the fitted line would be 0.54. This signifies that there is a positive relationship between the variables: if population_share increases by 1, the predicted tv_audience_share increases by 0.54. The slope measures the size of the effect rather than its strength; the R-squared of 0.53 shows that population_share explains about 53% of the variation in tv_audience_share. The relationship is statistically significant as the p-value is very low. Note: The null hypothesis for a linear regression test is that there is no relationship between the two variables, i.e. the slope is 0.
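The fitted line can be used for prediction by plugging the reported coefficients into the equation of a line. This is a sketch using the rounded coefficients from the summary(model1) output; the function name predict_tv_share is hypothetical.

```r
# Coefficients copied from the summary(model1) output
intercept <- 0.24048
slope     <- 0.54076

# Predicted tv_audience_share for a given population_share
predict_tv_share <- function(population_share) {
  intercept + slope * population_share
}

pred_at_1    <- predict_tv_share(1)                        # population share of 1
slope_effect <- predict_tv_share(2) - predict_tv_share(1)  # one-unit increase

# With a single predictor, R-squared equals the squared Pearson correlation
r_squared <- 0.7313239^2

pred_at_1
slope_effect
round(r_squared, 4)
```

With the fitted model object available, predict(model1, newdata = data.frame(population_share = 1)) gives the same kind of prediction without copying the coefficients by hand.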
lm()
Package: stats
Definition: It estimates the coefficients for a linear model with multiple predictor variables and one outcome variable.
Code example:
model2<-lm(tv_audience_share~population_share+gdp_weighted_share,fifa_wc_audience)
summary(model2)
##
## Call:
## lm(formula = tv_audience_share ~ population_share + gdp_weighted_share,
## data = fifa_wc_audience)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.6141 -0.0850 -0.0493 0.0150 3.8005
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.04928 0.05275 0.934 0.351
## population_share 0.35733 0.02822 12.661 <2e-16 ***
## gdp_weighted_share 0.55157 0.03796 14.532 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6821 on 188 degrees of freedom
## Multiple R-squared: 0.7809, Adjusted R-squared: 0.7786
## F-statistic: 335.1 on 2 and 188 DF, p-value: < 2.2e-16
Explanation: Now we are looking at what effect gdp_weighted_share has on tv_audience_share, in addition to the effect of population_share which we tested in #27. The important thing to note is that the coefficient estimate for each variable applies only if the other one is held constant. This means that the 0.357 estimate for population_share is the expected change in tv_audience_share for a one-unit increase in population_share when gdp_weighted_share is held constant.
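The idea of holding the other variable constant can be sketched by plugging the reported coefficients into the regression equation. The coefficients are the rounded values from the summary(model2) output; the function name predict_tv_share2 and the input values 2 and 3 are hypothetical.

```r
# Coefficients copied from the summary(model2) output
b0    <- 0.04928
b_pop <- 0.35733
b_gdp <- 0.55157

predict_tv_share2 <- function(population_share, gdp_weighted_share) {
  b0 + b_pop * population_share + b_gdp * gdp_weighted_share
}

# Raise population_share from 2 to 3 while gdp_weighted_share stays at 2
change <- predict_tv_share2(3, 2) - predict_tv_share2(2, 2)
change
```

The change in the prediction equals the population_share coefficient, whatever fixed value gdp_weighted_share takes.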
ggpairs
Package: GGally
Definition: Multicollinearity occurs when two or more predictor variables in a regression model are highly correlated with each other, which can lead to issues with the model’s stability and interpretation.
Code example:
ggpairs(fifa_wc_audience[, c("tv_audience_share", "population_share", "gdp_weighted_share")])
Explanation: This uses the function ggpairs() provided by the GGally library to create a pair plot to visually explore associations between the variables tv_audience_share, population_share, and gdp_weighted_share in the fifa_wc_audience data frame. The pair plot shows the scatterplots for each pair of variables, the distributions of individual variables, and their correlation coefficients, which gives an overview of how each variable relates to the others. This plot helps in the identification of multicollinearity, a situation where independent variables are highly interrelated. Any pair of predictor variables showing a very high correlation in this pair plot could suggest the existence of multicollinearity; such variables would carry overlapping information in regression models. Multicollinearity is worth detecting and addressing because it can lead to unstable estimates of the regression coefficients.
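The same check can be done numerically with a correlation matrix. The sketch below uses simulated predictors, not the FIFA data, and the cutoff of 0.8 is an illustrative assumption rather than a fixed rule.

```r
set.seed(1)  # reproducible simulated data (hypothetical, not the FIFA data)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.3)  # built to be strongly correlated with x1
x3 <- rnorm(100)                 # generated independently of x1 and x2

predictors <- data.frame(x1, x2, x3)
cor_matrix <- cor(predictors)
round(cor_matrix, 2)

# Flag each pair (counted once) whose absolute correlation exceeds the cutoff
high <- which(abs(cor_matrix) > 0.8 & upper.tri(cor_matrix), arr.ind = TRUE)
high
```

Only the x1 and x2 pair should be flagged, which is the kind of pair that would call for a closer look before both go into the same regression.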
gg_diagnose or
autoplot
Package: ggfortify
Definition: A group of 4 graphs that validate our regression model and help us check the assumptions of a linear regression
Code example:
model1<-lm(tv_audience_share~population_share,fifa_wc_audience)
summary(model1)
##
## Call:
## lm(formula = tv_audience_share ~ population_share, data = fifa_wc_audience)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.7579 -0.2405 -0.1946 -0.0486 5.3454
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.24048 0.07424 3.239 0.00142 **
## population_share 0.54076 0.03668 14.741 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9912 on 189 degrees of freedom
## Multiple R-squared: 0.5348, Adjusted R-squared: 0.5324
## F-statistic: 217.3 on 1 and 189 DF, p-value: < 2.2e-16
autoplot(model1)
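Explanation: Autoplot gives us 4 graphs to check our regression. In the first plot (Residuals vs Fitted) we want to see a random scatter around zero without any specific patterns or “fan shapes”, i.e. the residuals should not start fanning out after a certain point. The second plot (Q-Q plot) checks the same thing as a histogram of residuals, which is whether the residuals are normally distributed. The points should roughly follow the line; deviations suggest non-normality, especially at the tails. The other two plots (Scale-Location and Residuals vs Leverage) check whether the spread of the residuals stays constant and whether any single observation has a large influence on the fit.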
as.data.frame, geom_histogram
Package: base (as.data.frame), ggplot2 (geom_histogram)
Definition: A residual is the difference between the actual value and the value predicted by the model. One of the assumptions of a linear regression is that the residuals are normally distributed, which we can check by seeing whether a histogram of the residuals looks roughly bell-shaped.
Code example:
resid<-as.data.frame(model1$residuals)
colnames(resid)<-"residuals"
ggplot(resid, aes(residuals))+
geom_histogram(binwidth = 0.6, fill="red",
color="black", alpha=0.7)+
labs(title="Histogram of Residuals",
x="Residuals",
y="Frequency")
Explanation: The code first pulls the residuals (the differences between actual and predicted values) out of the model and converts them into a data frame for easier manipulation. It then uses ggplot2 to plot a histogram of the residuals, mapping the residuals column to the x-axis. The histogram displays the distribution of the residuals, which is useful for checking whether they are approximately normally distributed, a fundamental assumption of linear regression models.
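As a quick check of what the residuals actually are, the sketch below rebuilds them by hand as actual minus fitted values and compares the result with residuals(). It uses the built-in mtcars data as a stand-in (an assumption, not the FIFA dataset).

```r
# Illustrative sketch on the built-in mtcars data (not the FIFA dataset).
fit <- lm(mpg ~ wt, data = mtcars)

# Residual = actual value minus the value the model predicts
manual_resid <- mtcars$mpg - fitted(fit)

# Should agree with the residuals stored in the model
print(all.equal(unname(manual_resid), unname(residuals(fit))))

# With an intercept in the model, the residuals average out to about zero
print(mean(residuals(fit)))
```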
predict
Package: stats
Definition: Function for predictions from the results of various model fitting functions.
Code example:
predicted_values <- predict(model1, newdata = fifa_wc_audience)
predicted_values
## 1 2 3 4 5 6 7
## 2.6739205 1.2679327 10.7853883 0.8893975 1.7546208 0.7271682 0.7271682
## 8 9 10 11 12 13 14
## 0.7271682 1.3760856 0.6190153 0.6190153 2.1331559 1.1597798 0.8353211
## 15 16 17 18 19 20 21
## 0.7812446 0.5649388 0.3486330 0.5649388 0.4567859 0.4027095 0.5108624
## 22 23 24 25 26 27 28
## 0.6190153 0.4567859 0.6190153 0.4567859 0.2945566 1.4842385 0.3486330
## 29 30 31 32 33 34 35
## 0.2945566 0.9434740 0.8353211 0.4027095 0.4027095 0.2945566 0.2945566
## 36 37 38 39 40 41 42
## 0.4027095 0.3486330 0.3486330 9.7579357 0.3486330 0.8353211 0.2945566
## 43 44 45 46 47 48 49
## 0.2945566 0.4567859 0.6190153 0.5108624 0.2945566 0.3486330 0.2945566
## 50 51 52 53 54 55 56
## 0.2945566 0.5108624 0.3486330 0.3486330 0.2945566 0.2404801 0.2404801
## 57 58 59 60 61 62 63
## 0.2945566 0.2945566 0.2945566 0.2945566 0.2945566 0.2404801 0.2945566
## 64 65 66 67 68 69 70
## 0.2945566 0.2945566 0.2404801 0.2404801 0.2404801 0.2945566 0.4567859
## 71 72 73 74 75 76 77
## 0.4027095 0.2945566 0.2945566 0.4027095 1.5923914 0.3486330 0.2945566
## 78 79 80 81 82 83 84
## 0.2945566 0.2945566 0.3486330 0.4027095 0.2404801 0.2404801 0.2945566
## 85 86 87 88 89 90 91
## 0.2945566 0.2945566 0.2404801 0.2404801 0.2945566 0.4027095 0.2404801
## 92 93 94 95 96 97 98
## 0.2404801 0.4567859 0.2404801 0.4567859 0.2404801 0.2404801 0.5649388
## 99 100 101 102 103 104 105
## 0.2404801 0.9975504 0.2945566 0.2945566 0.2404801 0.2945566 0.4567859
## 106 107 108 109 110 111 112
## 0.2404801 0.4027095 0.3486330 0.2404801 0.4027095 0.6730917 0.2945566
## 113 114 115 116 117 118 119
## 0.2404801 1.4301621 0.3486330 0.5108624 0.4567859 0.2404801 0.2945566
## 120 121 122 123 124 125 126
## 0.2404801 0.5108624 0.2945566 0.2945566 0.2945566 0.2404801 0.2404801
## 127 128 129 130 131 132 133
## 0.6190153 0.2945566 0.2404801 0.5108624 0.2404801 0.4027095 0.3486330
## 134 135 136 137 138 139 140
## 0.9434740 0.3486330 0.2404801 0.2404801 0.2945566 0.2404801 0.2404801
## 141 142 143 144 145 146 147
## 0.7271682 0.3486330 0.3486330 0.2945566 0.2404801 0.2404801 0.3486330
## 148 149 150 151 152 153 154
## 0.2404801 0.2404801 0.2404801 0.4027095 0.2404801 0.2404801 0.2945566
## 155 156 157 158 159 160 161
## 0.2404801 0.2404801 0.2404801 0.2404801 0.3486330 0.2404801 0.2404801
## 162 163 164 165 166 167 168
## 0.2945566 0.2945566 0.2945566 0.2404801 0.3486330 0.2945566 0.2404801
## 169 170 171 172 173 174 175
## 0.2404801 0.2404801 0.3486330 0.2945566 0.2945566 0.2404801 0.2404801
## 176 177 178 179 180 181 182
## 0.2404801 0.2404801 0.3486330 0.2945566 0.2404801 0.2404801 0.2404801
## 183 184 185 186 187 188 189
## 0.2945566 0.2404801 0.2404801 0.2404801 0.2404801 0.2945566 0.2404801
## 190 191
## 0.2404801 0.2404801
Explanation: The whole numbers from 1 to 191 in the output are the row indices (observation numbers) of the data passed to predict(). The value printed under each index is the model’s predicted tv audience share for that observation. The indices let us link each prediction back to its row in the original dataset, which helps when checking the model’s performance or looking at predictions for individual data points.
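predict() can also be used on observations the model has not seen, by passing them through the newdata argument (if newdata is left out, predict() returns the fitted values for the original data). The sketch below uses the built-in mtcars data as a stand-in (an assumption, not the FIFA dataset), and the two weights in new_cars are made up for illustration. It also recomputes the predictions by hand as intercept + slope * x to show where the numbers come from.

```r
# Illustrative sketch on the built-in mtcars data (not the FIFA dataset).
fit <- lm(mpg ~ wt, data = mtcars)

# Two hypothetical new observations; the column name must match the predictor
new_cars <- data.frame(wt = c(2.5, 3.5))
pred <- predict(fit, newdata = new_cars)

# Same predictions computed by hand: intercept + slope * x
manual <- coef(fit)["(Intercept)"] + coef(fit)["wt"] * new_cars$wt

print(pred)
print(manual)
```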
confint
Package: stats
Definition: Computes confidence intervals for one or more parameters in a fitted model.
Code example:
model1<-lm(tv_audience_share~population_share,fifa_wc_audience)
summary(model1)
##
## Call:
## lm(formula = tv_audience_share ~ population_share, data = fifa_wc_audience)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.7579 -0.2405 -0.1946 -0.0486 5.3454
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.24048 0.07424 3.239 0.00142 **
## population_share 0.54076 0.03668 14.741 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9912 on 189 degrees of freedom
## Multiple R-squared: 0.5348, Adjusted R-squared: 0.5324
## F-statistic: 217.3 on 1 and 189 DF, p-value: < 2.2e-16
confint(model1, level=0.95)
## 2.5 % 97.5 %
## (Intercept) 0.09403357 0.3869266
## population_share 0.46840276 0.6131263
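Explanation: The confint() function calculates the 95% confidence intervals for the regression coefficients (intercept and slope). Each interval gives a range in which we can be 95% confident that the true coefficient value lies. Here the interval for the population_share slope, about 0.468 to 0.613, does not include zero, which agrees with the very small p-value in the summary.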
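For a linear model, the interval is the estimate plus or minus a critical t value times the standard error. The sketch below rebuilds the intervals by hand and prints them next to the confint() output. It uses the built-in mtcars data as a stand-in (an assumption, not the FIFA dataset).

```r
# Illustrative sketch on the built-in mtcars data (not the FIFA dataset).
fit <- lm(mpg ~ wt, data = mtcars)
coefs <- summary(fit)$coefficients

# Critical t value for a 95% interval with the model's residual df
t_crit <- qt(0.975, df = df.residual(fit))

# estimate +/- t_crit * standard error
manual_ci <- cbind(
  lower = coefs[, "Estimate"] - t_crit * coefs[, "Std. Error"],
  upper = coefs[, "Estimate"] + t_crit * coefs[, "Std. Error"]
)

print(manual_ci)
print(confint(fit, level = 0.95))
```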
stepAIC and
modelaic$anova output
Package:
Definition:
Code example:
Explanation: