0.0.1 First things first.

Install and load all required packages.

# Install the packages once (uncomment any you still need), then load them.
# install.packages("devtools")
# install.packages("tidyverse")
# install.packages("kableExtra")
# install.packages("visreg")
# install.packages("stargazer")
# install.packages("ggrepel")
# install.packages("gridExtra")
# install.packages("fBasics")
# install.packages("DescTools")
# install.packages("ggmosaic")
# install.packages("descr")   # the package name must be quoted
library(tidyverse)
library(kableExtra)
library(visreg)
library(stargazer)
library(ggrepel)
library(gridExtra)
library(fBasics)
library(DescTools)
library(ggmosaic)
library(descr)
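
As an optional convenience, the install step can be made conditional so that packages already on your machine are not reinstalled. This is only a minimal sketch: the helper name missing_packages is hypothetical (it is not part of any package), and the install line is left commented out so nothing is downloaded by accident.

```r
# Packages used in this tutorial (same list as above).
pkgs <- c("tidyverse", "kableExtra", "visreg", "stargazer", "ggrepel",
          "gridExtra", "fBasics", "DescTools", "ggmosaic", "descr")

# Hypothetical helper: return the packages that are not yet installed.
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

to_install <- missing_packages(pkgs)
print(to_install)

# Uncomment to install only what is missing:
# if (length(to_install) > 0) install.packages(to_install)
```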

0.0.2 Exploring the data

Descriptive summary statistics help us generate useful descriptions of our dataset, identify data anomalies, and frame our questions and hypotheses. The aim of this tutorial/quiz is to help you become familiar with useful R functions and packages for descriptive analysis.

Let’s create a subset of the state data. The variable state is the state’s name, region indicates the region, trumpwin records whether Donald Trump won the state’s electoral college in the 2016 race, percwom is the average woman’s salary relative to men, and inc records per capita income. Note: remember to install and load your packages.

states<-read_csv(file.choose())
New names:
Rows: 50 Columns: 45
── Column specification ─────────────────────────
Delimiter: ","
chr  (3): state, st, region
dbl (42): ...1, raperate, murderrate, abort, ...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
stdat <- head(dplyr::select(states, state, region, trumpwin, percwom, inc))
head(stdat)

And a subset of the world data:

world<-read_csv(file.choose())
New names:
Rows: 182 Columns: 45
── Column specification ─────────────────────────
Delimiter: ","
chr  (8): iso3c, region, country, dpicode, ac...
dbl (36): ...1, fdi, nourish, aid, oil, homic...
lgl  (1): lifeexp
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
wdat <- world %>% select(country, regime,aclpregion, fdi, fhliberties, inf,nourish, health, gdppc, turnout, womyear)
head(wdat)

#### Tables

We are going to use the kbl() function from the kableExtra package (a wrapper around knitr's kable()) to create good-looking tables. Take a moment to explore the package documentation, either through RStudio help (type ?kableExtra) or at this link: https://haozhu233.github.io/kableExtra/awesome_table_in_html.html

?kableExtra

Let’s use it here to create a formatted table of the dataframe “wdat” we just created with a subset of the world data. The package includes six ready-to-use themes. Let’s try kable_minimal() and add it with a pipe operator, as in this example. Since we are not using any summary stats in our table, we can use the head() function to print just the first rows of wdat.


head(wdat) %>% kbl()%>% 
  kable_minimal()
| country | regime | aclpregion | fdi | fhliberties | inf | nourish | health | gdppc | turnout | womyear |
|---|---|---|---|---|---|---|---|---|---|---|
| Afghanistan | Civilian Dictatorship | Eastern Europe/Soviet Union | 0.3400968 | 6 | 75.1 | 24.7 | 9.197723 | 1629.167 | 45.83 | NA |
| Angola | Civilian Dictatorship | Sub-Saharan Africa | -3.9131508 | 5 | 109.6 | 20.7 | 3.391146 | 6360.849 | 62.77 | 1975 |
| Albania | Parliamentary Democracy | Eastern Europe/Soviet Union | 9.1340709 | 3 | 14.8 | NA | 5.335035 | 9646.582 | 53.31 | 1920 |
| United Arab Emirates | Royal Dictatorship | Oil States | 3.0752631 | 5 | 7.3 | 5.0 | 3.929452 | 56245.478 | NA | NA |
| Argentina | Presidential Democracy | Latin America | 2.6751617 | 2 | 13.0 | 5.0 | 6.550156 | 18333.995 | 81.07 | 1947 |
| Armenia | Mixed Democracy | Eastern Europe/Soviet Union | 5.7160378 | 4 | 16.1 | 6.5 | 4.562263 | 6376.268 | 62.87 | 1921 |

Now let’s add a caption to our table.

#now let's add a caption to our table
head(wdat) %>%   kbl(caption = "Table a: World Data")%>% 
        kable_minimal()
Table a: World Data
| country | regime | aclpregion | fdi | fhliberties | inf | nourish | health | gdppc | turnout | womyear |
|---|---|---|---|---|---|---|---|---|---|---|
| Afghanistan | Civilian Dictatorship | Eastern Europe/Soviet Union | 0.3400968 | 6 | 75.1 | 24.7 | 9.197723 | 1629.167 | 45.83 | NA |
| Angola | Civilian Dictatorship | Sub-Saharan Africa | -3.9131508 | 5 | 109.6 | 20.7 | 3.391146 | 6360.849 | 62.77 | 1975 |
| Albania | Parliamentary Democracy | Eastern Europe/Soviet Union | 9.1340709 | 3 | 14.8 | NA | 5.335035 | 9646.582 | 53.31 | 1920 |
| United Arab Emirates | Royal Dictatorship | Oil States | 3.0752631 | 5 | 7.3 | 5.0 | 3.929452 | 56245.478 | NA | NA |
| Argentina | Presidential Democracy | Latin America | 2.6751617 | 2 | 13.0 | 5.0 | 6.550156 | 18333.995 | 81.07 | 1947 |
| Armenia | Mixed Democracy | Eastern Europe/Soviet Union | 5.7160378 | 4 | 16.1 | 6.5 | 4.562263 | 6376.268 | 62.87 | 1921 |

Now let’s try another theme, this time kable_classic(). [Note: other themes to consider: kable_paper(), kable_classic_2(), kable_minimal(), kable_material().]

#now let's try another 
head(wdat) %>%   kbl(caption = "Table a: World Data")%>% 
        kable_classic()
Table a: World Data
| country | regime | aclpregion | fdi | fhliberties | inf | nourish | health | gdppc | turnout | womyear |
|---|---|---|---|---|---|---|---|---|---|---|
| Afghanistan | Civilian Dictatorship | Eastern Europe/Soviet Union | 0.3400968 | 6 | 75.1 | 24.7 | 9.197723 | 1629.167 | 45.83 | NA |
| Angola | Civilian Dictatorship | Sub-Saharan Africa | -3.9131508 | 5 | 109.6 | 20.7 | 3.391146 | 6360.849 | 62.77 | 1975 |
| Albania | Parliamentary Democracy | Eastern Europe/Soviet Union | 9.1340709 | 3 | 14.8 | NA | 5.335035 | 9646.582 | 53.31 | 1920 |
| United Arab Emirates | Royal Dictatorship | Oil States | 3.0752631 | 5 | 7.3 | 5.0 | 3.929452 | 56245.478 | NA | NA |
| Argentina | Presidential Democracy | Latin America | 2.6751617 | 2 | 13.0 | 5.0 | 6.550156 | 18333.995 | 81.07 | 1947 |
| Armenia | Mixed Democracy | Eastern Europe/Soviet Union | 5.7160378 | 4 | 16.1 | 6.5 | 4.562263 | 6376.268 | 62.87 | 1921 |

If we set full_width to false (full_width = F), the table no longer stretches across the full width of the page. Notice the change:

#now let's try another 
head(wdat) %>%   kbl(caption = "Table a: World Data")%>% 
        kable_classic( full_width = F)
Table a: World Data
| country | regime | aclpregion | fdi | fhliberties | inf | nourish | health | gdppc | turnout | womyear |
|---|---|---|---|---|---|---|---|---|---|---|
| Afghanistan | Civilian Dictatorship | Eastern Europe/Soviet Union | 0.3400968 | 6 | 75.1 | 24.7 | 9.197723 | 1629.167 | 45.83 | NA |
| Angola | Civilian Dictatorship | Sub-Saharan Africa | -3.9131508 | 5 | 109.6 | 20.7 | 3.391146 | 6360.849 | 62.77 | 1975 |
| Albania | Parliamentary Democracy | Eastern Europe/Soviet Union | 9.1340709 | 3 | 14.8 | NA | 5.335035 | 9646.582 | 53.31 | 1920 |
| United Arab Emirates | Royal Dictatorship | Oil States | 3.0752631 | 5 | 7.3 | 5.0 | 3.929452 | 56245.478 | NA | NA |
| Argentina | Presidential Democracy | Latin America | 2.6751617 | 2 | 13.0 | 5.0 | 6.550156 | 18333.995 | 81.07 | 1947 |
| Armenia | Mixed Democracy | Eastern Europe/Soviet Union | 5.7160378 | 4 | 16.1 | 6.5 | 4.562263 | 6376.268 | 62.87 | 1921 |

Now let’s specify “hover” and after running the chunk hover your mouse over the table.

#now let's try another 
head(wdat) %>%   kbl(caption = "Table a: World Data")%>% 
        kable_classic("hover", full_width = F) 
Table a: World Data
| country | regime | aclpregion | fdi | fhliberties | inf | nourish | health | gdppc | turnout | womyear |
|---|---|---|---|---|---|---|---|---|---|---|
| Afghanistan | Civilian Dictatorship | Eastern Europe/Soviet Union | 0.3400968 | 6 | 75.1 | 24.7 | 9.197723 | 1629.167 | 45.83 | NA |
| Angola | Civilian Dictatorship | Sub-Saharan Africa | -3.9131508 | 5 | 109.6 | 20.7 | 3.391146 | 6360.849 | 62.77 | 1975 |
| Albania | Parliamentary Democracy | Eastern Europe/Soviet Union | 9.1340709 | 3 | 14.8 | NA | 5.335035 | 9646.582 | 53.31 | 1920 |
| United Arab Emirates | Royal Dictatorship | Oil States | 3.0752631 | 5 | 7.3 | 5.0 | 3.929452 | 56245.478 | NA | NA |
| Argentina | Presidential Democracy | Latin America | 2.6751617 | 2 | 13.0 | 5.0 | 6.550156 | 18333.995 | 81.07 | 1947 |
| Armenia | Mixed Democracy | Eastern Europe/Soviet Union | 5.7160378 | 4 | 16.1 | 6.5 | 4.562263 | 6376.268 | 62.87 | 1921 |

We can also specify a more informative name for our columns. Let’s try changing the labels using col.names:

# change the labels for our columns (stdat holds state and region, not country and regime)

stdat %>%   kbl(caption = "Table a: States Data",
                  col.names = c("State", "Region", "Trump Won", "Percent Women", "Income"))%>% 
        kable_classic("hover", full_width = F)
Table a: States Data
| State | Region | Trump Won | Percent Women | Income |
|---|---|---|---|---|
| Alabama | South | 1 | 75.49 | 34650 |
| Alaska | West | 1 | 79.02 | 45529 |
| Arizona | West | 1 | 84.00 | 35875 |
| Arkansas | South | 1 | 88.52 | 34014 |
| California | West | 0 | 89.94 | 44481 |
| Colorado | West | 0 | 79.57 | 44088 |

#### Graphs

Let’s create a simple ggplot graph of the variable regime. A bar plot shows the frequency for the variable; that is, the y-axis shows the count of observations in our dataset for each category of regime.

 world %>% ggplot( aes(regime))+
geom_bar()

We can assign the graph to an object, let’s call it wplot1. And remember we need to add the object in the chunk if we want R to “print” it to our screen. And as a bonus, we can use the option fill to color our bars.

wplot1<-  world %>% ggplot( aes(regime) )+
geom_bar(fill="lightblue")

wplot1

Now let’s add a bar plot of foreign direct investment [fdi] (net foreign investment inflows as a percentage of GDP) against regime type. We first drop observations with missing fdi, then group the data by the variable regime, create a new variable m that holds the mean of fdi for each regime, ungroup the data again, and produce our graph of mean fdi by regime. Remember that these steps can be strung together in the tidyverse with our pipe operator [%>%].


wdat %>%                                  # name of the dataset
  filter(!is.na(fdi)) %>%                 # drop rows with missing fdi
  group_by(regime) %>%                    # grouping the data by type of regime
  summarize(m = mean(fdi)) %>%            # calculating the mean of fdi
  ungroup() %>%                           # ungroup the data again
  ggplot( aes(x=regime, y=m))+            # set up the graph
geom_col()                                # set geometry of the graph
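
To see why the pipeline drops missing values before averaging, here is a minimal base R sketch of the same group-then-average step. The data frame toy and its values are made up for illustration and are not taken from the world dataset.

```r
# Toy data (hypothetical values).
toy <- data.frame(
  regime = c("A", "A", "B", "B", "B"),
  fdi    = c(2, 4, NA, 3, 9)
)

# mean() returns NA for any group that contains a missing value ...
raw_means <- tapply(toy$fdi, toy$regime, mean)
print(raw_means)

# ... so we drop the missing rows first, as filter(!is.na(fdi)) does.
complete <- toy[!is.na(toy$fdi), ]
group_means <- tapply(complete$fdi, complete$regime, mean)
print(group_means)
```

Group B only gets a usable mean once its missing value is removed.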

Now let’s add the fill option to the aesthetics of our plot so that each category of regime gets its own color.


wdat %>%                                  # name of the dataset
  filter(!is.na(fdi)) %>%                 # drop rows with missing fdi
  group_by(regime) %>%                    # grouping the data by type of regime
  summarize(m = mean(fdi)) %>%            # calculating the mean of fdi
  ungroup() %>%                           # ungroup the data again
  ggplot( aes(x=regime, y=m, fill=regime))+            # set up the graph
geom_col()                                # set geometry of the graph

Let’s see if our plot is more readable when we rotate our plot, changing the orientation of the bars with the coord_flip function.

wdat %>%                                  # name of the dataset
  filter(!is.na(fdi)) %>%                 # drop rows with missing fdi
  group_by(regime) %>%                    # grouping the data by type of regime
  summarize(m = mean(fdi)) %>%            # calculating the mean of fdi
  ungroup() %>%                           # ungroup the data again
  ggplot( aes(x=regime, y=m,fill=regime)) +   # set up the graph
geom_col() +                              # set geometry of the graph
coord_flip()                               #flip the orientation of the bar 

Lastly, our categories can be reordered so Royal dictatorship is not right next to presidential democracy. We first need to define a new variable reg2 that reorders the levels (categories) of the regime variable.


wdat <- wdat %>% mutate(reg2 = factor(regime,
                  levels = c("Royal Dictatorship",
                              "Military Dictatorship",
                              "Civilian Dictatorship",
                              "Mixed Democracy",
                              "Parliamentary Democracy",
                              "Presidential Democracy")))
wdat %>%                                  # name of the dataset
  filter(!is.na(fdi)) %>%                 # drop rows with missing fdi
  group_by(reg2) %>%                    # grouping the data by type of regime
  summarize(m = mean(fdi)) %>%            # calculating the mean of fdi
  ungroup() %>%                           # ungroup the data again
  ggplot( aes(x=reg2, y=m,fill=reg2)) +   # set up the graph
geom_col() +                              # set geometry of the graph
coord_flip()                               #flip the orientation of the bar 
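
The reordering relies on how factor() treats its levels argument. The short sketch below uses a made-up three-element vector (not the world data) to show the default alphabetical ordering, the custom ordering, and one thing to watch for: any value not listed in levels becomes NA.

```r
# Hypothetical values for illustration only.
regime <- c("Mixed Democracy", "Royal Dictatorship", "Mixed Democracy")

# By default, levels are sorted alphabetically.
default_levels <- levels(factor(regime))
print(default_levels)

# Supplying levels fixes the order used by plots and tables.
reordered <- factor(regime,
                    levels = c("Royal Dictatorship", "Mixed Democracy"))
print(levels(reordered))
print(as.integer(reordered))   # position of each value in the new order

# A value missing from (or misspelled in) levels is turned into NA.
dropped <- factor("Military Dictatorship",
                  levels = c("Royal Dictatorship", "Mixed Democracy"))
print(is.na(dropped))
```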

Now let’s add a ready-to-use theme: theme_minimal(). We also limit the axis to the range 0 to 10 so the bars fit better, and of course we add more informative labels for our variables.


wdat <- wdat %>% mutate(reg2 = factor(regime,
                  levels = c("Royal Dictatorship",
                              "Military Dictatorship",
                              "Civilian Dictatorship",
                              "Mixed Democracy",
                              "Parliamentary Democracy",
                              "Presidential Democracy")))
wdat %>%                                  # name of the dataset
  filter(!is.na(fdi)) %>%                 # drop rows with missing fdi
  group_by(reg2) %>%                    # grouping the data by type of regime
  summarize(m = mean(fdi)) %>%            # calculating the mean of fdi
  ungroup() %>%                           # ungroup the data again
  ggplot( aes(x=reg2, y=m,fill=reg2)) +   # set up the graph
geom_col() +                              # set geometry of the graph
coord_flip(ylim=c(0,10)) +  
  theme_minimal()+
  ylab("Foreign direct investment") +
  xlab("Regime") 

We need to know the shape of the data first to determine what summaries and models to use. When we look at the distribution of our variables we can learn whether all cases are evenly distributed or whether values cluster closer to the minimum, median, or maximum.

Let’s use two examples of our world dataset. First we will explore the histogram of the variable turnout. And let’s add another layer with a density plot to see the smooth density line. For a great explanation of kernel density I recommend this link: https://mathisonian.github.io/kde/


world %>% ggplot(aes(x = turnout)) + 
  geom_histogram(aes(y = after_stat(density)),
                 colour = 1, fill = "white") +
  geom_density() + 
xlab("Voter Turnout") +
ggtitle("Voter Turnout Is Normally Distributed") 

We can see that voter turnout appears approximately normally distributed. Now let’s look at the shape of infant mortality in the world (the number of deaths per 1,000 live births in each country), using the variable inf.


world %>% ggplot(aes(x = inf)) + 
  geom_histogram(aes(y = after_stat(density)),
                 colour = 1, fill = "white") +
  geom_density() + 
xlab("Infant Mortality Rate") +
ggtitle(" Infant Mortality Is Not Normally Distributed") 

We can see from the histogram that many countries have low infant mortality and, unfortunately, some countries have 30, 60, and even 90 deaths per 1,000 live births.
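
A quick numeric companion to these histograms is to compare the mean and the median: they sit close together for a roughly symmetric variable, while a long right tail pulls the mean above the median. The two vectors below are invented for illustration and are not drawn from the world dataset.

```r
# Hypothetical values: one roughly symmetric, one with a long right tail.
symmetric <- c(40, 50, 60, 70, 80)
skewed    <- c(3, 4, 5, 6, 7, 30, 90)

# Symmetric case: mean and median coincide.
print(c(mean = mean(symmetric), median = median(symmetric)))

# Right-skewed case: a few large values drag the mean upward.
print(c(mean = mean(skewed), median = median(skewed)))
```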

0.0.2.1 Five views of univariate data

0.0.2.1.1 Frequency Table

Use a frequency table when the variable is categorical. The frequency table indicates how many cases reside in each category, giving the category’s relative size. Let’s use the freq() command to get a frequency table of party identification (pid3) in the National Election Studies (nes) dataset, and see how many Democrats, Republicans, and Independents are included. First, let’s turn off the accompanying plot (the descr.plot option) by setting it to FALSE.

nes<-read_csv(file.choose())
New names:
Rows: 1178 Columns: 52
── Column specification ─────────────────────────
Delimiter: ","
chr (35): follow, turnout12, vote12, meet, ma...
dbl (17): ...1, birthyr, ftobama, ftblack, ft...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
freq(nes$pid3, plot = FALSE)   # descr::freq(); plot = FALSE prints the table only
head(nes)
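
For intuition about what a frequency table contains, the counts and percentages can also be computed with base R's table() and prop.table(). This is a minimal sketch on a made-up vector, not the nes data.

```r
# Hypothetical party labels for six respondents.
party <- c("Democrat", "Republican", "Democrat",
           "Independent", "Democrat", "Independent")

counts <- table(party)                       # number of cases per category
percent <- round(100 * prop.table(counts), 1) # share of each category

print(counts)
print(percent)
```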

Now, if we set the plot option to TRUE, we get the frequency table and a bar plot of the frequencies.

freq(nes$pid3, plot = TRUE)

Frequency tables help us get a good sense of the categories in a variable so we can decide whether to combine or exclude some of them. For instance, if we look at a frequency table of the variable employ from the NES data set, we can see that there are nine categories and that “temporarily laid off” has only four respondents, so we might want to consider combining it with the “other” category.

freq(nes$employ, plot = FALSE)

0.0.2.1.2 Bar Plot

Bar plots are useful to identify categories with the highest frequency and are an easy way to make comparisons across categories. Let’s try creating a bar plot of our employment variable.

ggplot(nes, aes(employ)) +
  geom_bar() 

Now, let’s use additional options to improve our plot. We will add a theme, add a title for the plot [ggtitle], add a label for the y-axis, and remove the x-axis label by setting it to "". We will also use [coord_flip] to change the orientation of the plot, and we will change the color with the fill option. Here we are using “lightblue”, but feel free to play with other colors [https://r-graph-gallery.com/ggplot2-color.html].

ggplot(nes, aes(employ)) +
  geom_bar(fill = "lightblue") +
  theme_minimal() +
  ggtitle("Barplot of Employment Variable") +
  xlab("") +
  ylab("Number of Respondents") +
  coord_flip()

We can see that a large share of the population is employed full-time.

0.0.2.1.3 Boxplot (or Box-and-Whisker Plot)

Boxplots are useful to plot the distribution of a continuous variable. Let’s use a boxplot to look at the percentage of the voting age population that voted in each country’s last national election. Since we only want to plot one variable, we need to specify x = "" (an empty string). The y variable will be our “turnout” variable from the world dataset.

ggplot(world, aes(x="", turnout)) +
  geom_boxplot() 

We can see the median (the thick line in the middle of the box) is a bit above 60% and the interquartile range (the box) is between 55% and 80%. Now, let’s format the plot a bit more.

ggplot(world, aes(x="", turnout)) +
  geom_boxplot(col="darkgreen") +
  theme_minimal() +
  ggtitle("Boxplot of Turnout") +
  ylab("Percent Voting in Last Election") +
  xlab("")
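
The median and the box edges read off a boxplot are simply the quartiles of the variable, so they can be checked numerically. Below is a minimal sketch with nine made-up turnout values (not the world data), using base R's quantile() and IQR().

```r
# Hypothetical turnout percentages for nine countries.
turnout <- c(40, 50, 55, 60, 65, 70, 80, 85, 90)

# First quartile, median, and third quartile: the bottom, middle line,
# and top of the box.
q <- quantile(turnout, probs = c(0.25, 0.50, 0.75))
print(q)

# The interquartile range is the height of the box.
print(IQR(turnout))
```

With these values the box would run from 55 to 80, with the median line at 65.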

Remember that boxplots don’t typically show individual observations. We can add them with a geom_point layer and “jitter” the points so that they don’t lie directly on top of each other. We control the size of the points with the [size] option, the fill color with the [fill] option, and the outline color with the [colour] option.


ggplot(world, aes(x="", turnout)) +
  geom_boxplot(col="darkgreen") +
  theme_minimal() +
  ggtitle("Boxplot of Turnout") +
  ylab("Percent Voting in Last Election") +
  xlab("") +
    geom_point(fill = "grey", size=1,
             shape=21, colour="grey",
             position=position_jitter(seed = 1))

As a bonus trick, we can add a geom_text_repel layer that uses a conditional to show labels only for Canada and the US. We use the ifelse() function to say that if the country code is Canada (CAN) or the US (USA), then label the point with the dpicode variable; otherwise leave the label blank. We specify “blank” with two single quotation marks with nothing in between ('').


ggplot(world, aes(x="", turnout)) +
  geom_boxplot(col="darkgreen") +
  theme_minimal() +
  ggtitle("Boxplot of Turnout") +
  ylab("Percent Voting in Last Election") +
  xlab("") +
    geom_point(fill = "grey", size=1,
             shape=21, colour="grey",
             position=position_jitter(seed = 1)) +
     geom_text_repel(aes(label=ifelse(dpicode=="CAN" | dpicode=="USA", as.character(dpicode),'')), col ="grey",
                  position = position_jitter(seed = 1),
                  hjust=1, vjust=2)      
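
The labeling condition can be tried on its own before putting it inside a plot. This small sketch applies the same ifelse() logic to a made-up vector of country codes (the codes shown are illustrative, not read from the world dataset).

```r
# Hypothetical country codes.
codes <- c("CAN", "MEX", "USA", "BRA")

# Keep the code for Canada and the US; use an empty string for the rest.
labels <- ifelse(codes == "CAN" | codes == "USA", codes, "")
print(labels)
```

Only the two matching codes survive, so only those two points would be labeled.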



0.0.2.1.4 Histogram

We use histograms to visualize the frequency distribution of continuous variables. They show the shape of the variable’s distribution and how dispersed its values are. Each bar represents a range of values (a bin), and we can choose the number of bins that makes the most sense for our data. Let’s create a histogram of the years since last regime change for all countries in the World dataset using the variable [durable].

ggplot(world, aes(durable)) +
  geom_histogram() 

Now let’s change the number of bins to 10 with the option [bins], and format the histogram by setting the color, the theme, and labels for the variable and the plot.

ggplot(world, aes(durable)) +
  geom_histogram(bins=10, fill = "lightblue") +
  ggtitle("Histogram of the durable Variable") +
  xlab("Years Since Last Significant Regime Change") +
  theme_minimal() 

This is a much more informative histogram, so it’s important to think about the choice of bins. For instance, if we were to use only three bins, the resulting histogram would not be as useful:

ggplot(world, aes(durable)) +
  geom_histogram(bins=3, fill = "lightblue") +
  labs(title = paste("Not Enough Bins")) +
  xlab("Years Since Last Significant Regime Change") +
  ylab("Count")+
  theme_minimal() 
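
To see what the number of bins does to the counts behind a histogram, the sketch below bins a made-up vector with base R's cut(). The helper count_bins and the values are hypothetical, and the equal-width bins from 0 to 90 are an assumption for illustration; ggplot2 places its bin edges somewhat differently, so the exact counts in a plot may not match.

```r
# Hypothetical "years since regime change" values.
years <- c(1, 2, 3, 4, 5, 8, 12, 20, 35, 60, 90)

# Hypothetical helper: count the values falling in n_bins equal-width
# bins between 0 and 90.
count_bins <- function(x, n_bins) {
  breaks <- seq(0, 90, length.out = n_bins + 1)
  as.vector(table(cut(x, breaks = breaks, include.lowest = TRUE)))
}

three_bins <- count_bins(years, 3)
nine_bins  <- count_bins(years, 9)
print(three_bins)   # almost everything lands in the first bin
print(nine_bins)    # the clustering near zero and the long tail are visible
```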

0.0.2.1.5 Stem-and-Leaf Plot

One additional descriptive plot that we have not discussed in class is the stem-and-leaf plot, which also helps us see a variable’s distribution when the number of observations is fairly small. Each leaf is the first digit after the stem. Let’s try a plot of per capita income for the 50 states, using the variable [inc].

stem(states$inc, scale=1)

  The decimal point is 3 digit(s) to the right of the |

  32 | 235778
  34 | 06769
  36 | 12556389
  38 | 022666
  40 | 1556668
  42 | 51
  44 | 0113575789
  46 | 3
  48 | 
  50 | 50
  52 | 26
  54 | 
  56 | 9

The stems increase in steps of 2,000, and because the decimal point is three digits to the right of the |, each leaf is a hundreds digit. In the first row, we see six numbers (six leaves), meaning there are six cases between 32,000 and 34,000. Because each row pools two thousands, we don’t know exactly what the numbers are; we just know they’re between 32,000 and 34,000.
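
The way a value is split into a stem and a leaf can be sketched with integer arithmetic. The incomes below are invented, and this sketch truncates digits rather than rounding, so it may differ slightly from what stem() prints for values near a boundary.

```r
# Hypothetical incomes (not taken from the states data).
inc <- c(32250, 32310, 33580, 34020)

stem_1000 <- inc %/% 1000            # thousands part of each value
leaf      <- (inc %% 1000) %/% 100   # hundreds digit, shown as the leaf
stem_2000 <- 2 * (stem_1000 %/% 2)   # rows that pool two thousands each

print(data.frame(inc, stem_2000, leaf))
```

The values 32,250, 32,310, and 33,580 all share the row labeled 32, which is why a leaf alone cannot tell us whether a case is in the 32,000s or the 33,000s.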

0.0.2.2 Bivariate descriptives

Bivariate descriptions illustrate the relationship between two variables, whether they are continuous or categorical. Bivariate plots help us understand the relationship between the variables and identify potential data errors.

0.0.2.2.1 Scatter Plot

Scatter plots are two-dimensional, showing the relationship between two continuous variables. We can also use the color or shape of the points to add a third variable. Let’s create a scatter plot of voting turnout and political knowledge (the percent of the population that recognizes the name of the governor for any given state).

ggplot(states, aes(knowgov, turnout, label = st)) +
  geom_point()

Again, let’s format our scatterplot.

ggplot(states, aes(knowgov, turnout, label = st)) +
  geom_point(col="orange") +
  ggtitle("Turnout Not Related to Political Knowledge") +
  ylab("Voting Turnout") +
  xlab("Political Knowledge") +
  theme_minimal() 

We can quickly tell there is not much of a relationship between political knowledge and voting turnout, as turnout does not systematically increase or decrease with increases in political knowledge.

Let’s use infant mortality and the percentage of a state’s population that has a high school degree in our States dataset as an example of a scatterplot showing a fairly strong relationship:

ggplot(states, aes(hsdiploma, infant)) +
  geom_point() 

Now let’s format our plot.

ggplot(states, aes(hsdiploma, infant)) +
  geom_point(col="red") +
  ggtitle("Education Reduces Infant Mortality") +
  ylab("Infant Mortality") +
  xlab("High School Diploma") +
  theme_minimal() 

And to take it up a level, let’s draw a line fit to the plot with the geom_smooth layer and a method = “lm” (for linear fit).

ggplot(states, aes(hsdiploma, infant)) +
  geom_point(col="red") +
  geom_smooth(method = "lm", se = FALSE, col="grey") +
  ggtitle("Same Education, Different Outcome") +
  ylab("Infant Mortality") +
  xlab("High School Diploma") +
  theme_minimal()

After adding this fitted line, we see that there is a negative association between education and infant mortality: as one variable increases (education), the other decreases (infant mortality). It is also a fairly linear relationship. The decline in infant mortality is constant over the range of education.
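
The line drawn by geom_smooth(method = "lm") is an ordinary least-squares fit, so its slope can be inspected directly with lm(). The sketch below uses five invented data points (not the states data) to show a negative slope and a negative correlation.

```r
# Hypothetical values: percent with a diploma and infant mortality.
toy <- data.frame(
  hsdiploma = c(80, 82, 84, 86, 88),
  infant    = c(10, 8.5, 6, 4.5, 2)
)

fit <- lm(infant ~ hsdiploma, data = toy)
slope <- coef(fit)[["hsdiploma"]]
print(slope)   # change in infant mortality per one-point rise in hsdiploma

# The correlation has the same sign as the slope.
r <- cor(toy$hsdiploma, toy$infant)
print(r)
```

A negative slope is the numeric counterpart of the downward-sloping line in the plot.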

0.0.2.2.2 Boxplot (Bivariate)

Boxplots can also be useful when looking at the relationship between a continuous and a categorical variable, as long as there are not too many categories (between 5 and 10). Let’s look at a boxplot of the feeling thermometer towards science by party identification. A feeling thermometer is a variable ranging from 0 to 100 that indicates how survey respondents feel about a particular kind of person or policy.

ggplot(nes, aes(pid7, ftsci)) +
  geom_boxplot(col="cadetblue") +
  theme_minimal() +
  theme(axis.text.x = element_text(size=8, angle=0, vjust=.7)) +
  ggtitle("Partisanship Shapes Attitudes") +
  ylab("Science Thermometer") +
  xlab("Party Identification") +
  coord_flip()

Let’s use the subset() function in our first ggplot() line to filter out the categories “NA” and “Not Sure” from the party identification variable.

ggplot(subset(nes, pid7!="NA" & pid7!="Not sure"), aes(pid7, ftsci)) +
  geom_boxplot(col="cadetblue") +
  theme_minimal() +
  theme(axis.text.x = element_text(size=8, angle=0, vjust=.7)) +
  ggtitle("Partisanship Shapes Attitudes") +
  ylab("Science Thermometer") +
  xlab("Party Identification") +
  coord_flip()

0.0.2.2.3 Mosaic Plot

A mosaic plot is an intuitive visual representation of a cross-tab (a table that displays the association between two categorical variables). Mosaic plots show the breakdown between the categories of the two variables, and also use the width of the bars to represent the number of observations in each category on the x-axis.

ggplot(nes)+
  geom_mosaic(aes(x = product(gender, pid3),
  fill = gender), na.rm = TRUE)   # na.rm is a geom option, not an aesthetic

Let’s filter out observations with the values “Not sure” and “Other”, as well as missing values (NA), keeping only the three main party categories.



# keep only the three main categories; %in% also drops missing values
ggplot(subset(nes, pid3 %in% c("Democrat", "Republican", "Independent"))) +
geom_mosaic(aes(x = product(gender, pid3),
  fill = gender), na.rm = TRUE) +
  xlab("") +
  ylab("") +
  ggtitle("Gender by Party Identification") +
  theme_minimal() +
  scale_fill_brewer(palette="Blues") +
  theme(legend.position = "none")


The mosaic plot indicates males make up a slightly larger percentage of Independents than either Republicans or Democrats. We can observe that women are not systematically more conservative than men. And we can look at the width of the columns to observe that a larger percentage of the population identifies with the Democratic party, then Independents, followed by Republicans.

Now, let’s create a mosaic plot using party identification and whether the respondent worries about terrorism from the NES data. We are first going to create a new variable, var2, that sets observations with the value “Not asked” to missing so they can be excluded.

# recode "Not asked" as missing (NA) so it can be filtered out
nes$var2 <- na_if(as.character(nes$terror_worry), "Not asked")

ggplot(data = subset(nes, !is.na(pid3) & !is.na(var2))) +
  geom_mosaic(aes(x = product(var2, pid3), fill = var2), na.rm = TRUE) +
  xlab("") +
  ylab("") +
  ggtitle("Democrats Are Less Worried About Terrorism") +
  theme_minimal() +
  scale_fill_brewer(palette="Blues") +
  theme(legend.position = "none") 

We can quickly see that Democrats are the least worried about a terrorist attack.

0.0.2.2.4 Cross-Tab

Cross-tabs are a commonly used summary of categorical variables. We are going to use one to look at the breakdown of categories using the variables party identification and gender from the NES data.

The cross-tab includes the count (frequency) in each cell along with the row, column, and table proportions. For example, looking at the Total column, we can see there are 611 females in the sample and that they represent 51.9% of the sample (0.519). If you add the proportions going down the Total column, they should add up to 100%.

CrossTable(nes$gender, nes$pid3,
           main="Cross-Tabulation of Gender and Party ID",
           prop.chisq=FALSE)
   Cell Contents 
|-------------------------|
|                       N | 
|           N / Row Total | 
|           N / Col Total | 
|         N / Table Total | 
|-------------------------|

======================================================
         nes$pid3
ns$gn    Dmcrt   Indpn   Nt sr   Other   Rpblc   Total
------------------------------------------------------
Femal      262     167       4      30     148     611
         0.429   0.273   0.007   0.049   0.242   0.519
         0.584   0.445   1.000   0.405   0.536        
         0.222   0.142   0.003   0.025   0.126        
------------------------------------------------------
Male       187     208       0      44     128     567
         0.330   0.367   0.000   0.078   0.226   0.481
         0.416   0.555   0.000   0.595   0.464        
         0.159   0.177   0.000   0.037   0.109        
------------------------------------------------------
Total      449     375       4      74     276    1178
         0.381   0.318   0.003   0.063   0.234        
======================================================
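
The proportions in the table can be reproduced from the counts alone with base R's prop.table(). The sketch below re-enters the counts printed above as a matrix; the full column names are an assumption, since the output abbreviates them.

```r
# Counts copied from the cross-tab above (column names spelled out).
counts <- matrix(c(262, 167, 4, 30, 148,
                   187, 208, 0, 44, 128),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(
                   gender = c("Female", "Male"),
                   pid3 = c("Democrat", "Independent", "Not sure",
                            "Other", "Republican")))

row_prop   <- round(prop.table(counts, margin = 1), 3)  # N / Row Total
col_prop   <- round(prop.table(counts, margin = 2), 3)  # N / Col Total
table_prop <- round(prop.table(counts), 3)              # N / Table Total

print(row_prop)
print(col_prop)
print(table_prop)

# Share of each gender in the whole sample (the Total column).
gender_share <- round(rowSums(counts) / sum(counts), 3)
print(gender_share)
```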

0.0.3 Quiz time

0.0.3.1 Let’s create more tables.

Let’s use kable again here to create a table of the dataframe “stdat” we created earlier with a subset of the states data. The package includes six ready-to-use themes. Let’s try kable_minimal() and add it with a pipe operator. Note: this follows the same code included in the “Tables” part of this tutorial, so you just need to apply it to the stdat dataset.

# Table using kable_minimal()
stdat %>% kbl()%>% 
  kable_minimal()
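| state | region | trumpwin | percwom | inc |
|---|---|---|---|---|
| Alabama | South | 1 | 75.49 | 34650 |
| Alaska | West | 1 | 79.02 | 45529 |
| Arizona | West | 1 | 84.00 | 35875 |
| Arkansas | South | 1 | 88.52 | 34014 |
| California | West | 0 | 89.94 | 44481 |
| Colorado | West | 0 | 79.57 | 44088 |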

Now let’s add a caption to our table: “Table a: States Data”

#now let's add a caption to our table
head(stdat) %>%   kbl(caption = "Table a: States Data")%>% 
        kable_minimal()
Table a: States Data
| state | region | trumpwin | percwom | inc |
|---|---|---|---|---|
| Alabama | South | 1 | 75.49 | 34650 |
| Alaska | West | 1 | 79.02 | 45529 |
| Arizona | West | 1 | 84.00 | 35875 |
| Arkansas | South | 1 | 88.52 | 34014 |
| California | West | 0 | 89.94 | 44481 |
| Colorado | West | 0 | 79.57 | 44088 |

Now let’s try another theme, this time kable_classic(). [Note: other themes to consider: kable_paper(), kable_classic_2(), kable_minimal(), kable_material().]

#now let's try another theme
head(stdat) %>%   kbl(caption = "Table a: States Data")%>% 
        kable_classic()
Table a: States Data
| state | region | trumpwin | percwom | inc |
|---|---|---|---|---|
| Alabama | South | 1 | 75.49 | 34650 |
| Alaska | West | 1 | 79.02 | 45529 |
| Arizona | West | 1 | 84.00 | 35875 |
| Arkansas | South | 1 | 88.52 | 34014 |
| California | West | 0 | 89.94 | 44481 |
| Colorado | West | 0 | 79.57 | 44088 |

If we set full_width to false (full_width = F), the table no longer stretches across the full width of the page. Notice the change:

head(stdat) %>%   kbl(caption = "Table a: States Data")%>% 
        kable_classic( full_width = F)
Table a: States Data
| state | region | trumpwin | percwom | inc |
|---|---|---|---|---|
| Alabama | South | 1 | 75.49 | 34650 |
| Alaska | West | 1 | 79.02 | 45529 |
| Arizona | West | 1 | 84.00 | 35875 |
| Arkansas | South | 1 | 88.52 | 34014 |
| California | West | 0 | 89.94 | 44481 |
| Colorado | West | 0 | 79.57 | 44088 |

Now let’s specify (“hover”, full_width = F) and after running the chunk hover your mouse over the table.

#let's specify hover
head(stdat) %>%   kbl(caption = "Table a: States Data")%>% 
        kable_classic("hover", full_width = F) 
Table a: States Data
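| state | region | trumpwin | percwom | inc |
|---|---|---|---|---|
| Alabama | South | 1 | 75.49 | 34650 |
| Alaska | West | 1 | 79.02 | 45529 |
| Arizona | West | 1 | 84.00 | 35875 |
| Arkansas | South | 1 | 88.52 | 34014 |
| California | West | 0 | 89.94 | 44481 |
| Colorado | West | 0 | 79.57 | 44088 |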

We can also specify more informative names for our columns. Let’s try changing the labels using col.names: col.names = c("State", "Region", "Trump Won", "Percent Women", "Income")

#let's change the labels for our columns
stdat %>%   kbl(caption = "Table a: States Data",
                  col.names = c("State", "Region", "Trump Won", "Percent Women", "Income"))%>% 
        kable_classic("hover", full_width = F)
Table a: States Data
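| State | Region | Trump Won | Percent Women | Income |
|---|---|---|---|---|
| Alabama | South | 1 | 75.49 | 34650 |
| Alaska | West | 1 | 79.02 | 45529 |
| Arizona | West | 1 | 84.00 | 35875 |
| Arkansas | South | 1 | 88.52 | 34014 |
| California | West | 0 | 89.94 | 44481 |
| Colorado | West | 0 | 79.57 | 44088 |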

0.0.3.2 Time to plot

Using the states dataset, produce four different plots (two univariate -single variable- and two bivariate -two variables-) following the examples in the tutorial.

#your code for plot 1 here
ggplot(states, aes(x="", democrat)) +
  geom_boxplot() 

#your code for plot 2 here
ggplot(states, aes(region)) +
  geom_bar() 

#your code for plot 3 here
ggplot(states)+
  geom_mosaic(aes(x = product(region, weed),
  fill = region), na.rm = TRUE)

#your code for plot 4 here
CrossTable(states$region, states$weed,
           main="Cross-Tabulation of Region and Weed Usage",
           prop.chisq=FALSE)
   Cell Contents 
|-------------------------|
|                       N | 
|           N / Row Total | 
|           N / Col Total | 
|         N / Table Total | 
|-------------------------|

======================================
                 states$weed
states$region        0       1   Total
--------------------------------------
Midwest              7       5      12
                 0.583   0.417   0.240
                 0.333   0.172        
                  0.14    0.10        
--------------------------------------
Northeast            0       9       9
                 0.000   1.000   0.180
                 0.000   0.310        
                  0.00    0.18        
--------------------------------------
South               11       5      16
                 0.688   0.312   0.320
                 0.524   0.172        
                  0.22    0.10        
--------------------------------------
West                 3      10      13
                 0.231   0.769   0.260
                 0.143   0.345        
                  0.06    0.20        
--------------------------------------
Total               21      29      50
                  0.42    0.58        
======================================

0.0.4 Additional notes

Add a new chunk by clicking the Insert Chunk button on the toolbar or by pressing Cmd+Option+I.

When you save the notebook, an HTML file containing the code and output will be saved alongside it (click the Preview button or press Cmd+Shift+K to preview the HTML file).

The preview shows you a rendered HTML copy of the contents of the editor. Consequently, unlike Knit, Preview does not run any R code chunks. Instead, the output of the chunk when it was last run in the editor is displayed.

---
title: "PUAD 5140"
subtitle: "Describing Data"
author: "Jesse Lewis"
date: "November 22, 2022"
output:
  html_notebook:
    toc: yes
    toc_float: yes
    number_sections: yes
    highlight: haddock
    toc_depth: 5
  pdf_document:
    toc: yes
  html_document:
    toc: yes
    toc_float: yes
    theme: cerulean
    highlight: haddock
    toc_depth: 5
---

### First things first. 

Install and load all required packages. 
```{r}
# install.packages("devtools")
# install.packages("tidyverse")
# install.packages("kableExtra")
# install.packages("visreg")
# install.packages("stargazer")
# install.packages("ggrepel")
# install.packages("gridExtra")
# install.packages("fBasics")
# install.packages("DescTools")
# install.packages("ggmosaic")
install.packages("descr")  # the package name must be quoted


```
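
To avoid reinstalling packages on every run, one option is to install only what is missing. This is a minimal sketch: `missing_packages()` is a hypothetical helper (not part of any package), and the package list is an assumption about what this notebook needs.

```{r}
# Hypothetical helper: return the packages from `pkgs` that are not yet installed.
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

# Assumed list of packages used in this notebook; adjust as needed.
required <- c("tidyverse", "kableExtra", "ggrepel", "ggmosaic", "DescTools", "descr")

to_install <- missing_packages(required)
print(to_install)

# Uncomment to install whatever is missing:
# if (length(to_install) > 0) install.packages(to_install)
```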


```{r}
# Load the packages used in this notebook
library(tidyverse)
library(kableExtra)
library(ggrepel)
library(DescTools)
library(ggmosaic)
library(descr)
# Not used below; uncomment if you need them
# library(visreg)
# library(stargazer)
# library(gridExtra)
# library(fBasics)

```

### Exploring the data 


Descriptive summary statistics help us generate useful descriptions of our dataset, identify data anomalies, and frame our questions and hypotheses. The aim of this tutorial/quiz is to help you become familiar with useful R functions and packages for descriptive analysis. 

Let's create a subset of the state data. The variable state is the state's name, region indicates the region, trumpwin records whether Donald Trump won the state's electoral votes in the 2016 race, percwom is the average woman's salary relative to men's, and ***inc*** records per capita income. Note: remember to install and load your packages. 

```{r}
states<-read_csv(file.choose())
stdat <- head(dplyr::select(states, state, region, trumpwin, percwom, inc))
head(stdat)
```

And a subset of the world data:

```{r}
world<-read_csv(file.choose())
wdat <- world %>% select(country, regime,aclpregion, fdi, fhliberties, inf,nourish, health, gdppc, turnout, womyear)
head(wdat)
```




#### Tables


We are going to use the kbl() function from the kableExtra package (a wrapper around knitr's kable()) to create good-looking tables. Take a moment to explore the package documentation using RStudio help by typing ?kableExtra, or use the link: https://haozhu233.github.io/kableExtra/awesome_table_in_html.html 

```{r}
?kableExtra
```

Let's use it here to create a formatted table of the data frame "wdat" we just created with a subset of the world data. The package includes six ready-to-use themes. Let's try kable_minimal(), adding it with a pipe operator as in this example. Since we are not using any summary statistics in our table, we can use the head() function to print just the first rows of wdat.
```{r}

head(wdat) %>% kbl()%>% 
  kable_minimal()
```


Now let's add a caption to our table. 

```{r}
#now let's add a caption to our table
head(wdat) %>%   kbl(caption = "Table a: World Data")%>% 
        kable_minimal()
```

Now let's try another theme, this time kable_classic() [note: other themes to consider: kable_paper(), kable_classic_2(), kable_minimal(), kable_material()].
```{r}
#now let's try another theme
head(wdat) %>%   kbl(caption = "Table a: World Data")%>% 
        kable_classic()
```

If we set full_width to false (full_width = F), notice how the table no longer stretches across the page.

```{r}
#now let's set full_width = F
head(wdat) %>%   kbl(caption = "Table a: World Data")%>% 
        kable_classic( full_width = F)
```

Now let's specify "hover" and, after running the chunk, hover your mouse over the table.
```{r}
#let's specify hover
head(wdat) %>%   kbl(caption = "Table a: World Data")%>% 
        kable_classic("hover", full_width = F) 
```

We can also specify more informative names for our columns. Let's try changing the labels using col.names, this time with the stdat table:

```{r}
#let's change the labels for our columns

stdat %>%   kbl(caption = "Table a: States Data",
                  col.names = c("State", "Region", "Trump Won", "Percent Women", "Income"))%>% 
        kable_classic("hover", full_width = F)
```


#### Graphs

Let's create a simple ggplot graph of the variable regime. A bar plot shows the frequency for the variable; that is, the y-axis shows the count of observations in our dataset for each category of regime. 

```{r}
 world %>% ggplot( aes(regime))+
geom_bar()

```
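
The counts that geom_bar() draws can also be checked in the console. Here is a small sketch using a made-up vector; the regime labels below are illustrative values, not the world data.

```{r}
# Hypothetical regime labels for six countries (illustrative only).
regime_toy <- c("Mixed Democracy", "Royal Dictatorship", "Mixed Democracy",
                "Presidential Democracy", "Mixed Democracy", "Royal Dictatorship")

# table() counts the observations in each category:
# these are the bar heights a bar plot of counts would show.
regime_counts <- table(regime_toy)
print(regime_counts)
```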

We can assign the graph to an object; let's call it wplot1. Remember that we need to type the object's name in the chunk if we want R to "print" it to our screen. As a bonus, we can use the fill option to color our bars.
```{r}
wplot1<-  world %>% ggplot( aes(regime) )+
geom_bar(fill="lightblue")

wplot1
```

Now let's make a bar plot of foreign direct investment [fdi] (net foreign investment inflows as a percentage of GDP) against regime type. We will first group the data by the variable regime, then create a new variable ***m*** that holds the mean of fdi for each regime, then ungroup the data again and produce our graph of mean fdi by regime. Remember that each of these steps can be strung together in the tidyverse with the pipe operator [%>%].

```{r}

wdat %>%                                  # name of the dataset
  filter(!is.na(fdi)) %>%                 # drop rows with missing fdi
  group_by(regime) %>%                    # grouping the data by type of regime
  summarize(m = mean(fdi)) %>%            # calculating the mean of fdi
  ungroup() %>%                           # ungroup the data again
  ggplot(aes(x = regime, y = m)) +        # set up the graph
  geom_col()                              # set geometry of the graph
```
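
To see what the filter, group, and summarize steps compute, here is a base R sketch on a made-up data frame. The values are hypothetical and not taken from wdat.

```{r}
# Hypothetical toy data: two regimes, one missing fdi value.
toy <- data.frame(
  regime = c("A", "A", "B", "B", "B"),
  fdi    = c(2, 4, 1, NA, 7)
)

# Drop rows with missing fdi, then take the mean of fdi within each regime.
toy_complete <- toy[!is.na(toy$fdi), ]
mean_fdi <- tapply(toy_complete$fdi, toy_complete$regime, mean)
print(mean_fdi)
```

Without the filtering step, mean() would return NA for any regime that contains a missing value.
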
Now, let's add the fill option to the aesthetics of our plot to color the bars by category of regime.

```{r}

wdat %>% 
  filter(!is.na(fdi)) %>%                 # drop rows with missing fdi
  group_by(regime) %>%                    # grouping the data by type of regime
  summarize(m = mean(fdi)) %>%            # calculating the mean of fdi
  ungroup() %>%                           # ungroup the data again
  ggplot( aes(x=regime, y=m, fill=regime))+            # set up the graph
geom_col()                                # set geometry of the graph
```

Let's see if our plot is more readable when we rotate it, changing the orientation of the bars with the coord_flip() function.
```{r}
wdat %>% 
  filter(!is.na(fdi)) %>%                 # drop rows with missing fdi
  group_by(regime) %>%                    # grouping the data by type of regime
  summarize(m = mean(fdi)) %>%            # calculating the mean of fdi
  ungroup() %>%                           # ungroup the data again
  ggplot( aes(x=regime, y=m,fill=regime)) +   # set up the graph
geom_col() +                              # set geometry of the graph
coord_flip()                               #flip the orientation of the bar 
```

Next, our categories can be reordered so that Royal Dictatorship is not right next to Presidential Democracy. 
We first need to define a new variable reg2 that reorders the levels (categories) of the regime variable.

```{r}

wdat <- wdat %>% mutate(reg2 = factor(regime,
                  levels = c("Royal Dictatorship",
                              "Military Dictatorship",
                              "Civilian Dictatorship",
                              "Mixed Democracy",
                              "Parliamentary Democracy",
                              "Presidential Democracy")))
wdat %>% 
  filter(!is.na(fdi)) %>%                 # drop rows with missing fdi
  group_by(reg2) %>%                    # grouping the data by type of regime
  summarize(m = mean(fdi)) %>%            # calculating the mean of fdi
  ungroup() %>%                           # ungroup the data again
  ggplot( aes(x=reg2, y=m,fill=reg2)) +   # set up the graph
geom_col() +                              # set geometry of the graph
coord_flip()                               #flip the orientation of the bar 

```
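
A short sketch of how factor() controls category order, using a made-up vector (hypothetical values, not the regime variable itself):

```{r}
# Hypothetical labels; by default, factor levels are sorted alphabetically.
x <- c("Presidential Democracy", "Royal Dictatorship", "Mixed Democracy")
default_levels <- levels(factor(x))
print(default_levels)

# Supplying `levels` fixes the order that plots and tables will use.
x_ordered <- factor(x, levels = c("Royal Dictatorship",
                                  "Mixed Democracy",
                                  "Presidential Democracy"))
print(levels(x_ordered))
```

Any value that is not listed in levels becomes NA, so the spelling must match the data exactly.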


Now let's add a ready-to-use theme: theme_minimal(). We also limit the axis to the range 0 to 10 to better fit the bars and, of course, add more informative labels for our variables.

```{r}

wdat <- wdat %>% mutate(reg2 = factor(regime,
                  levels = c("Royal Dictatorship",
                              "Military Dictatorship",
                              "Civilian Dictatorship",
                              "Mixed Democracy",
                              "Parliamentary Democracy",
                              "Presidential Democracy")))
wdat %>% 
  filter(!is.na(fdi)) %>%                 # drop rows with missing fdi
  group_by(reg2) %>%                    # grouping the data by type of regime
  summarize(m = mean(fdi)) %>%            # calculating the mean of fdi
  ungroup() %>%                           # ungroup the data again
  ggplot( aes(x=reg2, y=m,fill=reg2)) +   # set up the graph
geom_col() +                              # set geometry of the graph
coord_flip(ylim=c(0,10)) +  
  theme_minimal()+
  ylab("Foreign direct investment") +
  xlab("Regime") 
```

We need to know the shape of the data first to determine what summaries and models to use. When we look at the distribution of our variables, we can learn whether cases are evenly distributed or whether values cluster closer to the minimum, median, or maximum. 

Let's use two examples of our world dataset. First we will explore the histogram of the variable turnout. And let's add another layer with a density plot to see the smooth density line. For a great explanation of kernel density I recommend this link: https://mathisonian.github.io/kde/

```{r}

world %>% ggplot(aes(x = turnout)) + 
  geom_histogram(aes(y = after_stat(density)),  # after_stat() replaces the deprecated ..density..
                 colour = 1, fill = "white") +
  geom_density() + 
  xlab("Voter Turnout") +
  ggtitle("Voter Turnout Is Approximately Normally Distributed") 

```

We can see that voter turnout appears to be roughly normally distributed. Now let's look at the shape of infant mortality around the world (the number of deaths in each country per 1,000 live births), using the variable inf.

```{r}

world %>% ggplot(aes(x = inf)) + 
  geom_histogram(aes(y = after_stat(density)),  # after_stat() replaces the deprecated ..density..
                 colour = 1, fill = "white") +
  geom_density() + 
  xlab("Infant Mortality Rate") +
  ggtitle("Infant Mortality Is Not Normally Distributed") 

```
We can see from the histogram that many countries have low infant mortality and, unfortunately, some countries have 30, 60, and even 90 deaths per 1,000 live births.
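
One rough numeric check of skew is to compare the mean and the median: in a right-skewed variable, the mean is pulled above the median. A sketch with made-up values (hypothetical, not the inf variable):

```{r}
# Hypothetical infant mortality rates: mostly low, with a few very high values.
rates <- c(3, 4, 4, 5, 6, 7, 9, 30, 60, 90)

mean_rate   <- mean(rates)    # pulled upward by the few large values
median_rate <- median(rates)  # middle of the sorted values
print(c(mean = mean_rate, median = median_rate))
```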


    
#### Five views of univariate data

##### Frequency Table

Use a frequency table when the variable is categorical. The frequency table indicates how many cases fall in each category, giving the category's relative size. Let's use the freq() command to get a frequency table of party identification (pid3) in the National Election Studies (nes) dataset and see how many Democrats, Republicans, and Independents are included. 
First, let's switch the plot option off (descr.plot).
```{r}
nes<-read_csv(file.choose())
freq(nes$pid3, plot = FALSE)  # freq() from the descr package; table only, no plot
head(nes)
```

Now, if we set plot = TRUE, we will get the frequency table and a bar plot. 
```{r}
freq(nes$pid3, plot = TRUE)
```

Frequency tables give us a good sense of the categories in a variable and help us decide whether to combine or exclude some of them. For instance, if we look at a frequency table of the variable employ from the NES dataset, we can see that there are nine categories and that "temporarily laid off" has only four respondents, so we might want to consider combining it with the "other" category.

```{r}
freq(nes$employ, plot = FALSE)
```
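
A sketch of how a sparse category might be combined with another before tabulating, using a made-up vector. The labels and counts are hypothetical, not the NES values.

```{r}
# Hypothetical employment responses.
employ_toy <- c("Full-time", "Full-time", "Part-time", "Retired",
                "Temporarily laid off", "Other", "Full-time")

# Recode the sparse category into "Other".
employ_toy[employ_toy == "Temporarily laid off"] <- "Other"

# Counts and percentages per category.
employ_counts <- table(employ_toy)
employ_pct <- round(100 * prop.table(employ_counts), 1)
print(employ_counts)
print(employ_pct)
```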

##### Bar Plot

Bar plots are useful for identifying the categories with the highest frequency and are an easy way to make comparisons across categories. Let's try creating a bar plot of our employment variable.

```{r}
ggplot(nes, aes(employ)) +
  geom_bar() 
```
Now, let's use additional options to improve our plot: we will add a theme, add a title for the plot [ggtitle], add a label for the y-axis, and remove the label for the x-axis by setting it to "". We will also use [coord_flip] to change the orientation of our plot, and change the color with the fill option. Here we are using "lightblue", but feel free to play with other colors [https://r-graph-gallery.com/ggplot2-color.html].

```{r}
ggplot(nes, aes(employ)) +
  geom_bar(fill = "lightblue") +
  theme_minimal() +
  ggtitle("Barplot of Employment Variable") +
  xlab("") +
  ylab("Number of Respondents") +
  coord_flip()

```
We can see that a large share of the population is employed full-time. 


##### Boxplot (or Box-and-Whisker Plot)

Boxplots are useful for plotting the distribution of a continuous variable. Let's use a boxplot to look at the percentage of the voting-age population that voted in each country's last national election. Since we only want to plot one variable, we need to specify x = "". The y variable will be our "turnout" variable from the world dataset. 

```{r}
ggplot(world, aes(x="", turnout)) +
  geom_boxplot() 

```

We can see the median (the thick line in the middle of the box) is a bit above 60% and the interquartile range (the box) is between 55% and 80%. Now, let's format the plot a bit more. 


```{r}
ggplot(world, aes(x="", turnout)) +
  geom_boxplot(col="darkgreen") +
  theme_minimal() +
  ggtitle("Boxplot of Turnout") +
  ylab("Percent Voting in Last Election") +
  xlab("")

```
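
The median and interquartile range read off a boxplot can also be computed directly. A sketch with made-up turnout values (hypothetical, not the world data), using base R's median(), quantile(), and IQR():

```{r}
# Hypothetical turnout percentages.
turnout_toy <- c(45, 52, 55, 58, 61, 63, 70, 78, 80, 91)

med <- median(turnout_toy)                  # the thick line inside the box
q   <- quantile(turnout_toy, c(0.25, 0.75)) # first and third quartiles (box edges)
iqr <- IQR(turnout_toy)                     # third quartile minus first quartile

print(med)
print(q)
print(iqr)
```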

Remember that boxplots don't typically show individual observations, so we will add them as points and "jitter" them so that they don't lie directly on top of each other. We control the size of the points with the [size] option, the fill color with the [fill] option, and the outline color with the [colour] option.

```{r}

ggplot(world, aes(x="", turnout)) +
  geom_boxplot(col="darkgreen") +
  theme_minimal() +
  ggtitle("Boxplot of Turnout") +
  ylab("Percent Voting in Last Election") +
  xlab("") +
    geom_point(fill = "grey", size=1,
             shape=21, colour="grey",
             position=position_jitter(seed = 1))

```

As a bonus trick, we can add a geom_text_repel layer where we use a conditional to show the label only for Canada and the US. We use the ifelse() function to say that if dpicode is equal to Canada (CAN) or the US (USA), then label the point with the dpicode variable; otherwise leave it blank. We specify "blank" with two single quotation marks with nothing in between ('').


```{r}

ggplot(world, aes(x="", turnout)) +
  geom_boxplot(col="darkgreen") +
  theme_minimal() +
  ggtitle("Boxplot of Turnout") +
  ylab("Percent Voting in Last Election") +
  xlab("") +
    geom_point(fill = "grey", size=1,
             shape=21, colour="grey",
             position=position_jitter(seed = 1)) +
     geom_text_repel(aes(label=ifelse(dpicode=="CAN" | dpicode=="USA", as.character(dpicode),'')), col ="grey",
                  position = position_jitter(seed = 1),
                  hjust=1, vjust=2)      


```
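
To see what the ifelse() call returns on its own, here is a sketch with a made-up vector of country codes (hypothetical values, not the dpicode column):

```{r}
# Hypothetical country codes.
codes <- c("CAN", "MEX", "USA", "BRA")

# Keep the code for Canada and the US; use an empty string for everyone else.
labels_toy <- ifelse(codes == "CAN" | codes == "USA", codes, "")
print(labels_toy)
```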

    
##### Histogram

We use histograms to visualize the frequency distribution of continuous variables. They show the shape of the variable's distribution and how dispersed its values are. Each bar represents a range of values (a bin), and we can choose the number of bins that makes the most sense for our data. Let's create a histogram of the years since the last regime change for all countries in the World dataset using the variable [durable].

```{r}
ggplot(world, aes(durable)) +
  geom_histogram() 
```
Now let's change the number of bins to 10 with the option [bins], and format the histogram by setting the color and the theme and by labeling the variable and the plot.
```{r}
ggplot(world, aes(durable)) +
  geom_histogram(bins=10, fill = "lightblue") +
  ggtitle("Histogram of the durable Variable") +
  xlab("Years Since Last Significant Regime Change") +
  theme_minimal() 

```
This is a much more informative histogram, so it's important to think about the choice of bins. For instance, if we were to use only three bins, the resulting histogram would not be as useful:

```{r}
ggplot(world, aes(durable)) +
  geom_histogram(bins=3, fill = "lightblue") +
  labs(title = paste("Not Enough Bins")) +
  xlab("Years Since Last Significant Regime Change") +
  ylab("Count")+
  theme_minimal() 

```
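
How many observations land in each bin depends on the number of bins. A base R sketch with made-up values (hypothetical, not the durable variable); cut() is used here only to count cases per equal-width interval, as an illustration of binning rather than a reproduction of geom_histogram():

```{r}
# Hypothetical years since the last regime change.
years_toy <- c(1, 2, 3, 5, 8, 12, 20, 35, 60, 90)

# Three equal-width bins lump most cases into a single interval ...
bins3 <- table(cut(years_toy, breaks = 3))
# ... while ten bins spread the same cases over narrower intervals.
bins10 <- table(cut(years_toy, breaks = 10))

print(bins3)
print(bins10)
```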


##### Stem-and-Leaf Plot

One additional descriptive plot that we have not discussed in class is the stem-and-leaf plot, which also helps us see a variable's distribution when the number of observations is fairly small. Each leaf is the first digit after the stem's decimal point. Let's try a plot of per capita income for the 50 states, using the variable [inc].


```{r}
stem(states$inc, scale=1)

```


The stems increase in steps of 2,000, so each row covers $2,000. In the first row, we see six numbers (six leaves), meaning there are six cases between 32,000 and 34,000. In this case, we don't know exactly what the numbers are; we just know they're between 32,000 and 34,000. 
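
A sketch of stem() on a handful of made-up incomes (hypothetical values, not states$inc), followed by a separate count of cases in $2,000 intervals that mirrors the grouping described above. The stems that stem() chooses for this toy data may differ from those in the states output.

```{r}
# Hypothetical per capita incomes in dollars.
inc_toy <- c(32100, 33400, 33900, 34650, 35875, 36200, 38900, 41000, 44088, 45529)

# Text-based stem-and-leaf plot printed to the console.
stem(inc_toy, scale = 1)

# Count of cases in each $2,000 interval, starting at 32,000.
steps <- table(cut(inc_toy, breaks = seq(32000, 46000, by = 2000), right = FALSE))
print(steps)
```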



#### Bivariate descriptives

Bivariate descriptions illustrate the relationship between two variables, whether they are continuous or categorical. Bivariate plots help us understand the relationship between the variables and identify potential data errors. 

##### Scatter Plot

Scatter plots are two-dimensional, showing the relationship between two continuous variables. We can also use the color or shape of the points to add a third variable. Let's create a scatter plot of voting turnout and political knowledge (the percentage of a state's population that recognizes the name of its governor).



```{r}
ggplot(states, aes(knowgov, turnout, label = st)) +
  geom_point()
```

Again, let's format our scatterplot. 
```{r}
ggplot(states, aes(knowgov, turnout, label = st)) +
  geom_point(col="orange") +
  ggtitle("Turnout Not Related to Political Knowledge") +
  ylab("Voting Turnout") +
  xlab("Political Knowledge") +
  theme_minimal() 

```
We can quickly tell there is not much of a relationship between political knowledge and voting turnout, as turnout does not systematically increase or decrease with increases in political knowledge. 


Let's use infant mortality and the percentage of a state’s population that has a high school degree in our States dataset as an example of a scatterplot showing a fairly strong relationship:

```{r}
ggplot(states, aes(hsdiploma, infant)) +
  geom_point() 
```
Now let's format our plot. 
```{r}
ggplot(states, aes(hsdiploma, infant)) +
  geom_point(col="red") +
  ggtitle("Education Reduces Infant Mortality") +
  ylab("Infant Mortality") +
  xlab("High School Diploma") +
  theme_minimal() 

```

And to take it up a level, let's draw a line fit to the plot with the geom_smooth layer and a method = "lm" (for linear fit).

```{r}
ggplot(states, aes(hsdiploma, infant)) +
  geom_point(col="red") +
  geom_smooth(method = "lm", se = FALSE, col="grey") +
  ggtitle("Same Education, Different Outcome") +
  ylab("Infant Mortality") +
  xlab("High School Diploma") +
  theme_minimal()
```
After adding this fitted line, we see that there is a negative association between education and infant mortality: as one variable increases (education), the other decreases (infant mortality). The relationship is also fairly linear: the decline in infant mortality is roughly constant over the range of education.
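
The direction and strength of a linear relationship can also be summarized numerically. A sketch with made-up values (hypothetical, not the states data), using cor() and lm() from base R; lm() fits the same kind of least-squares line that geom_smooth(method = "lm") draws.

```{r}
# Hypothetical state-level values.
hs_toy     <- c(78, 80, 82, 85, 87, 90, 92)         # percent with a high school diploma
infant_toy <- c(9.1, 8.6, 8.0, 7.2, 6.9, 6.1, 5.5)  # infant deaths per 1,000 live births

r     <- cor(hs_toy, infant_toy)                    # Pearson correlation
slope <- coef(lm(infant_toy ~ hs_toy))[["hs_toy"]]  # slope of the fitted line

print(c(correlation = r, slope = slope))
```

A negative correlation and a negative slope both indicate that the second variable tends to fall as the first one rises.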


##### Boxplot (Bivariate)

Boxplots can also be useful when looking at the relationship between a continuous and a categorical variable, if there are not too many categories (between 5 and 10). Let's look at a boxplot of the feeling thermometer toward science by party identification (feeling thermometers range from 0 to 100 and indicate how survey respondents feel about a particular kind of person or policy).  


```{r}
ggplot(nes, aes(pid7, ftsci)) +
  geom_boxplot(col="cadetblue") +
  theme_minimal() +
  theme(axis.text.x = element_text(size=8, angle=0, vjust=.7)) +
  ggtitle("Partisanship Shapes Attitudes") +
  ylab("Science Thermometer") +
  xlab("Party Identification") +
  coord_flip()

```

Let's use the subset() function in our first ggplot() line to filter out the categories "NA" and "Not sure" from the party identification variable. 
```{r}
ggplot(subset(nes, pid7!="NA" & pid7!="Not sure"), aes(pid7, ftsci)) +
  geom_boxplot(col="cadetblue") +
  theme_minimal() +
  theme(axis.text.x = element_text(size=8, angle=0, vjust=.7)) +
  ggtitle("Partisanship Shapes Attitudes") +
  ylab("Science Thermometer") +
  xlab("Party Identification") +
  coord_flip()

```
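
One detail worth knowing: subset() keeps only rows where the condition is TRUE, so rows where the comparison evaluates to NA are dropped as well. A sketch with a made-up data frame (hypothetical values, not the NES data):

```{r}
# Hypothetical party identification with one missing value.
toy_nes <- data.frame(
  pid   = c("Strong Democrat", "Not sure", NA, "Strong Republican"),
  ftsci = c(85, 60, 70, 65)
)

# NA != "Not sure" evaluates to NA, and subset() treats NA as "do not keep".
kept <- subset(toy_nes, pid != "Not sure")
print(kept)
```
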
##### Mosaic Plot

A mosaic plot is an intuitive visual representation of a cross-tab (a table that displays the association between two categorical variables). Mosaic plots show the breakdown between the categories of the two variables, and also use the width of the bars to represent the number of observations in each category on the x-axis.

```{r}
ggplot(nes)+
  geom_mosaic(aes(x = product(gender, pid3),
                  fill = gender), na.rm = TRUE)
```
Let's filter out observations with the values "Not sure" and "Other", along with missing values, by keeping only Democrats, Republicans, and Independents.
```{r}
# %in% is FALSE for missing values, so rows with NA in pid3 are dropped too.
# The three labels are assumed to match the values stored in pid3.
ggplot(subset(nes, pid3 %in% c("Democrat", "Republican", "Independent"))) +
  geom_mosaic(aes(x = product(gender, pid3),
                  fill = gender), na.rm = TRUE) +
  xlab("") +
  ylab("") +
  ggtitle("Gender by Party Identification") +
  theme_minimal() +
  scale_fill_brewer(palette="Blues") +
  theme(legend.position = "none")


  
```

The mosaic plot indicates that males make up a slightly larger percentage of Independents than of either Republicans or Democrats. We can observe that women are not systematically more conservative than men. And we can look at the width of the columns to observe that the largest share of respondents identifies with the Democratic party, followed by Independents and then Republicans.

Now, let's create a mosaic plot using party identification and whether the respondent worries about terrorism, both from the NES data. We are first going to create a new variable, var2, that lets us exclude observations with the value "Not asked".

```{r}
# Recode "Not asked" as missing so those respondents can be dropped
nes$var2 <- dplyr::na_if(as.character(nes$terror_worry), "Not asked")

ggplot(data = subset(nes, !is.na(pid3) & !is.na(var2))) +
  geom_mosaic(aes(x = product(var2, pid3), fill = var2), na.rm = TRUE) +
  xlab("") +
  ylab("") +
  ggtitle("Democrats Are Less Worried About Terrorism") +
  theme_minimal() +
  scale_fill_brewer(palette="Blues") +
  theme(legend.position = "none") 

```

We can quickly see that Democrats are the least worried about a terrorist attack.


##### Cross-Tab

Cross-tabs are a commonly used summary of categorical variables. We are going to use one to look at the breakdown of categories for party identification and gender from the NES data.


The cross-tab includes the count (frequency) along with the row, column, and table proportions. For example, looking at the Total column, we can see there are 577 females in the sample and that they represent 52.5% of the sample (.525). If you add the proportions going down the Total column, they should sum to 100%.


```{r}
CrossTable(nes$gender, nes$pid3,
           main="Cross-Tabulation of Gender and Party ID",
           prop.chisq=FALSE)
```
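
The same kinds of counts and proportions can be reproduced with base R's table() and prop.table(). A sketch with made-up data (hypothetical respondents, not the NES sample):

```{r}
# Hypothetical respondents.
gender_toy <- c("Female", "Female", "Female", "Male", "Male", "Female", "Male", "Male")
party_toy  <- c("Democrat", "Democrat", "Republican", "Republican",
                "Democrat", "Independent", "Independent", "Republican")

counts    <- table(gender_toy, party_toy)     # cell counts (N)
row_props <- prop.table(counts, margin = 1)   # N / row total
col_props <- prop.table(counts, margin = 2)   # N / column total

print(counts)
print(round(row_props, 3))
print(round(col_props, 3))
```

Each row of row_props sums to 1, and each column of col_props sums to 1.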

### Quiz time 

#### Let's create more tables. 



Let's use kable again to create a table of the data frame "stdat" we created earlier with a subset of the states data. The package includes six ready-to-use themes. Let's try kable_minimal(), adding it with a pipe operator. Note: this follows the same code included in the "Tables" part of this tutorial, so you just need to apply it to the stdat dataset. 

```{r}
# Table using the kable_minimal() theme
stdat %>% kbl()%>% 
  kable_minimal()
```
Now let's add a caption to our table: "Table a: States Data"

```{r}
#now let's add a caption to our table
head(stdat) %>%   kbl(caption = "Table a: States Data")%>% 
        kable_minimal()


```

Now let's try another theme, this time kable_classic() [note: other themes to consider: kable_paper(), kable_classic_2(), kable_minimal(), kable_material()].
```{r}
#now let's try another theme
head(stdat) %>%   kbl(caption = "Table a: States Data")%>% 
        kable_classic()


```

If we set full_width to false (full_width = F), notice how the table no longer stretches across the page.

```{r}
head(stdat) %>%   kbl(caption = "Table a: States Data")%>% 
        kable_classic( full_width = F)


```

Now let's specify ("hover", full_width = F) and after running the chunk hover your mouse over the table.
```{r}
#let's specify hover
head(stdat) %>%   kbl(caption = "Table a: States Data")%>% 
        kable_classic("hover", full_width = F) 
```

We can also specify more informative names for our columns. Let's try changing the labels using col.names:  col.names = c("State", "Region", "Trump Won", "Percent Women", "Income")

```{r}
#let's change the labels for our columns
stdat %>%   kbl(caption = "Table a: States Data",
                  col.names = c("State", "Region", "Trump Won", "Percent Women", "Income"))%>% 
        kable_classic("hover", full_width = F)
```



#### Time to plot 

Using the states dataset, produce four different plots (two univariate -single variable- and two bivariate -two variables-) following the examples in the tutorial. 

```{r}
#your code for plot 1 here
ggplot(states, aes(x="", democrat)) +
  geom_boxplot() 

```



```{r}
#your code for plot 2 here
ggplot(states, aes(region)) +
  geom_bar() 

```

```{r}
#your code for plot 3 here
ggplot(states)+
  geom_mosaic(aes(x = product(region, weed),
                  fill = region), na.rm = TRUE)

```


```{r}
#your code for plot 4 here
CrossTable(states$region, states$weed,
           main="Cross-Tabulation of Region and Weed Usage",
           prop.chisq=FALSE)
```




### Additional notes


Add a new chunk by clicking the *Insert Chunk* button on the toolbar or by pressing *Cmd+Option+I*.

When you save the notebook, an HTML file containing the code and output will be saved alongside it (click the *Preview* button or press *Cmd+Shift+K* to preview the HTML file). 

The preview shows you a rendered HTML copy of the contents of the editor. Consequently, unlike *Knit*, *Preview* does not run any R code chunks. Instead, the output of the chunk when it was last run in the editor is displayed.

