OpenStatistics Lab #1: Introduction to Data
Question 2: Generating a Chloropleth of Real GDP Contribution Per State, 2013
Question 3:
Question 4:Brief Description of Final Project

OpenStatistics Lab #1: Introduction to Data

Exercise 1: How many cases are there in this data set? How many variables? For each variable, identify its data type (e.g. categorical, discrete).

Answer:

There are 20,000 cases in the subset examined for this lab, with nine variables included for each case.

The data types are:

genhlth: categorical
exerany: binary
helthplan: binary
smoke100: binary
height:continuous
weight: continuous
wtdesire: continuous
age: continuous
gender: binary

Exercise 2: Create a numerical summary for height and age, and compute the interquartile range for each. Compute the relative frequency distribution for gender and exerany. How many males are in the sample? What proportion of the sample reports being in excellent health?

Summary of cohort height:

summary(cdc$height)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   48.00   64.00   67.00   67.18   70.00   93.00

Summary of cohort age:

summary(cdc$age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.00   31.00   43.00   45.07   57.00   99.00

Interquartile Range, Height:

iqr_height=70-64
iqr_height

## [1] 6

Answer: 6

Interquartile Range, Age:

iqr_age=57-31
iqr_age

## [1] 26

Answer: 26

Relative Frequency Distribution, Gender:

gender=table(cdc$gender)
barplot(gender/20000)

Relative Frequency Distribution, Age:

exerany=table(cdc$exerany)
barplot(exerany/20000)

Males in the Sample:

table(cdc$gender)

## 
##     m     f 
##  9569 10431

Answer: 9569

Proportion in Excellent Health:

table(cdc$genhlth)[1]/20000

## excellent 
##   0.23285

Answer: 23.285%

Exercise 3: What does the mosaic plot reveal about smoking habits and gender?

table(cdc$gender,cdc$smoke100)

##    
##        0    1
##   m 4547 5022
##   f 6012 4419

mosaicplot(table(cdc$gender,cdc$smoke100),main="Mosaic Plot of Gender vs. smoke100")
title(ylab="Respondent Has Smoked More Than 100 Cigarettes in Lifetime", xlab="Respondent Gender")

Answer: The mosaic reveals that there is a difference in long-term smoking habits between men and women in the sample, namely that more men smoke 100 or more cigarettes in their lifetimes than women. The proportion of men responding as not having smoked 100 or more cigarettes in their lifetime is 4547/9569, or approximately 47.52%. By contrast, the proportion of women responding as not having smoked 100 or more cigarettes in their lifetime is 6012/10431, or approximately 57.64%.

Exercise 4: Create a new object called under23_and_smoke that contains all observations of respondents under the age of 23 that have smoked 100 cigarettes in their lifetime. Write the command you used to create the new object as the answer to this exercise.

The R Command:

under23_and_smoke = subset(cdc, smoke100==1 & age<23)

Exercise 5: What does this box plot show? Pick another categorical variable from the data set and see how it relates to BMI. List the variable you chose, why you might think it would have a relationship to BMI, and indicate what the figure seems to suggest.

The boxplot created by the command:

bmi <- (cdc$weight / cdc$height^2) * 703
boxplot(bmi ~ cdc$genhlth)

organizes respondents’ BMI’s by their category of response for the variable “genhlth”. Thus, the boxplot shows the relationship between a respondent’s BMI and their assesment of their overall health.

In the boxplots below, we examine the relationship between a respondent’s BMI and whether or not they have exercised at all in the past month. It makes sense that individuals who have exercised over the past month may have lower BMI’s on average than those who have not.

We generate the R code:

bmi = (cdc$weight/cdc$height^2)*703
boxplot(bmi ~ cdc$exerany,main="Relationship Between BMI and Monthly Exercise",xlab="Whether or Not Respondent Has Exercised in Past Month",ylab="Respondent BMI")

As suspected, the boxplots confirm that those who report not having exercised in the past month have higher BMI’s in general. This means that they tend to posess more weight per unit height on average.

On Your Own:

1. Make a scatterplot of weight versus desired weight. Describe the relationship between these two variables.

plot(cdc$weight, cdc$wtdesire,xlab="Weight",ylab="Desired Weight",main="Plot of Respondents' Actual Weights vs. Desired Weights")

Looking at the scatterplot, It appears that there is a non-linear relationship between an individuals actual weight and their desired weight. the data seems to fit better to an asymptotic logarithmic form. In other words, it appears that the heavier an individual is, the larger the disparity between their actual and desired weights.

2. Let’s consider a new variable: the difference between desired weight (wtdesire) and current weight (weight). Create this new variable by subtracting the two columns in the data frame and assigning them to a new object called wdiff.

We generate the R Code:

wdiff = abs(cdc$wtdesire-cdc$weight)

3. What type of data is wdiff? If an observation wdiff is 0, what does this mean about the person’s weight and desired weight. What if wdiff is positive or negative?

wdiff is a continuous variable. If its value is 0, this means that the respondent’s actual weight is the same as their desired weight. If wdiff is positive or negative, this means that the respondent is below or above their desired weight, respectively. In the code for the above column, however, wdiff is the absolute value of all differences, eliminating negative values.

4. Describe the distribution of wdiff in terms of its center, shape, and spread, including any plots you use. What does this tell us about how people feel about their current weight?

We begin by summarizing and plotting a histogram of the data overall, using the hist() function:

summary(wdiff)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00   10.00   17.11   25.00  500.00

hist(wdiff)

Because a very small proportion of the data is a difference of more than 100 pounds, these observations will be removed so as not to skew the presentation of the histogram. This is done by generating a subset of the initial wdiff data:

wdiff_no_outliers=subset(wdiff, wdiff<101)

Examining the new, cleaned data, we get a clearer picture of the majority of respondents:

summary(wdiff_no_outliers)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00   10.00   15.99   23.00  100.00

hist(wdiff_no_outliers)
abline(v=10,col='red')
abline (v=23, col='red')

The histogram generated in the above code includes two vertical red lines identifying the location of the median and the 3rd quartile. It reveals that most individuals are generally close to their target weight. The median individual is only 10 pounds off from their desired weight. Furthermore, 50% of respondents are within 10 pounds of their desired weight. 75% of respondents are within 23 pounds of their desired weight.

5. Using numerical summaries and a side-by-side box plot, determine if men tend to view their weight differently than women.

First, we create a subset of the cdc data which has had the same outliers removed as in question #4:

cdc_no_outlier=subset(cdc, abs(cdc$wtdesire-cdc$weight)<101)

Then, we create two subsets of this subset: one for males, and another for females

cdc_no_outlier_male=subset(cdc_no_outlier, gender=='m')
cdc_no_outlier_female=subset(cdc_no_outlier, gender=='f')

We use the two gender-differentiated subsets to create summary statistics of weight differences for males and females, respectively:

summary(abs(cdc_no_outlier_male$wtdesire-cdc_no_outlier_male$weight))

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00   10.00   13.96   20.00  100.00

summary(abs(cdc_no_outlier_female$wtdesire-cdc_no_outlier_female$weight))

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    2.00   10.00   17.87   25.00  100.00

We also generate a side-by-side boxplot of the summary statistics:

boxplot(abs(cdc_no_outlier$wtdesire-cdc_no_outlier$weight) ~ cdc_no_outlier$gender, main="Difference Between Actual and Desired Weight, by Gender", xlab="Gender",ylab="Difference Between Actual and Desired Weight")

The data and plots indicate that in general, men are closer to their desired weight than women. The larger interquartile range, as well as the higher 1st and 3rd quartile values of the boxplot for females relative to the boxplot for males suggests that disparity between actual and desired weight among females is more dispersd, over a higher bracketed range of values than it is for men. Additionally, outlier values in the female subset are more tightly distrubuted at higher values of weight disparity than in the male subset.

6. Now it’s time to get creative. Find the mean and standard deviation of weight and determine what proportion of the weights are within one standard deviation of the mean.

First, the mean weight and the standard deviation of weights in the original 20,000 observation data set are calculated and stored as variables wmean and wsd, respectively:

wmean = mean(cdc$weight)
wsd = sd(cdc$weight)
wmean

## [1] 169.683

wsd

## [1] 40.08097

Then, a subset of the full data set is constructed that contains only those entries for which weight falls within one standard deviation of the mean:

cdc_one_sd=subset(cdc, cdc$weight>(wmean-wsd) & cdc$weight<wmean+wsd)

Finally, the number of rows in this subset is divided by the number of rows in the original data set to determine the proportion of respondents with weights falling within one standard deviation of the mean:

nrow(cdc_one_sd)/nrow(cdc)

## [1] 0.7076

Therefore, 70.76% of respondents’ weights fall within one standard deviation of the mean weight.

Question 2: Generating a Chloropleth of Real GDP Contribution Per State, 2013

For this question, I will be graphing each state’s percent contribution to national GDP in the year 2013. Through this chloropleth, I hope to capture an up-to-date image of which states contributed the most (and least) to real GDP by the end of 2013.

First, we install all necessary packages:

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 3.1.2

library(maptools)

## Warning: package 'maptools' was built under R version 3.1.2

## Loading required package: sp

## Warning: package 'sp' was built under R version 3.1.2

## Checking rgeos availability: FALSE
##      Note: when rgeos is not available, polygon geometry     computations in maptools depend on gpclib,
##      which has a restricted licence. It is disabled by default;
##      to enable gpclib, type gpclibPermit()

library(maps)

## Warning: package 'maps' was built under R version 3.1.2

library(RColorBrewer)

## Warning: package 'RColorBrewer' was built under R version 3.1.2

library(ggmap)

## Warning: package 'ggmap' was built under R version 3.1.2

library(scales)

## Warning: package 'scales' was built under R version 3.1.2

Then, we import a shapefile for the USA using map_data:

states_map=map_data("state")

Then, we import the data we will be using to fill the chloropleth, which in this case is each state’s percentage contribution to national GDP in the year 2013. We import a csv file using the read.csv() function, and create a new data frame from it called “gdp”. We need to clean the data so that it matches the shapefile data. So, we switch all state names to lower case using tolower():

CSV_Darby <- read.csv("C:/Users/Public/Desktop/CSV_Darby.csv")
gdp=CSV_Darby
gdp[,1]=tolower(gdp[,1])
head(gdp)

##           State Current.Dollar.GDP
## 1  connecticut                 1.5
## 2         maine                0.3
## 3 massachusetts                2.7
## 4 new hampshire                0.4
## 5  rhode island                0.3
## 6       vermont                0.2

Using names() on the data frame “gdp”, we change the column header for “State” to “state”, so that it will work when passed to the map_id argument in ggplot(). Then, we construct a plot:

names(gdp)[1]="state"
gdp_map=ggplot(gdp, aes(map_id = state)) +
  geom_map(aes(fill = Current.Dollar.GDP), map = states_map, color="black", size=0.5) +
  expand_limits(x = states_map$long, y = states_map$lat) 

gdp_map

The resulting map has a color scheme that is difficult to read, as well as bad labeling and an odd scale. So we will change this bit by bit.

First,we fix the ratio of x-to-y in the display:

gdp_map=gdp_map+coord_fixed(ratio = 1.2)

gdp_map

Then, we add a more appealing color scheme–I thought green and yellow befit the greenback and gold origins of GDP. Furthermore, to flip the gradient to read from lowest to highest, and also to add in more gradation of the legend, we make use of pretty_breaks() from the scales package:

gdp_map=gdp_map+scale_fill_distiller(palette="YlGn", breaks=pretty_breaks(n=10))+guides(fill = guide_legend(reverse = TRUE))

gdp_map

We should probably add a title to the graph:

gdp_map=gdp_map+labs(title="Percentage Contribution to National GDP by State, 2013", fill="")

gdp_map

Finally, we want to remove the grid and axes labels, using theme_nothing() from the ggmap package:

gdp_map=gdp_map+theme_nothing(legend=TRUE)

gdp_map

Question 3:

We begin by importing the dataset:

Senate <- read.csv("C:/Users/Public/Desktop/Senate.csv")

Then, we extract a subset of it containing only elections in which an incumbent faced a challenger, throwing away those obervations in which there was either no incumbent or no challenger to the incumbent:

elections = subset(Senate, (incumb==1 | incumb==-1) & (0<vote & vote<100))

At this point we must also convert the data type of some variables into class numeric:

elections[,36]=as.numeric(as.character(elections[,36]))
elections[,34]=as.numeric(as.character(elections[,34]))

Finally, we wish to remove “FALSE” as a possible level for the variable qual_differential:

elections$qual_differential=droplevels(elections$qual_differential)

We see that in the restricted dataset, we have 424 elections matching the criterion. These will be used to construct a boxplot, in which incumbent-party vote shares are graphed against the difference between incumbent quality and challenger quality, as measured by the difference between variables “dcandqual” or “rcandqual” in John Sides’ dataset. By Sides’ construct, incumbents are automatically awarded the maximum quality score of 5, and so the worst case scenario is a differential of 1 with their challenger (meaning the challenger also has the maximum challenger quality score of 4).

In order to facilitate the graph, a few variables were generated within excel:

incumbent_win: whether the incumbent candidate won his/her bid for re-election
vote_incumbent_party: what vote share went to the incumbent candidate’s party
qual_differential: The difference in quality between the incumbent candidate and his/her challenger (always greater than or equal to 1)

While I spent several hours banging my head against a wall trying to generate these variables with conditional for-loops in RStudio, I eventually gave up on learning the syntax and prepped the data on the front-end in Excel instead, with nested “if”" conditions. These can be provided if desired.

Now, we generate the initial boxplot:

boxplot(elections$vote_incumbent_party~elections$qual_differential, xlab="Number of Quality Points by Which Incumbent Exceeds Challenger", ylab="Percent of Vote Received by Incumbent's Party", main="Boxplots: Percent of Vote Received by Incumbent vs. \nDiscrepancy in Incumbent and Challenger Quality")
axis(side=2,at=seq(min(elections$vote_incumbent_party),max(elections$vote_incumbent_party),by=5))

We also generate a data frame subset of “elections” including only those elections in which the incumbent lost. For this subset of election races, we make a frequency distribution of quality_differential:

losers=subset(elections,incumbent_win==0)
losers$qual_differential=as.numeric(as.character(losers$qual_differential))
hist(losers$qual_differential,col='grey', breaks=c(0.5,1.5,2.5,3.5,4.5,5.5), xlab="Number of Quality Points by Which Incumbent Exceeds Challenger", ylab="Number of Losing Incumbent Candidates",main="Distribution of Losing Incumbents by Quality Differential")
axis(side=2,at=seq(0,25,by=5))

Observing these two graphs, an answer to the original question seems apparent. As the boxplots show, an incumbent’s percent share of the vote increases if they exceed the quality of their opponent by four or five points. By comparison, the median vote share of incumbents who only exceed their oppoenent’s quality by one to three points is dramatically lower. This suggests that the quality of an incumbent’s challenger in a given election can have a significant impact on the incumbent’s vote totals.

This result is corroborated by the histogram, which confirms that while all incumbents are capable of losing, a disproportionate amount of losing incumbents failed in elections where they only exceeded their opponent’s quality by one or two points. This explains the positive skew in the histogram. With more a larger number of quality bins than the five that Sides provides, we might well see a smoother histogram with a more clear positive skew than the current data affords us.

Question 4:Brief Description of Final Project

For my final project, I would ideally like to go beyond parametric modelling methods, since I have already been acquainted with these in econometrics. Rather, I would like to try to use non-parametric methods to create forecasting models for binary classification problems. While the range of non-parametric methods I could employ to solve binary classification problems is large, I intend to focus on one particular class of models called “support vector machines” since I am already underway on learning about these machines in another class. Essentially, this is a classification technique whereby data points are mapped onto progressively higher-dimension spaces until a hyperplane can be constructed that linearly (or non-linearly) separates the points. From there, the classification problem becomes a matter of mapping untrained data points into the high-dimensional space and observing which side of the hyperplane they fall on. There are a number of binary classification problems in politics that might be relevant to consider:

Will a given candidate will win or lose a specific election?
Will a given actor in the international scene will pursue a hawk or dove strategy with regards to trade, war, etc?

While I am not yet decided on which classification problem to tackle, the right choice will probably be something where classification is not intuitively obvious, but also a situation where it is not so complex as to elude prediction by a model.

With regards to data, I anticipate that most of my needs will be political or economic in nature. On the former subject, the Vital Statistics on Congress via Brookings will probably be an invaluable resource–I have perused the site in a cursory way, and it is evident that they have a good depth of data on politicians’ demographics, as well as a number of other facts. On the subject of economics, the World Bank database is useful, as are two resources that can be procured through GW Libraries: The Econlit Database, and the IFS Database (International Financial Statistics). Outside of GW’s Library Resources, I will also be able to get economic data from the Federal Reserve Banks, as well as the Bureau of Economic Analysis (BEA).

John Northington Visualization Assignment #1