Computer Assignment 2: Frequency Distributions

1. Load Libraries, Set Your Working Directory, & Load Data

Load Libraries:

library(dplyr)         # for manipulating data
library(ggplot2)       # for making graphs
library(knitr)         # for nicer table formatting
library(summarytools)  # for frequency distribution tables

Set your working directory, where the folder “Datasets” is located:

setwd("C:/Users/chenk/OneDrive/Documents/Spring 2020/PMAP 4041/Datasets/Class3")

Load the gss98 data set from the file “Datasets/gss98.RData” into R using the load(file = "Datasets/gss98.RData") command. Before you run the command, make sure you have set the working directory correctly (the folder “Datasets” should be in your working directory).

load("C:/Users/chenk/OneDrive/Documents/Spring 2020/PMAP 4041/Datasets/Class3/gss98.Rdata")

2. Interpreting Frequency Distributions: `RELIG`

Generate a frequency distribution for RELIG using the summarytools::freq(gss98$RELIG) command:

freq(gss98$RELIG)

## Frequencies  
## gss98$RELIG  
## Type: Factor  
## 
##                    Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ---------------- ------ --------- -------------- --------- --------------
##         Catholic    250     25.30          25.30     25.00          25.00
##           Jewish     20      2.02          27.33      2.00          27.00
##             None    144     14.57          41.90     14.40          41.40
##            Other     39      3.95          45.85      3.90          45.30
##       Protestant    535     54.15         100.00     53.50          98.80
##             <NA>     12                               1.20         100.00
##            Total   1000    100.00         100.00    100.00         100.00

Generate a bar chart for RELIG using gss98 %>% ggplot() + geom_bar( aes(x = RELIG), fill = "darkred" ) (pick your own color for the chart):

gss98 %>% ggplot() + geom_bar( aes(x = RELIG), fill = "purple" ) + labs(title = "Bar plot for RELIG")

QUESTIONS:

A. How many people in this data set are Protestants? Catholics? Jews?

There are 535 Protestants, 250 Catholics and 20 Jewish people.

B. What percentage of all respondents have no religion? What proportion have no religion? How were both numbers calculated?

Of all respondents, 14.40% have no religion, which is a proportion of 0.144. Of the respondents who gave a valid answer, 14.57% have no religion, a proportion of 0.1457.

The valid percentage divides the 144 respondents with no religion by the 988 respondents who gave a valid answer, excluding the 12 missing cases (144/988 = .1457). The total percentage divides the same 144 by all 1000 respondents, including the 12 missing cases (144/1000 = .1440).
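The two percentages can be reproduced by hand from the counts in the frequency table. The following is a minimal sketch in base R; the object names (relig_counts, n_missing, and so on) are made up for illustration, and the counts are typed in from the freq() output rather than read from gss98.

```r
# Counts copied from the freq(gss98$RELIG) output above
relig_counts <- c(Catholic = 250, Jewish = 20, None = 144,
                  Other = 39, Protestant = 535)
n_missing <- 12                       # the <NA> row

n_valid <- sum(relig_counts)          # respondents with a valid answer (988)
n_total <- n_valid + n_missing        # all respondents (1000)

# Same numerator, different denominators
prop_valid <- unname(relig_counts["None"]) / n_valid
prop_total <- unname(relig_counts["None"]) / n_total

# Percentages, rounded the way freq() reports them
round(100 * c(valid = prop_valid, total = prop_total), 2)
```

The only difference between the "% Valid" and "% Total" columns is whether the missing cases are counted in the denominator.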

C. What advantage(s) and disadvantage(s) do you see to presenting a bar chart in place of a frequency table?

Advantages: the graph shows the differences in frequency in a readable format, which makes it easy to see how far the Protestant category exceeds every other choice. Color also makes the chart more engaging than a table.

Disadvantages: the bars are in alphabetical order, so sorting the categories from highest to lowest frequency would improve the readability of the graph. The chart also does not show exact counts or percentages the way the table does.
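Sorting the bars from highest to lowest could be sketched as follows. This is one possible approach, not part of the assignment: it rebuilds the counts by hand in a hypothetical data frame called relig_df, reorders the factor levels with base R's reorder(), and only draws the plot if ggplot2 is installed.

```r
# Counts copied from the freq(gss98$RELIG) output (missing cases left out)
relig_counts <- c(Catholic = 250, Jewish = 20, None = 144,
                  Other = 39, Protestant = 535)
relig_df <- data.frame(RELIG = names(relig_counts),
                       n = as.numeric(relig_counts),
                       stringsAsFactors = FALSE)

# reorder() sorts the levels by the second argument;
# the minus sign puts the largest category first
relig_df$RELIG <- reorder(relig_df$RELIG, -relig_df$n)
levels(relig_df$RELIG)

# Plot only when ggplot2 is available (geom_col() uses the precomputed counts)
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(relig_df) +
    geom_col(aes(x = RELIG, y = n), fill = "purple") +
    labs(title = "Bar plot for RELIG, sorted by frequency", y = "Count")
  print(p)
}
```

Because the x-axis of a bar chart follows the factor's level order, changing the levels is enough to change the order of the bars.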

3. Interpreting Frequency Distribution: `FEPRESCH`

Generate a frequency distribution for FEPRESCH using the freq(DATASET_NAME$VARIABLE_NAME, round.digits = 1) command (hint: replace DATASET_NAME$VARIABLE_NAME with gss98$FEPRESCH as in the previous example):

freq(gss98$FEPRESCH, round.digits = 1)

## Frequencies  
## gss98$FEPRESCH  
## Type: Factor  
## 
##                         Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## --------------------- ------ --------- -------------- --------- --------------
##        Strongly agree     45       7.0            7.0       4.5            4.5
##                 Agree    215      33.6           40.7      21.5           26.0
##              Disagree    324      50.7           91.4      32.4           58.4
##       Strong disagree     55       8.6          100.0       5.5           63.9
##                  <NA>    361                               36.1          100.0
##                 Total   1000     100.0          100.0     100.0          100.0

Generate a frequency graph for FEPRESCH (hint: use geom_bar() as in the previous example). Pick your own color and don’t forget the title:

gss98 %>% ggplot() + geom_bar( aes( x = FEPRESCH), fill = "DARKRED") + labs(title = "FEPRESCH")

QUESTIONS:

A. Use the codebook for the survey to find the exact question wording for the variable FEPRESCH. Copy it into your answer (You can cut and paste.)

A preschool child is likely to suffer if his or her mother works.

B. How many people in this data set strongly agree with this statement? What percentage of all respondents strongly agree with this statement?

45 people in this data set strongly agree with the statement, which is 4.5% of all respondents (45/1000).

C. What percentage of the respondents who gave valid responses strongly agree with this statement? How was this number calculated? Why is this answer different from that in question 3B? Which percentage is most meaningful in this case - the “percent” or the “valid percent”? Why?

7.0% of the respondents who gave valid responses strongly agree. This number is the count of people who strongly agree divided by the total number of valid responses (45/639).

This answer is different from 3B because the valid percentage excludes the 361 missing cases, that is, respondents who did not answer, did not choose one of the four responses, or were not asked the question.

When this many respondents did not select one of the four responses, the valid percent is more meaningful, because it describes the distribution of opinion among the people who actually answered the question.

D. How many missing cases are there?

361 were missing cases.

E. What does the 40.7 in the “Cum Percent” column mean? What is the absolute frequency who agreed or strongly agreed? What percentage disagreed or strongly disagreed? What is the absolute frequency who disagreed or strongly disagreed? (Show your work.)

The 40.7% is the cumulative percentage of valid responses, obtained by adding the 7.0% who strongly agree to the 33.6% who agree. (The rounded figures sum to 40.6; the table shows 40.7 because it accumulates the unrounded values, 260/639 = .4069.)

In words, 40.7% of valid respondents agree or strongly agree with the FEPRESCH statement.

The absolute frequency of respondents who agreed or strongly agreed is 260 (45 strongly agree + 215 agree).

The absolute frequency of respondents who disagreed or strongly disagreed is 379 (324 disagree + 55 strongly disagree). That is 59.3% of valid responses (379/639, or 100 - 40.7) and 37.9% of all respondents (379/1000).
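The cumulative column and the combined frequencies can be checked with a few lines of base R. This is a sketch that types the counts in from the freq() output; the object names are invented for illustration.

```r
# Counts copied from the freq(gss98$FEPRESCH) output, in their original order
fepresch_counts <- c("Strongly agree" = 45, "Agree" = 215,
                     "Disagree" = 324, "Strong disagree" = 55)
n_missing <- 361

n_valid <- sum(fepresch_counts)       # 639 valid responses
n_total <- n_valid + n_missing        # 1000 respondents

# Valid percentages and their running total ("% Valid Cum.")
valid_pct     <- 100 * fepresch_counts / n_valid
cum_valid_pct <- cumsum(valid_pct)
round(cum_valid_pct, 1)

# Combined absolute frequencies
agree_n    <- sum(fepresch_counts[c("Strongly agree", "Agree")])
disagree_n <- sum(fepresch_counts[c("Disagree", "Strong disagree")])

# Disagree or strongly disagree, as a share of valid and of all respondents
round(100 * disagree_n / c(valid = n_valid, total = n_total), 1)
```

cumsum() works on the unrounded percentages, which is why the second cumulative value rounds to 40.7 rather than 40.6.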

F. Interpret the bar plot for the variable FEPRESCH. Why did I ask you to plot a bar chart and not a histogram for this variable?

A histogram is best used to display continuous numerical data, such as grades or heights, where values are grouped into intervals.

A bar chart is the better choice here because FEPRESCH is categorical: each bar counts the number of respondents who chose one of a small set of discrete response categories.

4. Variable Types

QUESTION:

A. Are relig and fepresch nominal level, ordinal level, or interval level variables? How do you know? Write the names of at least two more of each type of variable in the data set.

RELIG is a nominal variable: its categories have no inherent order. FEPRESCH is an ordinal variable: its responses have a natural order, from strongly agree to strongly disagree. Sex is a nominal variable, because its categories have no order. Income is an ordinal variable, because its categories can be ranked from lowest to highest.
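In R, the nominal versus ordinal distinction corresponds roughly to an unordered versus an ordered factor. The sketch below uses a few made-up values rather than the gss98 data. Note that freq() reports FEPRESCH as a plain "Factor" in this data set, so the ordered version is an assumption used only to illustrate the idea.

```r
# Hypothetical example values, not taken from gss98
relig <- factor(c("Catholic", "None", "Protestant"))

# An ordered factor records the ranking of the response categories
fepresch <- factor(c("Agree", "Strongly agree", "Disagree"),
                   levels = c("Strongly agree", "Agree",
                              "Disagree", "Strong disagree"),
                   ordered = TRUE)

is.ordered(relig)          # nominal: no ordering is stored
is.ordered(fepresch)       # ordinal: levels are ranked

# Comparisons are only meaningful for the ordered factor
fepresch[1] < fepresch[3]  # "Agree" comes before "Disagree"
```

Comparing two values of the unordered factor with < returns NA with a warning, which mirrors the fact that nominal categories cannot be ranked.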

5. Comparisons

Have R produce frequency distributions for several variables having to do with confidence in U.S. institutions: CONCLERG, CONEDUC, CONFED, CONJUDGE, CONLEGIS, and CONPRESS.

freq(gss98$CONCLERG)

## Frequencies  
## gss98$CONCLERG  
## Type: Factor  
## 
##                          Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ---------------------- ------ --------- -------------- --------- --------------
##       great confidence    196     29.74          29.74     19.60          19.60
##        some confidence    335     50.83          80.58     33.50          53.10
##       Hardly confidnce    128     19.42         100.00     12.80          65.90
##                   <NA>    341                              34.10         100.00
##                  Total   1000    100.00         100.00    100.00         100.00

freq(gss98$CONEDUC)

## Frequencies  
## gss98$CONEDUC  
## Type: Factor  
## 
##                          Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ---------------------- ------ --------- -------------- --------- --------------
##       great confidence    193     28.59          28.59     19.30          19.30
##        some confidence    381     56.44          85.04     38.10          57.40
##       Hardly confidnce    101     14.96         100.00     10.10          67.50
##                   <NA>    325                              32.50         100.00
##                  Total   1000    100.00         100.00    100.00         100.00

freq(gss98$CONFED)

## Frequencies  
## gss98$CONFED  
## Type: Factor  
## 
##                          Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ---------------------- ------ --------- -------------- --------- --------------
##       great confidence    103     15.61          15.61     10.30          10.30
##        some confidence    323     48.94          64.55     32.30          42.60
##       Hardly confidnce    234     35.45         100.00     23.40          66.00
##                   <NA>    340                              34.00         100.00
##                  Total   1000    100.00         100.00    100.00         100.00

freq(gss98$CONJUDGE)

## Frequencies  
## gss98$CONJUDGE  
## Type: Factor  
## 
##                          Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ---------------------- ------ --------- -------------- --------- --------------
##       great confidence    216     33.38          33.38     21.60          21.60
##        some confidence    347     53.63          87.02     34.70          56.30
##       Hardly confidnce     84     12.98         100.00      8.40          64.70
##                   <NA>    353                              35.30         100.00
##                  Total   1000    100.00         100.00    100.00         100.00

freq(gss98$CONLEGIS)

## Frequencies  
## gss98$CONLEGIS  
## Type: Factor  
## 
##                          Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ---------------------- ------ --------- -------------- --------- --------------
##       great confidence     67     10.18          10.18      6.70           6.70
##        some confidence    386     58.66          68.84     38.60          45.30
##       Hardly confidnce    205     31.16         100.00     20.50          65.80
##                   <NA>    342                              34.20         100.00
##                  Total   1000    100.00         100.00    100.00         100.00

freq(gss98$CONPRESS)

## Frequencies  
## gss98$CONPRESS  
## Type: Factor  
## 
##                          Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ---------------------- ------ --------- -------------- --------- --------------
##       great confidence     59      8.85           8.85      5.90           5.90
##        some confidence    341     51.12          59.97     34.10          40.00
##       Hardly confidnce    267     40.03         100.00     26.70          66.70
##                   <NA>    333                              33.30         100.00
##                  Total   1000    100.00         100.00    100.00         100.00

QUESTIONS:

A. Use the codebook for the survey to find the exact question wording for each variable. Type your answer below:

[VAR: CONCLERG] -- CONFIDENCE IN ORGANIZED RELIGION
[VAR: CONEDUC] -- CONFIDENCE IN EDUCATION
[VAR: CONFED] -- CONFID. IN EXEC BRANCH OF FED GOVT
[VAR: CONJUDGE] -- CONFID. IN UNITED STATES SUPREME COURT
[VAR: CONLEGIS] -- CONFIDENCE IN CONGRESS
[VAR: CONPRESS] -- CONFIDENCE IN PRESS

B. The following commands extract the second column (% valid) from each frequency table above to construct a table comparing confidence in the six institutions. Rank order the six institutions from the one that Americans have the most confidence in to the one they have the least confidence in. Does it make any difference whether you rank order the institutions by the “great confidence” or the “hardly any confidence” percentages?

 data.frame( "CONCLERG" = freq(gss98$CONCLERG)[,2], 
             "CONEDUC" = freq(gss98$CONEDUC)[,2],
             "CONFED" = freq(gss98$CONFED)[,2],
             "CONJUDGE" = freq(gss98$CONJUDGE)[,2],
             "CONLEGIS" = freq(gss98$CONLEGIS)[,2],
             "CONPRESS" = freq(gss98$CONPRESS)[,2] ) %>% kable(digits = 1, caption = "Percent (%) Valid Responses")

Percent (%) Valid Responses

                   CONCLERG   CONEDUC   CONFED   CONJUDGE   CONLEGIS   CONPRESS
great confidence       29.7      28.6     15.6       33.4       10.2        8.8
some confidence        50.8      56.4     48.9       53.6       58.7       51.1
Hardly confidnce       19.4      15.0     35.5       13.0       31.2       40.0
<NA>                     NA        NA       NA         NA         NA         NA
Total                 100.0     100.0    100.0      100.0      100.0      100.0

Ranking from greatest to least in terms of “great confidence” / greatest to least in terms of “hardly confidence”

Conjudge / Conpress
Conclerg / Confed
Coneduc / Conlegis
Confed / Conclerg
Conlegis / Coneduc
Conpress / Conjudge

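It makes a small difference. The institutions ranked highest on “great confidence” are generally ranked lowest on “hardly any confidence”, so the two orderings are roughly inverses of each other: the Supreme Court is the most trusted and the press the least trusted either way. The rankings are not exact mirror images, however. Organized religion ranks above education on “great confidence” (29.7% vs. 28.6%) but also has more “hardly any confidence” responses (19.4% vs. 15.0%), and the executive branch ranks above Congress on “great confidence” (15.6% vs. 10.2%) while drawing more “hardly any confidence” responses (35.5% vs. 31.2%).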
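The two rankings can be compared directly with order(). The following sketch types in the valid percentages from the table above; the data frame name conf and its column names are made up for illustration.

```r
# Valid percentages copied from the comparison table
conf <- data.frame(
  institution = c("CONCLERG", "CONEDUC", "CONFED",
                  "CONJUDGE", "CONLEGIS", "CONPRESS"),
  great  = c(29.7, 28.6, 15.6, 33.4, 10.2, 8.8),
  hardly = c(19.4, 15.0, 35.5, 13.0, 31.2, 40.0),
  stringsAsFactors = FALSE
)

# Most to least confidence: highest "great" first, lowest "hardly" first
rank_by_great  <- conf$institution[order(-conf$great)]
rank_by_hardly <- conf$institution[order(conf$hardly)]

data.frame(rank = 1:6, by_great = rank_by_great, by_hardly = rank_by_hardly)
```

Printing the two columns side by side makes it easy to see where the orderings agree (first and last place) and where adjacent institutions trade places.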

C. Write a short paragraph describing what you learn from the table. How much confidence do Americans seem to have in these institutions? Where do they place the greatest confidence? Use some percentages in your paragraph to make your points more explicit.

Among these institutions, the sample shows that Americans had the most confidence in the Supreme Court when the survey was conducted: 87% of valid respondents reported either "great confidence" or "some confidence" in it (33.4% + 53.6%).

Respondents also lack confidence in the legislature (31.2% "hardly" confident), but they are most dissatisfied with the press and the executive branch, where 40.0% and 35.5% of valid respondents, respectively, reported hardly any confidence.

6. Histograms

Now, load a random sample of 500 observations from the 2000 U.S. Census Data from the file “Datasets/census.csv” into R using the read.csv("Datasets/census.csv") command. Name the object census:

census <- read.csv("C:/Users/chenk/OneDrive/Documents/Spring 2020/PMAP 4041/Datasets/Class3/census.csv")

See the names and types of the variables in the dataset using names() and str() commands:

names(census)

## [1] "census_year"           "state_fips_code"       "total_family_income"  
## [4] "age"                   "sex"                   "race_general"         
## [7] "marital_status"        "total_personal_income"

str(census)

## 'data.frame':    500 obs. of  8 variables:
##  $ census_year          : int  2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 ...
##  $ state_fips_code      : Factor w/ 47 levels "Alabama","Arizona",..: 9 9 9 9 9 9 9 9 9 9 ...
##  $ total_family_income  : int  14550 22800 0 23000 48000 74000 23000 74000 60000 14600 ...
##  $ age                  : int  44 20 20 6 55 43 60 47 54 58 ...
##  $ sex                  : Factor w/ 2 levels "Female","Male": 2 1 2 1 2 1 1 1 1 1 ...
##  $ race_general         : Factor w/ 8 levels "American Indian or Alaska Native",..: 7 8 2 8 8 8 8 8 2 8 ...
##  $ marital_status       : Factor w/ 6 levels "Divorced","Married/spouse absent",..: 3 4 4 4 3 3 3 3 3 6 ...
##  $ total_personal_income: int  0 13000 20000 NA 36000 27000 11800 48000 40000 14600 ...

QUESTIONS:

Describe the following variables, their types (levels of measurement), and appropriate type of frequency distribution graph:

[VAR: marital_status] -- This variable describes whether a person is single, married, separated, divorced, or widowed. It is a categorical, nominal variable and is best displayed with a bar graph.

[VAR: sex] -- This variable describes whether a person is male or female. It is a categorical, nominal variable and is best displayed with a bar graph.

[VAR: age] -- This variable describes how old a person is in years. It is a numerical, discrete variable and is best displayed with a histogram.

[VAR: total_personal_income] -- This variable describes how much money a person makes. It is stored as an integer dollar amount, so it is a numerical variable rather than a categorical one, and it is best displayed with a histogram.

Graph the frequency distribution for age:

census %>% ggplot(mapping = aes(x = age)) + geom_histogram( fill = "darkred" )

Describe the distribution:

The distribution is bimodal and skewed right. The center of the distribution is around age 40, which is also where the central peak sits. There is wide variability, with ages ranging from 0 to 80, and a large portion of people falls between ages 5 and 40.

Graph the frequency distribution for total_personal_income using x = total_personal_income/1000 in ggplot command. Add a title and pick your own color:

 census %>% ggplot(mapping = aes(x = total_personal_income/1000)) + geom_histogram(fill = "darkgreen") + labs(title = "Personal Income")

Describe the distribution:

This distribution is unimodal and skewed right. The center is at about 25-35 thousand dollars, with many people in that range and a peak at about 30 thousand. Most of the values are concentrated between 0 and 125 thousand dollars, so the bulk of the distribution shows little variability apart from the long right tail.
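One quick numerical check for right skew is to compare the mean with the median: a long right tail pulls the mean above the median. The sketch below uses a short vector of invented incomes, not the census data, purely to illustrate the comparison.

```r
# Hypothetical incomes in thousands of dollars, invented for illustration only
income_k <- c(0, 8, 12, 15, 20, 24, 27, 30, 36, 40, 48, 125)

mean(income_k)     # pulled upward by the single large value
median(income_k)   # less sensitive to the long right tail

# A mean above the median is consistent with a right-skewed distribution
mean(income_k) > median(income_k)
```

With the real data, the same comparison could be run on census$total_personal_income / 1000 after adding na.rm = TRUE to both calls, since that column contains missing values.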

7. Submission

Save your RMarkdown file, Knit an html report, and publish it on RPubs or save as a pdf file. Submit the link to the html or your pdf in the dropbox on iCollege.

Complete version of this assignment (only graphs and tables to check your work) is here:

Complete Assignment

Computer Assignment 2: Frequency Distributions

Kevin Chen

February 10 2020
