Project 2

Eunkyu Hahm

10/4/2019

Loading packages

First, we load the packages used for tidying and transforming the data.

## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Three different ‘wide’ datasets

Dataset 1. Zika

  1. Reading .csv file

Let’s read the .csv file. The variable “value” is not read in as numeric, so we use transform() to convert it to the numeric class.

## Warning in eval(substitute(list(...)), `_data`, parent.frame()): NAs
## introduced by coercion
## [1] "numeric"
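As a minimal sketch of this step, the conversion could look like the following. The inline CSV text, the column names, and the object names `zika_raw` and `zika` are assumptions made for illustration; the real file is not shown in this report. Entries that are not valid numbers become NA, which is what produces the "NAs introduced by coercion" warning.

```r
# Toy stand-in for the Zika file; the real file and its contents may differ.
csv_text <- "report_date,location,data_field,value
2016-03-19,United_States-New_York,zika_reported_travel,12
2016-03-19,United_States-Puerto_Rico,zika_reported_local,5*
2016-03-26,United_States-Florida,zika_reported_travel,"

zika_raw <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# "value" is read as character because of the stray "5*" entry.
# as.numeric() turns anything it cannot parse into NA (with a warning).
zika <- transform(zika_raw, value = as.numeric(value))

class(zika$value)
```

Checking `sum(is.na(zika$value))` afterwards is a quick way to see how many entries were lost in the conversion.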

Question:
A. Which US state or territory has the highest Zika report count?
B. What does the US Zika report data look like by month in 2016?

  1. Modify the table
  1. Use select() to extract the necessary variables from the table. We need report_date, location, data_field, and value.
  1. Separate the report_date variable into three columns: year, month, and day

For the monthly analysis of 2016, we use the separate() function to split report_date on “-”. The raw data contains two date formats, m/d/y and y_m_d, so we first convert both to y-m-d and then separate. After the split, some values in the year variable still need correcting; for example, “0015” is replaced with “2015”.

## Warning in NextMethod(.Generic): number of items to replace is not a
## multiple of replacement length
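One way to sketch the date normalisation is shown below. It assumes the m/d/y dates use a two-digit year (which would explain values such as “0015”); the sample dates and the helper names `is_mdy` and `iso` are hypothetical. Parsing those dates with `%y` yields a four-digit year directly, so no replacement is needed after the split.

```r
library(tidyr)

# Toy dates in the two assumed raw formats.
zika <- data.frame(
  report_date = c("2016-03-19", "2016_04_02", "6/27/15"),
  stringsAsFactors = FALSE
)

# Rows written as m/d/y contain a slash.
is_mdy <- grepl("/", zika$report_date)

# y_m_d becomes y-m-d by swapping the separator.
iso <- gsub("_", "-", zika$report_date)

# m/d/y is parsed as a Date (two-digit year) and printed back as y-m-d.
iso[is_mdy] <- format(as.Date(zika$report_date[is_mdy], format = "%m/%d/%y"))

zika$report_date <- iso
zika_sep <- separate(zika, report_date,
                     into = c("year", "month", "day"), sep = "-")
zika_sep
```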
  1. Separate location into two variables

Because we focus on US data only, we use separate() once more to split the location variable into two variables, country and state. We then keep only the rows where country equals “United_States” (this includes both states and US territories).

## Warning: Expected 2 pieces. Additional pieces discarded in 88610 rows
## [6305, 6306, 6307, 6308, 6309, 6310, 6311, 6312, 6313, 6314, 6315, 6316,
## 6317, 6318, 6319, 6320, 6321, 6322, 6323, 6324, ...].
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 1584 rows
## [6074, 6082, 6092, 6097, 6106, 6107, 6115, 6125, 6130, 6139, 6140, 6148,
## 6158, 6163, 6172, 6173, 6181, 6191, 6196, 6205, ...].
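The two warnings above come from locations that have more or fewer than two pieces. A hedged sketch of the split is given below, using the `extra` and `fill` arguments of separate() to state explicitly what should happen in those cases. The sample locations are made up for illustration, and whether extra pieces should be merged or dropped is a choice for the analyst; for US rows the distinction rarely matters because they are filtered on country only.

```r
library(dplyr)
library(tidyr)

# Hypothetical locations with two, three, and one piece.
locs <- data.frame(
  location = c("United_States-Puerto_Rico", "United_States-New_York",
               "Brazil-Acre-Rio_Branco", "Argentina"),
  stringsAsFactors = FALSE
)

us <- locs %>%
  separate(location, into = c("country", "state"), sep = "-",
           extra = "merge",   # keep additional pieces inside "state"
           fill = "right") %>% # missing "state" becomes NA, no warning
  filter(country == "United_States")

us
```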

Analysis

  1. Total number of Zika reports, including states and US territories

Now we calculate the total number of reported cases (local and travel combined). First we group_by() state, then compute the total number of cases for each state.

## Warning in data_field == c("zika_reported_local", "zika_reported_travel"):
## longer object length is not a multiple of shorter object length

Based on the table above, Puerto Rico, a US territory, has the most reported Zika cases.
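The warning shown for this step suggests that data_field was compared with a two-element vector using `==`. In R that comparison recycles the shorter vector element by element, so some matching rows can be silently dropped. A minimal sketch of the same aggregation using `%in%` instead is given below; the toy data frame and the names `zika_us` and `totals` are assumptions for illustration, not the values from the report.

```r
library(dplyr)

# Made-up rows; the real counts are not reproduced here.
zika_us <- data.frame(
  state = c("Puerto_Rico", "Puerto_Rico", "New_York", "New_York", "Florida"),
  data_field = c("zika_reported_local", "zika_reported_travel",
                 "zika_reported_travel", "zika_reported_local",
                 "yearly_reported_travel_cases"),
  value = c(100, 5, 40, 0, 7),
  stringsAsFactors = FALSE
)

totals <- zika_us %>%
  # %in% tests membership row by row, with no recycling.
  filter(data_field %in% c("zika_reported_local", "zika_reported_travel")) %>%
  group_by(state) %>%
  summarise(total = sum(value, na.rm = TRUE)) %>%
  arrange(desc(total))

totals
```

Totals computed with `==` and with `%in%` can differ, so it may be worth rechecking the ranking after this change.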

  1. Total number of reported Zika cases, states only

Next, we compare the total cases by state to find which US state has the most cases. The usmap package is used to draw the map here.

  1. Zika local cases vs. cases after travel, by state and US territory

Now we split the reports into two categories: reported locally and reported after travel. A ggplot2 bar chart is used to compare all states and territories.

According to the two bar charts, no locally acquired Zika cases were reported on the US mainland in 2016. All locally reported cases come from US territories: Puerto Rico, American Samoa, and the US Virgin Islands. Cases reported after travel, however, are distributed across the whole US, and New York is the state with the most of them.

  1. Comparison of total Zika cases by month

Let’s analyze the data by month and look at the trend. A line plot is used to compare the two types of report.

## Warning in data_field == c("zika_reported_local", "zika_reported_travel"):
## longer object length is not a multiple of shorter object length
## Warning: Ignoring unknown aesthetics: coler
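The "unknown aesthetics: coler" warning indicates a misspelled aesthetic name, which ggplot2 ignores. A small sketch of a monthly line plot with the aesthetic spelled `colour` follows. The monthly values and the object names are invented for illustration and do not reproduce the report's data.

```r
library(dplyr)
library(ggplot2)

# Made-up monthly rows for the two report types.
zika_us <- data.frame(
  month = c("02", "02", "03", "03", "04", "04"),
  data_field = rep(c("zika_reported_local", "zika_reported_travel"), times = 3),
  value = c(10, 4, 25, 9, 40, 15),
  stringsAsFactors = FALSE
)

monthly <- zika_us %>%
  group_by(month, data_field) %>%
  summarise(total = sum(value, na.rm = TRUE), .groups = "drop")

# "colour" (or "color") is the aesthetic name; "group" joins the points
# of each report type into one line because month is a discrete axis.
p <- ggplot(monthly, aes(x = month, y = total,
                         colour = data_field, group = data_field)) +
  geom_line() +
  geom_point() +
  labs(x = "Month (2016)", y = "Reported cases", colour = "Report type")

p
```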

Conclusion

A. Which US state or territory has the highest Zika report count? Puerto Rico, a US territory, has the highest number of Zika reports. New York has the highest number of Zika reports among the US mainland states.
B. What does the US Zika report data look like by month in 2016? According to the line graph above, local reports increase until May 2016 and decline after that, whereas travel reports keep increasing from February to June 2016.

Dataset 2. NYC SAT score

  1. Reading .csv file

Let’s read the information from the .csv file

Question:

A. How does the student population look by race in all five boroughs?
B. How do SAT scores differ by region (borough)?

  1. Modify a table

The variables (percent.white, black, Hispanic, and Asian) are stored as character. First, we drop the “%” sign and convert the values to numeric, which makes later calculations easier.

## [1]   NA  3.4 28.6 11.7  3.1  1.7
## [1] "numeric"
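A minimal sketch of this cleaning step, assuming the percentages are stored as strings such as "3.4%" and that missing entries are already NA. The data frame `sat` and its values are illustrative only.

```r
# Toy column in the assumed raw format.
sat <- data.frame(
  percent.white = c(NA, "3.4%", "28.6%", "11.7%"),
  stringsAsFactors = FALSE
)

# Remove the literal "%" and convert what is left to numeric.
sat$percent.white <- as.numeric(sub("%", "", sat$percent.white, fixed = TRUE))

class(sat$percent.white)
```

The same two calls can be repeated for the other race columns.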
  1. Analysis

Now we look at the distribution of the student population by race across all five boroughs. Since the table above is in an untidy (wide) format, we use gather() to reshape it into tidy format. From the tidy data, a bar chart is created with ggplot2.
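The reshaping described here can be sketched as follows. The percentages are made-up placeholders, and the column names `white`, `black`, `hispanic`, and `asian` are assumptions rather than the names in the real file.

```r
library(tidyr)
library(ggplot2)

# Wide table: one column per race (placeholder values, not real data).
race_wide <- data.frame(
  borough = c("Bronx", "Queens"),
  white = c(5, 20),
  black = c(30, 25),
  hispanic = c(60, 35),
  asian = c(5, 20),
  stringsAsFactors = FALSE
)

# gather() stacks every column except borough into key/value pairs.
race_long <- gather(race_wide, key = "race", value = "percent", -borough)

p <- ggplot(race_long, aes(x = borough, y = percent, fill = race)) +
  geom_col(position = "dodge") +
  labs(title = "Student population by race in five boroughs")

p
```

In the long table each row is one borough and race combination, which is the shape ggplot2 needs to map race to the fill colour.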

Let’s compare the average score in each subject (math, writing, and reading) across the five boroughs.
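A sketch of the per-borough averages, assuming one row per school with columns `math`, `reading`, and `writing`. These names and the scores are hypothetical.

```r
library(dplyr)

# Placeholder school-level scores.
sat <- data.frame(
  borough = c("Bronx", "Bronx", "Queens", "Queens"),
  math = c(400, 420, 460, NA),
  reading = c(390, 410, 450, 470),
  writing = c(380, 400, 440, 460),
  stringsAsFactors = FALSE
)

# One row per borough; na.rm = TRUE skips schools with no reported score.
sat_avg <- sat %>%
  group_by(borough) %>%
  summarise(
    math = mean(math, na.rm = TRUE),
    reading = mean(reading, na.rm = TRUE),
    writing = mean(writing, na.rm = TRUE)
  )

sat_avg
```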

Since the sat_score1 table is also untidy, we reshape it into tidy format with gather() as well. A bar chart is then created to compare the scores by borough.

Conclusion

A. How does the student population look by race in all five boroughs? According to the chart “Student population by race in five boroughs”:
* In the Bronx, Hispanic students are the largest group.
* In Brooklyn, Black students are the largest group.
* In Manhattan, Hispanic students are the largest group.
* Queens is the most evenly distributed, although Hispanic students are the largest group. It also has the largest share of Asian students among the five boroughs.
* In Staten Island, white students are the largest group.

B. How do SAT scores differ by region (borough)? Based on the bar chart “Student SAT score by subject in five boroughs”, students from Staten Island have the highest average score in all three subjects, whereas students from the Bronx have the lowest average score in all three.

Dataset 3. Life expectancy

  1. Reading .csv file

Let’s read the .CSV file.

Question

A. How does life expectancy compare between developing and developed countries?
B. What is the relationship between adult mortality and BMI?

  1. Modify table and Analysis
  1. Let’s transform the table above to make the analysis easier. Since we need to compare developing vs. developed countries, we group the data by status. A boxplot is created to compare the mean, median, IQR, minimum, and maximum values.
## Warning: Removed 10 rows containing non-finite values (stat_boxplot).
## Warning: Removed 10 rows containing non-finite values (stat_summary).

  1. Next, we select the country status, adult mortality rate, and BMI. The relationship is then examined using a scatter plot with a fitted line. Let’s make two separate graphs, one for each country status.
## Warning: Removed 42 rows containing non-finite values (stat_smooth).
## Warning: Removed 42 rows containing missing values (geom_point).
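The two plots in this step can be sketched roughly as below. The column names `Status`, `Life.expectancy`, `Adult.Mortality`, and `BMI` and all values are assumptions for illustration. The straight-line fit (`method = "lm"`) is also an assumption, since the report does not say which smoother was used. Rows with NA are dropped by ggplot2 when the plot is drawn, which matches the kind of warnings shown above.

```r
library(ggplot2)

# Placeholder data with a few missing values.
life <- data.frame(
  Status = rep(c("Developed", "Developing"), each = 4),
  Life.expectancy = c(80, 82, 79, NA, 60, 65, 70, 55),
  Adult.Mortality = c(70, 65, 80, 75, 250, 200, NA, 300),
  BMI = c(55, 60, 58, 57, 20, 25, 30, 15),
  stringsAsFactors = FALSE
)

# Boxplot per status; the cross marks the mean, which a boxplot
# does not show on its own.
box <- ggplot(life, aes(x = Status, y = Life.expectancy)) +
  geom_boxplot() +
  stat_summary(fun = mean, geom = "point", shape = 4, size = 3) +
  labs(title = "Boxplot of Life expectancy by Status of country")

# One scatter panel per status, each with its own fitted line.
scatter <- ggplot(life, aes(x = BMI, y = Adult.Mortality)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  facet_wrap(~ Status) +
  labs(title = "Scatterplot of BMI vs. Adult Mortality by Status of country")

box
scatter
```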

Conclusion

A. How does life expectancy compare between developing and developed countries? According to “Boxplot of Life expectancy by Status of country”, both the mean and the median life expectancy are higher in developed countries than in developing countries. The distribution in developed countries is also much tighter than in developing countries.

B. What is the relationship between adult mortality and BMI? Based on “Scatterplot of BMI vs. Adult Mortality by Status of country”, in developing countries adult mortality tends to be higher when BMI is lower. In developed countries, however, no clear relationship between BMI and adult mortality could be found.