HW3 by Guodong Zhang

This is my Homework 3 for DACSS 601.

Guodong Zhang
2022/1/1

1. Identify the dataset.

The dataset, Chinese Real National Income Data, comes from GitHub and contains time series of China's real national income by sector, expressed as an index with 1952 = 100.

The variables in the dataset are as follows:

Variable      Data type  Description
index         Number     Index of data.
agriculture   Number     Real national income in the agriculture sector.
commerce      Number     Real national income in the commerce sector.
construction  Number     Real national income in the construction sector.
industry      Number     Real national income in the industry sector.
transport     Number     Real national income in the transport sector.
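
Since every sector series is an index with 1952 = 100, it may help to see how such an index is built from raw levels. The sketch below is only an illustration: the raw income values and the helper name `to_index` are hypothetical and do not come from the dataset.

```r
# Hypothetical raw income levels for one sector (NOT taken from the dataset).
raw_income <- c(50, 51, 56.5, 60)

# Hypothetical helper: rescale a series so that the base value equals 100.
# By default the first observation is used as the base period.
to_index <- function(x, base = x[1]) {
    100 * x / base
}

income_index <- to_index(raw_income)
income_index
# 100 102 113 120
```

An index value of 113 therefore means the series is 13% above its base-period level.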

2. Read in the dataset.

library(tidyverse)  # provides read_csv(), %>% and select()
incoming_data <- read_csv("C:/Users/zhang/OneDrive - University of Massachusetts/_601/Sample Datasets/ChinaIncome.csv", show_col_types = FALSE)
incoming_data
# A tibble: 37 x 6
   index agriculture commerce construction industry transport
   <dbl>       <dbl>    <dbl>        <dbl>    <dbl>     <dbl>
 1     1       100       100          100      100       100 
 2     2       102.      133          138.     134.      120 
 3     3       103.      136.         133.     159.      136 
 4     4       112.      138.         152.     169.      140 
 5     5       116.      147.         262.     219.      164 
 6     6       120.      147.         243.     244.      176 
 7     7       120.      156.         367      384.      271.
 8     8       101.      170.         389.     502.      356.
 9     9        83.6     164.         394      541.      384.
10    10        84.7     130.         130.     316.      221.
# ... with 27 more rows

3. Research questions.

a. Which sector grew the fastest during this period?

incoming_data %>%
    select(2:6) %>%
    apply(1, which.max)
 [1] 1 3 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
[34] 4 4 4 4

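As we can see, the 4th of the five selected columns, which is industry, has the largest index value in almost every row. The only exceptions are the first row, where all sectors tie at 100 and which.max returns the first column, and two early rows in which the 3rd column, construction, is the largest. Because every series starts at 100 in 1952, the largest index value also means the largest cumulative growth since 1952. Thus, industry grew the fastest.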
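
As a cross-check on this reasoning, growth can also be measured directly from the first and last rows. The following is a minimal sketch, not the method used above: the helper name `growth_rate` is hypothetical, and the data frame holds only the first four rows of the printed tibble, rounded as displayed, so the numbers are illustrative rather than results for the full period.

```r
# First four rows of the printed tibble, rounded as displayed (an excerpt only).
income_excerpt <- data.frame(
    agriculture  = c(100, 102, 103, 112),
    commerce     = c(100, 133, 136, 138),
    construction = c(100, 138, 133, 152),
    industry     = c(100, 134, 159, 169),
    transport    = c(100, 120, 136, 140)
)

# Hypothetical helper: average compound growth rate per period for each column,
# computed from the first and last rows only.
growth_rate <- function(df) {
    n_periods <- nrow(df) - 1
    first <- unlist(df[1, ])
    last  <- unlist(df[nrow(df), ])
    (last / first)^(1 / n_periods) - 1
}

rates <- growth_rate(income_excerpt)
sort(rates, decreasing = TRUE)

# Name of the column with the highest average growth rate.
fastest <- names(which.max(rates))
fastest
```

On the full data, the same helper could be applied as `growth_rate(incoming_data[2:6])`, assuming `incoming_data` has been read in as shown earlier.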

b. How strongly are these fields correlated with one another?

# Keep only the five sector columns (drop the index column).
field_data <- incoming_data[2:6]
# Print the Pearson correlation for every pair of columns.
for (i in 1:4) {
    for (j in (i + 1):5) {
        x <- field_data[[i]]
        y <- field_data[[j]]
        cat("The correlation between the ", i, "th field and the ", j, "th field: ", sep = "")
        cat(cor(x, y), "\n")
    }
}
The correlation between the 1th field and the 2th field: 0.9641547 
The correlation between the 1th field and the 3th field: 0.955028 
The correlation between the 1th field and the 4th field: 0.9670784 
The correlation between the 1th field and the 5th field: 0.9521588 
The correlation between the 2th field and the 3th field: 0.9880994 
The correlation between the 2th field and the 4th field: 0.9883226 
The correlation between the 2th field and the 5th field: 0.9881633 
The correlation between the 3th field and the 4th field: 0.9873144 
The correlation between the 3th field and the 5th field: 0.9936743 
The correlation between the 4th field and the 5th field: 0.9927794 

As we can see, every pair of the five fields is strongly correlated (all coefficients are above 0.95), and the strongest correlation, about 0.994, is between the 3rd field, construction, and the 5th field, transport.
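
The pairwise loop above can also be expressed with a single call to `cor()` on all columns, which returns the full correlation matrix. The sketch below is one possible way to do this, with a hypothetical helper `strongest_pair`. It again uses only the first four rows of the printed tibble, rounded as displayed, so its correlations will differ from the values reported above.

```r
# First four rows of the printed tibble, rounded as displayed (an excerpt only).
income_excerpt <- data.frame(
    agriculture  = c(100, 102, 103, 112),
    commerce     = c(100, 133, 136, 138),
    construction = c(100, 138, 133, 152),
    industry     = c(100, 134, 159, 169),
    transport    = c(100, 120, 136, 140)
)

# Hypothetical helper: names of the two columns with the highest correlation.
strongest_pair <- function(df) {
    cor_matrix <- cor(df)
    # Blank out the diagonal and lower triangle so each pair is considered once.
    cor_matrix[lower.tri(cor_matrix, diag = TRUE)] <- NA
    # Row and column position of the largest remaining coefficient.
    pos <- which(cor_matrix == max(cor_matrix, na.rm = TRUE), arr.ind = TRUE)[1, ]
    c(rownames(cor_matrix)[pos["row"]], colnames(cor_matrix)[pos["col"]])
}

round(cor(income_excerpt), 3)
pair <- strongest_pair(income_excerpt)
pair
```

On the full data, `strongest_pair(incoming_data[2:6])` would be the corresponding call, assuming `incoming_data` has been read in as shown earlier.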