Overview

For this project, I chose three of the example untidy datasets suggested by other students. I converted them to CSV files as necessary and loaded them into RStudio. Finally, I performed the suggested analyses on each dataset.

pacman::p_load(dplyr, tidyr, magrittr, stringr, ggplot2)

wide_data <- read.csv('gender_education.csv')
wide_data
##             Region Male_NoHS Male_HS Male_Associate Male_Bachelors
## 1 Northeast Region     24801   36714          44765          65177
## 2   Midwest Region     23530   33606          41180          59243
## 3     South Region     22065   31552          40097          59711
## 4      West Region     22554   32289          42127          62069
##   Male_Graduate Female_NoHS Female_HS Female_Associate Female_Bachelors
## 1         89449       16523     24251            31455            45593
## 2         77325       15471     21713            26944            39534
## 3         81744       14659     21530            27243            41210
## 4         86476       15250     23230            29402            42527
##   Female_Graduate
## 1           61203
## 2           54246
## 3           53100
## 4           57914

This is a data frame of the highest educational attainment of males and females by region. To tidy the data, I will gather the ten gender/education columns into a single column, Gender_Education.

ge_data <- gather(wide_data, 'Gender_Education', 'Count', 2:11)
head(ge_data)
##             Region Gender_Education Count
## 1 Northeast Region        Male_NoHS 24801
## 2   Midwest Region        Male_NoHS 23530
## 3     South Region        Male_NoHS 22065
## 4      West Region        Male_NoHS 22554
## 5 Northeast Region          Male_HS 36714
## 6   Midwest Region          Male_HS 33606

I will further organize the data by splitting Gender_Education into two separate columns, Gender and Education, using substring detection.

ge_data %<>%
  mutate(Gender = ifelse(grepl("Fem", Gender_Education), "Female", "Male")) %>%  # "Fem" occurs only in the Female_* names
  mutate(Education = str_remove_all(Gender_Education, "Female_")) %>%            # strip the gender prefix
  mutate(Education = str_remove_all(Education, "Male_")) %>%
  select(-Gender_Education)

# reorder the columns to Region, Gender, Education, Count
ge_data <- ge_data[c(1,3,4,2)]
head(ge_data)
##             Region Gender Education Count
## 1 Northeast Region   Male      NoHS 24801
## 2   Midwest Region   Male      NoHS 23530
## 3     South Region   Male      NoHS 22065
## 4      West Region   Male      NoHS 22554
## 5 Northeast Region   Male        HS 36714
## 6   Midwest Region   Male        HS 33606
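
As an aside, tidyr 1.0's pivot_longer() can perform the gather and the split in a single step via names_sep; a minimal sketch (assuming tidyr >= 1.0 is installed; ge_long is a hypothetical name):

# Sketch: gather and split the gender/education columns in one call.
ge_long <- pivot_longer(wide_data, cols = -Region,
                        names_to = c('Gender', 'Education'),
                        names_sep = '_',
                        values_to = 'Count')
head(ge_long)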

The analysis required for this dataset is to find which region has the highest proportion of people whose highest educational attainment is High School. Other suggested analyses involved average income, but that is not possible with the given data. To do this, I will separate the data frame into four region-specific data frames and, for each region, calculate the overall total and the total whose highest attainment is High School.

NE <- ge_data %>%
  filter(Region == 'Northeast Region')

NEtotal <- sum(NE$Count)
NEhs <- sum(NE$Count[NE$Education == 'HS'])

MW <- ge_data %>%
  filter(Region == 'Midwest Region')

MWtotal <- sum(MW$Count)
MWhs <- sum(MW$Count[MW$Education == 'HS'])

S <- ge_data %>%
  filter(Region == 'South Region')

Stotal <- sum(S$Count)
Shs <- sum(S$Count[S$Education == 'HS'])

W <- ge_data %>%
  filter(Region == 'West Region')

Wtotal <- sum(W$Count)
Whs <- sum(W$Count[W$Education == 'HS'])

NE_HSmost <- NEhs / NEtotal
MW_HSmost <- MWhs / MWtotal
S_HSmost <- Shs / Stotal
W_HSmost <- Whs / Wtotal

regions_hs <- c(NE_HSmost, MW_HSmost, S_HSmost, W_HSmost)

cat("The proportion of people in the Northeast with the highest educational attainment of High School is", NE_HSmost, "\nThe proportion of people in the Midwest with the highest educational attainment of High School is", MW_HSmost, "\nThe proportion of people in the South  with the highest educational attainment of High School is", S_HSmost, "\nThe proportion of people in the West with the highest educational attainment of High School is", W_HSmost, "\n\nThe highest proportion is", max(regions_hs), "which is from the Midwest region.")
## The proportion of people in the Northeast with the highest educational attainment of High School is 0.1385785 
## The proportion of people in the Midwest with the highest educational attainment of High School is 0.1408354 
## The proportion of people in the South  with the highest educational attainment of High School is 0.1350993 
## The proportion of people in the West with the highest educational attainment of High School is 0.1341564 
## 
## The highest proportion is 0.1408354 which is from the Midwest region.
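
The four near-identical per-region blocks above could also be collapsed into a single grouped summary; a minimal dplyr sketch that reproduces the same proportions (hs_props is a hypothetical name):

# Sketch: High School proportion per region in one grouped summary.
hs_props <- ge_data %>%
  group_by(Region) %>%
  summarise(HS_Proportion = sum(Count[Education == 'HS']) / sum(Count)) %>%
  arrange(desc(HS_Proportion))
hs_props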

wide_data <- read.csv('time_series_19-covid-Confirmed.csv')
head(wide_data)
##   Province.State Country.Region     Lat     Long X1.22.20 X1.23.20 X1.24.20
## 1          Anhui Mainland China 31.8257 117.2264        1        9       15
## 2        Beijing Mainland China 40.1824 116.4142       14       22       36
## 3      Chongqing Mainland China 30.0572 107.8740        6        9       27
## 4         Fujian Mainland China 26.0789 117.9874        1        5       10
## 5          Gansu Mainland China 36.0611 103.8343        0        2        2
## 6      Guangdong Mainland China 23.3417 113.4244       26       32       53
##   X1.25.20 X1.26.20 X1.27.20 X1.28.20 X1.29.20 X1.30.20 X1.31.20 X2.1.20
## 1       39       60       70      106      152      200      237     297
## 2       41       68       80       91      111      114      139     168
## 3       57       75      110      132      147      182      211     247
## 4       18       35       59       80       84      101      120     144
## 5        4        7       14       19       24       26       29      40
## 6       78      111      151      207      277      354      436     535
##   X2.2.20 X2.3.20 X2.4.20 X2.5.20 X2.6.20 X2.7.20 X2.8.20 X2.9.20 X2.10.20
## 1     340     408     480     530     591     665     733     779      830
## 2     191     212     228     253     274     297     315     326      337
## 3     300     337     366     389     411     426     428     468      486
## 4     159     179     194     205     215     224     239     250      261
## 5      51      55      57      62      62      67      79      83       83
## 6     632     725     813     895     970    1034    1095    1131     1159
##   X2.11.20 X2.12.20 X2.13.20 X2.14.20 X2.15.20 X2.16.20 X2.17.20 X2.18.20
## 1      860      889      910      934      950      962      973      982
## 2      342      352      366      372      375      380      381      387
## 3      505      518      529      537      544      551      553      555
## 4      267      272      279      281      285      287      290      292
## 5       86       87       90       90       90       90       91       91
## 6     1177     1219     1241     1261     1294     1316     1322     1328
##   X2.19.20 X2.20.20 X2.21.20 X2.22.20 X2.23.20 X2.24.20 X2.25.20 X2.26.20
## 1      986      987      988      989      989      989      989      989
## 2      393      395      396      399      399      399      400      400
## 3      560      567      572      573      575      576      576      576
## 4      293      293      293      293      293      293      294      294
## 5       91       91       91       91       91       91       91       91
## 6     1331     1332     1333     1339     1342     1345     1347     1347
##   X2.27.20 X2.28.20 X2.29.20 X3.1.20 X3.2.20 X3.3.20 X3.4.20 X3.5.20 X3.6.20
## 1      989      990      990     990     990     990     990     990     990
## 2      410      410      411     413     414     414     418     418     422
## 3      576      576      576     576     576     576     576     576     576
## 4      296      296      296     296     296     296     296     296     296
## 5       91       91       91      91      91      91      91     102     119
## 6     1347     1348     1349    1349    1350    1350    1350    1351    1352

This is a dataset of confirmed coronavirus cases in various regions between January 22 and March 6, 2020. I will start tidying it by renaming the date columns to day numbers, from day 1 of collection through day 45.

wide_data %<>%
  rename(Province = Province.State, Country = Country.Region)

# Columns 5:49 hold the daily counts for Jan 22 through Mar 6, 2020, in order,
# so rename them to day numbers 1 through 45.
names(wide_data)[5:49] <- as.character(1:45)

head(wide_data)
##    Province        Country     Lat     Long  1  2  3  4   5   6   7   8   9  10
## 1     Anhui Mainland China 31.8257 117.2264  1  9 15 39  60  70 106 152 200 237
## 2   Beijing Mainland China 40.1824 116.4142 14 22 36 41  68  80  91 111 114 139
## 3 Chongqing Mainland China 30.0572 107.8740  6  9 27 57  75 110 132 147 182 211
## 4    Fujian Mainland China 26.0789 117.9874  1  5 10 18  35  59  80  84 101 120
## 5     Gansu Mainland China 36.0611 103.8343  0  2  2  4   7  14  19  24  26  29
## 6 Guangdong Mainland China 23.3417 113.4244 26 32 53 78 111 151 207 277 354 436
##    11  12  13  14  15  16   17   18   19   20   21   22   23   24   25   26
## 1 297 340 408 480 530 591  665  733  779  830  860  889  910  934  950  962
## 2 168 191 212 228 253 274  297  315  326  337  342  352  366  372  375  380
## 3 247 300 337 366 389 411  426  428  468  486  505  518  529  537  544  551
## 4 144 159 179 194 205 215  224  239  250  261  267  272  279  281  285  287
## 5  40  51  55  57  62  62   67   79   83   83   86   87   90   90   90   90
## 6 535 632 725 813 895 970 1034 1095 1131 1159 1177 1219 1241 1261 1294 1316
##     27   28   29   30   31   32   33   34   35   36   37   38   39   40   41
## 1  973  982  986  987  988  989  989  989  989  989  989  990  990  990  990
## 2  381  387  393  395  396  399  399  399  400  400  410  410  411  413  414
## 3  553  555  560  567  572  573  575  576  576  576  576  576  576  576  576
## 4  290  292  293  293  293  293  293  293  294  294  296  296  296  296  296
## 5   91   91   91   91   91   91   91   91   91   91   91   91   91   91   91
## 6 1322 1328 1331 1332 1333 1339 1342 1345 1347 1347 1347 1348 1349 1349 1350
##     42   43   44   45
## 1  990  990  990  990
## 2  414  418  418  422
## 3  576  576  576  576
## 4  296  296  296  296
## 5   91   91  102  119
## 6 1350 1350 1351 1352

I can now gather the day columns, which now represent days of collection, into a single column, Day.

cv_data <- gather(wide_data, 'Day', 'Infections', 5:49)

cv_data %<>% mutate(Day = as.integer(Day))
head(cv_data)
##    Province        Country     Lat     Long Day Infections
## 1     Anhui Mainland China 31.8257 117.2264   1          1
## 2   Beijing Mainland China 40.1824 116.4142   1         14
## 3 Chongqing Mainland China 30.0572 107.8740   1          6
## 4    Fujian Mainland China 26.0789 117.9874   1          1
## 5     Gansu Mainland China 36.0611 103.8343   1          0
## 6 Guangdong Mainland China 23.3417 113.4244   1         26
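
As an aside, the original X1.22.20-style column headers could have been parsed into real dates instead of sequential day numbers; a sketch on a few sample raw headers (before the rename above):

# Sketch: convert 'X1.22.20'-style headers into Date objects.
raw_names <- c('X1.22.20', 'X2.10.20', 'X3.6.20')   # sample raw headers
as.Date(sub('^X', '', raw_names), format = '%m.%d.%y')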

The insights we are looking for from this data are: How does the number of confirmed cases change over time? Describe cases over time at the country level. What is the latest number of cases?

To find the change over time, we first need to find the total infections by day.

day_count <- rep(NA, 45)

# total confirmed infections across all regions for each day
for(i in 1:45){
  day_count[i] <- sum(cv_data$Infections[cv_data$Day == i])
}
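
As an aside, the same daily totals can be computed without the explicit loop; a minimal dplyr sketch (daily_totals is a hypothetical name):

# Sketch: daily totals via a grouped summary instead of a loop.
daily_totals <- cv_data %>%
  group_by(Day) %>%
  summarise(Total = sum(Infections))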

plot(1:45, day_count, xlab = 'Day', ylab = 'Confirmed Infections')

In this plot, the growth of infections over time appears roughly linear, but this is misleading if you are trying to understand how rapidly the virus spreads. The plot aggregates all countries in the data frame, including countries that had not yet seen the virus on a given day, and the vast majority of the infections come from China alone. It would be more informative to plot individual countries.
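
One way to see growth rates directly is to put the y-axis on a log scale, where exponential growth appears as a straight line; a minimal sketch:

# Sketch: same global totals on a log scale.
plot(1:45, day_count, log = 'y', xlab = 'Day',
     ylab = 'Confirmed Infections (log scale)')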

The following loop plots infections over time for each country in the dataset that reached at least 20 infections during the period.

day_count <- rep(NA, 45)
countries <- unique(cv_data$Country)

for (c in 1:length(countries)){
  d <- cv_data %>%
    filter(Country == countries[c])
  # skip countries that never reached 20 confirmed infections
  if(max(d$Infections) < 20){
    next
  }
  # daily totals, summed across the country's provinces
  for(i in 1:45){
    day_count[i] <- sum(d$Infections[d$Day == i])
  }
  plot(1:45, day_count, xlab = 'Day', ylab = 'Confirmed Infections', main = countries[c])
}

Most countries show roughly exponential growth in infections over time, while a few, such as Thailand and Taiwan, appear closer to linear.
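
Since ggplot2 is already loaded, the same per-country curves could be drawn as a single faceted figure, which makes the growth shapes easier to compare; a minimal sketch:

# Sketch: one faceted plot instead of one base plot per country.
cv_data %>%
  group_by(Country, Day) %>%
  summarise(Total = sum(Infections)) %>%
  group_by(Country) %>%
  filter(max(Total) >= 20) %>%       # same 20-infection cutoff as the loop
  ggplot(aes(x = Day, y = Total)) +
  geom_line() +
  facet_wrap(~ Country, scales = 'free_y')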

current_infections <- sum(cv_data$Infections[cv_data$Day == 45])

cat("The latest count of confirmed infections is:", current_infections)
## The latest count of confirmed infections is: 101800
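
Hard-coding day 45 works for this snapshot; a slightly more robust variant keys the latest day off the data itself, in case more days are appended later; a sketch (latest_day is a hypothetical name):

# Sketch: derive the latest day from the data rather than hard-coding 45.
latest_day <- max(cv_data$Day)
current_infections <- sum(cv_data$Infections[cv_data$Day == latest_day])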

wide <- read.csv('studentloans.csv')
wide
##     Amount.Borrowed White Black_or_African_American Hispanic_or_Latino Asian
## 1       All Amounts   60%                       17%                14%    4%
## 2          < $10000   56%                       18%                18%    3%
## 3  $10001 to $20000   65%                       13%                13%    4%
## 4  $20001 to $40000   62%                       17%                12%    4%
## 5  $40001 to $60000   57%                       22%                11%    3%
## 6  $60001 to $80000   61%                       22%                 8%    4%
## 7 $80001 to $100000   58%                       26%                 6%    5%
## 8         > $100000   59%                       16%                 8%   11%
##   Other
## 1    5%
## 2    5%
## 3    5%
## 4    5%
## 5    6%
## 6    4%
## 7    5%
## 8    6%

This is a data frame of the proportions of student-loan holders, by race, for various borrowed-amount ranges. I will start tidying the data by converting the percentage strings into numeric values.

wide %<>%
  rename(Black = Black_or_African_American, Hispanic = Hispanic_or_Latino) %>%
  # strip the '%' signs and convert each race column to numeric
  mutate_at(vars(White:Other), ~ as.numeric(str_remove_all(.x, "%")))

wide
##     Amount.Borrowed White Black Hispanic Asian Other
## 1       All Amounts    60    17       14     4     5
## 2          < $10000    56    18       18     3     5
## 3  $10001 to $20000    65    13       13     4     5
## 4  $20001 to $40000    62    17       12     4     5
## 5  $40001 to $60000    57    22       11     3     6
## 6  $60001 to $80000    61    22        8     4     4
## 7 $80001 to $100000    58    26        6     5     5
## 8         > $100000    59    16        8    11     6

The values are percentages, so we need to divide these values by 100.

wide %<>%
  mutate_at(vars(White:Other), ~ .x / 100)

wide
##     Amount.Borrowed White Black Hispanic Asian Other
## 1       All Amounts  0.60  0.17     0.14  0.04  0.05
## 2          < $10000  0.56  0.18     0.18  0.03  0.05
## 3  $10001 to $20000  0.65  0.13     0.13  0.04  0.05
## 4  $20001 to $40000  0.62  0.17     0.12  0.04  0.05
## 5  $40001 to $60000  0.57  0.22     0.11  0.03  0.06
## 6  $60001 to $80000  0.61  0.22     0.08  0.04  0.04
## 7 $80001 to $100000  0.58  0.26     0.06  0.05  0.05
## 8         > $100000  0.59  0.16     0.08  0.11  0.06
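
Both steps could also be collapsed into one using readr::parse_number(), which drops the '%' during parsing; a sketch assuming the readr package is installed (wide2 is a hypothetical name):

# Sketch: strip '%' and scale to proportions in a single pass.
wide2 <- read.csv('studentloans.csv') %>%
  rename(Black = Black_or_African_American, Hispanic = Hispanic_or_Latino) %>%
  mutate_at(vars(White:Other), ~ readr::parse_number(as.character(.x)) / 100)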

We can now transform the data from wide to long by gathering the race columns into a single column, Race.

sl <- gather(wide, 'Race', 'Proportion', 2:6)
head(sl)
##    Amount.Borrowed  Race Proportion
## 1      All Amounts White       0.60
## 2         < $10000 White       0.56
## 3 $10001 to $20000 White       0.65
## 4 $20001 to $40000 White       0.62
## 5 $40001 to $60000 White       0.57
## 6 $60001 to $80000 White       0.61

The analysis suggested for this dataset would require outside data, so I will instead use the data to find which races hold the highest and lowest proportions of loans in each amount range.

Race holding the highest proportion in each amount-borrowed range:

sl %>%
  group_by(Amount.Borrowed) %>%
  filter(Proportion == max(Proportion))
## # A tibble: 8 x 3
## # Groups:   Amount.Borrowed [8]
##   Amount.Borrowed   Race  Proportion
##   <fct>             <chr>      <dbl>
## 1 All Amounts       White      0.6  
## 2 < $10000          White      0.56 
## 3 $10001 to $20000  White      0.65 
## 4 $20001 to $40000  White      0.62 
## 5 $40001 to $60000  White      0.570
## 6 $60001 to $80000  White      0.61 
## 7 $80001 to $100000 White      0.580
## 8 > $100000         White      0.59
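
Note that filter(Proportion == max(Proportion)) returns every row tied for the maximum; if exactly one row per group is wanted, slice(which.max(...)) is an alternative; a sketch:

# Sketch: one row per group, keeping only the first maximum.
sl %>%
  group_by(Amount.Borrowed) %>%
  slice(which.max(Proportion))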

Races holding the lowest proportion in each amount-borrowed range (ties appear as multiple rows):

sl %>%
  group_by(Amount.Borrowed) %>%
  filter(Proportion == min(Proportion))
## # A tibble: 10 x 3
## # Groups:   Amount.Borrowed [8]
##    Amount.Borrowed   Race  Proportion
##    <fct>             <chr>      <dbl>
##  1 All Amounts       Asian       0.04
##  2 < $10000          Asian       0.03
##  3 $10001 to $20000  Asian       0.04
##  4 $20001 to $40000  Asian       0.04
##  5 $40001 to $60000  Asian       0.03
##  6 $60001 to $80000  Asian       0.04
##  7 $80001 to $100000 Asian       0.05
##  8 $60001 to $80000  Other       0.04
##  9 $80001 to $100000 Other       0.05
## 10 > $100000         Other       0.06

The races Asian and Other are tied for the lowest proportion of loan holders in the $60,001 to $80,000 and $80,001 to $100,000 ranges, while Other alone holds the lowest proportion above $100,000.