For this project, I chose three of the example untidy datasets suggested by other students. I converted them to CSV files as necessary and loaded them into RStudio, then performed the suggested analyses on each dataset.
pacman::p_load(dplyr, tidyr, magrittr, stringr, ggplot2)
wide_data <- read.csv('gender_education.csv')
wide_data
## Region Male_NoHS Male_HS Male_Associate Male_Bachelors
## 1 Northeast Region 24801 36714 44765 65177
## 2 Midwest Region 23530 33606 41180 59243
## 3 South Region 22065 31552 40097 59711
## 4 West Region 22554 32289 42127 62069
## Male_Graduate Female_NoHS Female_HS Female_Associate Female_Bachelors
## 1 89449 16523 24251 31455 45593
## 2 77325 15471 21713 26944 39534
## 3 81744 14659 21530 27243 41210
## 4 86476 15250 23230 29402 42527
## Female_Graduate
## 1 61203
## 2 54246
## 3 53100
## 4 57914
This data frame gives counts of males and females in each region by highest educational attainment. To tidy the data, I will gather the gender/education combination columns into a single column, Gender_Education.
ge_data <- gather(wide_data, 'Gender_Education', 'Count', 2:11)
head(ge_data)
## Region Gender_Education Count
## 1 Northeast Region Male_NoHS 24801
## 2 Midwest Region Male_NoHS 23530
## 3 South Region Male_NoHS 22065
## 4 West Region Male_NoHS 22554
## 5 Northeast Region Male_HS 36714
## 6 Midwest Region Male_HS 33606
I will further organize the data by splitting Gender_Education into two separate columns, Gender and Education, using substring detection.
ge_data %<>%
mutate(Gender = ifelse(grepl("Fem", Gender_Education), "Female", "Male")) %>%
mutate(Education = str_remove_all(Gender_Education, "Female_")) %>%
mutate(Education = str_remove_all(Education, "Male_")) %>%
select(-Gender_Education)
ge_data <- ge_data[c(1,3,4,2)]
head(ge_data)
## Region Gender Education Count
## 1 Northeast Region Male NoHS 24801
## 2 Midwest Region Male NoHS 23530
## 3 South Region Male NoHS 22065
## 4 West Region Male NoHS 22554
## 5 Northeast Region Male HS 36714
## 6 Midwest Region Male HS 33606
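As an aside, the same split could be done in a single step with tidyr::separate(), since every combined value follows the Gender_Education pattern with exactly one underscore. A minimal sketch, equivalent to the chain above (ge_alt is just an illustrative name):

# Sketch: split Gender_Education at the underscore instead of stripping prefixes
ge_alt <- gather(wide_data, 'Gender_Education', 'Count', 2:11) %>%
  separate(Gender_Education, into = c("Gender", "Education"), sep = "_") %>%
  select(Region, Gender, Education, Count)
head(ge_alt)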
The analysis required for this dataset is to find which region has the highest proportion of people whose highest educational attainment is High School. Other suggested analyses involved average income, which is not possible with the given data. To do this, I will split the data frame into four separate data frames by region and, for each region, calculate the overall total and the total of people who only completed High School.
NE <- ge_data %>%
filter(Region == 'Northeast Region')
NEtotal <- sum(NE$Count)
NEhs <- sum(NE$Count[NE$Education == 'HS'])
MW <- ge_data %>%
filter(Region == 'Midwest Region')
MWtotal <- sum(MW$Count)
MWhs <- sum(MW$Count[MW$Education == 'HS'])
S <- ge_data %>%
filter(Region == 'South Region')
Stotal <- sum(S$Count)
Shs <- sum(S$Count[S$Education == 'HS'])
W <- ge_data %>%
filter(Region == 'West Region')
Wtotal <- sum(W$Count)
Whs <- sum(W$Count[W$Education == 'HS'])
NE_HSmost <- NEhs / NEtotal
MW_HSmost <- MWhs / MWtotal
S_HSmost <- Shs / Stotal
W_HSmost <- Whs / Wtotal
regions_hs <- c(Northeast = NE_HSmost, Midwest = MW_HSmost, South = S_HSmost, West = W_HSmost)
cat("The proportion of people in the Northeast with the highest educational attainment of High School is", NE_HSmost,
    "\nThe proportion of people in the Midwest with the highest educational attainment of High School is", MW_HSmost,
    "\nThe proportion of people in the South with the highest educational attainment of High School is", S_HSmost,
    "\nThe proportion of people in the West with the highest educational attainment of High School is", W_HSmost,
    "\n\nThe highest proportion is", max(regions_hs), "which is from the", names(which.max(regions_hs)), "region.")
## The proportion of people in the Northeast with the highest educational attainment of High School is 0.1385785
## The proportion of people in the Midwest with the highest educational attainment of High School is 0.1408354
## The proportion of people in the South with the highest educational attainment of High School is 0.1350993
## The proportion of people in the West with the highest educational attainment of High School is 0.1341564
##
## The highest proportion is 0.1408354 which is from the Midwest region.
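The four per-region blocks above could also be collapsed into one grouped summary. A minimal sketch that computes the same proportions in a single pipeline:

# Sketch: proportion with High School as highest attainment, per region
ge_data %>%
  group_by(Region) %>%
  summarise(HS_Proportion = sum(Count[Education == 'HS']) / sum(Count)) %>%
  arrange(desc(HS_Proportion))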
wide_data <- read.csv('time_series_19-covid-Confirmed.csv')
head(wide_data)
## Province.State Country.Region Lat Long X1.22.20 X1.23.20 X1.24.20
## 1 Anhui Mainland China 31.8257 117.2264 1 9 15
## 2 Beijing Mainland China 40.1824 116.4142 14 22 36
## 3 Chongqing Mainland China 30.0572 107.8740 6 9 27
## 4 Fujian Mainland China 26.0789 117.9874 1 5 10
## 5 Gansu Mainland China 36.0611 103.8343 0 2 2
## 6 Guangdong Mainland China 23.3417 113.4244 26 32 53
## X1.25.20 X1.26.20 X1.27.20 X1.28.20 X1.29.20 X1.30.20 X1.31.20 X2.1.20
## 1 39 60 70 106 152 200 237 297
## 2 41 68 80 91 111 114 139 168
## 3 57 75 110 132 147 182 211 247
## 4 18 35 59 80 84 101 120 144
## 5 4 7 14 19 24 26 29 40
## 6 78 111 151 207 277 354 436 535
## X2.2.20 X2.3.20 X2.4.20 X2.5.20 X2.6.20 X2.7.20 X2.8.20 X2.9.20 X2.10.20
## 1 340 408 480 530 591 665 733 779 830
## 2 191 212 228 253 274 297 315 326 337
## 3 300 337 366 389 411 426 428 468 486
## 4 159 179 194 205 215 224 239 250 261
## 5 51 55 57 62 62 67 79 83 83
## 6 632 725 813 895 970 1034 1095 1131 1159
## X2.11.20 X2.12.20 X2.13.20 X2.14.20 X2.15.20 X2.16.20 X2.17.20 X2.18.20
## 1 860 889 910 934 950 962 973 982
## 2 342 352 366 372 375 380 381 387
## 3 505 518 529 537 544 551 553 555
## 4 267 272 279 281 285 287 290 292
## 5 86 87 90 90 90 90 91 91
## 6 1177 1219 1241 1261 1294 1316 1322 1328
## X2.19.20 X2.20.20 X2.21.20 X2.22.20 X2.23.20 X2.24.20 X2.25.20 X2.26.20
## 1 986 987 988 989 989 989 989 989
## 2 393 395 396 399 399 399 400 400
## 3 560 567 572 573 575 576 576 576
## 4 293 293 293 293 293 293 294 294
## 5 91 91 91 91 91 91 91 91
## 6 1331 1332 1333 1339 1342 1345 1347 1347
## X2.27.20 X2.28.20 X2.29.20 X3.1.20 X3.2.20 X3.3.20 X3.4.20 X3.5.20 X3.6.20
## 1 989 990 990 990 990 990 990 990 990
## 2 410 410 411 413 414 414 418 418 422
## 3 576 576 576 576 576 576 576 576 576
## 4 296 296 296 296 296 296 296 296 296
## 5 91 91 91 91 91 91 91 102 119
## 6 1347 1348 1349 1349 1350 1350 1350 1351 1352
This is a dataset of confirmed coronavirus cases in various regions between January 22nd and March 6th, 2020. I will start tidying this data by renaming the date columns to numbers representing days 1 through 45 of collection.
wide_data %<>%
rename("Province" = Province.State, "Country" = Country.Region, "1" = X1.22.20, "2" = X1.23.20, "3" = X1.24.20, "4" = X1.25.20, "5" = X1.26.20, "6" = X1.27.20, "7" = X1.28.20, "8" = X1.29.20, "9" = X1.30.20, "10" = X1.31.20, "11" = X2.1.20, "12" = X2.2.20, "13" = X2.3.20, "14" = X2.4.20, "15" = X2.5.20, "16" = X2.6.20, "17" = X2.7.20, "18" = X2.8.20, "19" = X2.9.20, "20" = X2.10.20, "21" = X2.11.20, "22" = X2.12.20, "23" = X2.13.20, "24" = X2.14.20, "25" = X2.15.20, "26" = X2.16.20, "27" = X2.17.20, "28" = X2.18.20, "29" = X2.19.20, "30" = X2.20.20, "31" = X2.21.20, "32" = X2.22.20, "33" = X2.23.20, "34" = X2.24.20, "35" = X2.25.20, "36" = X2.26.20, "37" = X2.27.20, "38" = X2.28.20, "39" = X2.29.20, "40" = X3.1.20, "41" = X3.2.20, "42" = X3.3.20, "43" = X3.4.20, "44" = X3.5.20, "45" = X3.6.20)
head(wide_data)
## Province Country Lat Long 1 2 3 4 5 6 7 8 9 10
## 1 Anhui Mainland China 31.8257 117.2264 1 9 15 39 60 70 106 152 200 237
## 2 Beijing Mainland China 40.1824 116.4142 14 22 36 41 68 80 91 111 114 139
## 3 Chongqing Mainland China 30.0572 107.8740 6 9 27 57 75 110 132 147 182 211
## 4 Fujian Mainland China 26.0789 117.9874 1 5 10 18 35 59 80 84 101 120
## 5 Gansu Mainland China 36.0611 103.8343 0 2 2 4 7 14 19 24 26 29
## 6 Guangdong Mainland China 23.3417 113.4244 26 32 53 78 111 151 207 277 354 436
## 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
## 1 297 340 408 480 530 591 665 733 779 830 860 889 910 934 950 962
## 2 168 191 212 228 253 274 297 315 326 337 342 352 366 372 375 380
## 3 247 300 337 366 389 411 426 428 468 486 505 518 529 537 544 551
## 4 144 159 179 194 205 215 224 239 250 261 267 272 279 281 285 287
## 5 40 51 55 57 62 62 67 79 83 83 86 87 90 90 90 90
## 6 535 632 725 813 895 970 1034 1095 1131 1159 1177 1219 1241 1261 1294 1316
## 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41
## 1 973 982 986 987 988 989 989 989 989 989 989 990 990 990 990
## 2 381 387 393 395 396 399 399 399 400 400 410 410 411 413 414
## 3 553 555 560 567 572 573 575 576 576 576 576 576 576 576 576
## 4 290 292 293 293 293 293 293 293 294 294 296 296 296 296 296
## 5 91 91 91 91 91 91 91 91 91 91 91 91 91 91 91
## 6 1322 1328 1331 1332 1333 1339 1342 1345 1347 1347 1347 1348 1349 1349 1350
## 42 43 44 45
## 1 990 990 990 990
## 2 414 418 418 422
## 3 576 576 576 576
## 4 296 296 296 296
## 5 91 91 102 119
## 6 1350 1350 1351 1352
The date columns, now represented as collection days 1 through 45, can be gathered into a single column, Day.
cv_data <- gather(wide_data, 'Day', 'Infections', 5:49)
cv_data %<>% mutate(Day = as.integer(Day))
head(cv_data)
## Province Country Lat Long Day Infections
## 1 Anhui Mainland China 31.8257 117.2264 1 1
## 2 Beijing Mainland China 40.1824 116.4142 1 14
## 3 Chongqing Mainland China 30.0572 107.8740 1 6
## 4 Fujian Mainland China 26.0789 117.9874 1 1
## 5 Gansu Mainland China 36.0611 103.8343 1 0
## 6 Guangdong Mainland China 23.3417 113.4244 1 26
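As an aside, the 45 manual renames followed by the gather could be replaced by a single pivot_longer() call that derives the day number from the column order. A minimal sketch, assuming tidyr >= 1.0 is available (cv_alt is just an illustrative name):

# Sketch: pivot all date columns at once; column order is chronological, so position gives the day
cv_alt <- read.csv('time_series_19-covid-Confirmed.csv') %>%
  rename(Province = Province.State, Country = Country.Region) %>%
  pivot_longer(starts_with("X"), names_to = "Date", values_to = "Infections") %>%
  mutate(Day = match(Date, unique(Date))) %>%
  select(Province, Country, Lat, Long, Day, Infections)
head(cv_alt)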
The insights we are looking for in this data are: how the number of confirmed cases changes over time, how cases change over time at the country level, and the latest number of cases.
To find the change over time, we first need to find the total infections by day.
day_count <- rep(NA, 45)
for(i in 1:45){
day_count[i] = sum(cv_data$Infections[cv_data$Day == i])
}
plot(1:45, day_count, xlab = 'Day', ylab = 'Confirmed Infections')
In this plot, the growth in infections appears roughly linear, but this is misleading if you are trying to understand how rapidly the virus spreads. The plot includes all countries in the data frame, some of which had not yet been exposed to the virus on a given day, and the vast majority of the infections come from China alone. It would be more effective to make plots for individual countries.
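Since ggplot2 is already loaded, the daily totals could also be computed without a loop, using a grouped summary. A minimal sketch of that alternative:

# Sketch: total confirmed infections per day, plotted as a line
cv_data %>%
  group_by(Day) %>%
  summarise(Total = sum(Infections)) %>%
  ggplot(aes(x = Day, y = Total)) +
  geom_line() +
  labs(x = 'Day', y = 'Confirmed Infections')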
The following loop plots infections over time for each individual country in the dataset that has at least 20 confirmed infections.
day_count <- rep(NA, 45)
countries = unique(cv_data$Country)
for (c in 1:length(countries)){
d <- cv_data %>%
filter(Country == countries[c])
if(max(d$Infections) < 20){
next
}
for(i in 1:45){
day_count[i] = sum(d$Infections[d$Day == i])
}
plot(1:45, day_count, xlab = 'Day', ylab = 'Confirmed Infections', main = countries[c])
}
Most countries show roughly exponential growth in infections over time, while a few, such as Thailand and Taiwan, appear closer to linear growth.
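An alternative to looping over base-R plots is a single faceted ggplot. A minimal sketch, applying a similar cutoff to country-level totals:

# Sketch: country-level infection curves, one panel per country reaching at least 20 infections
cv_data %>%
  group_by(Country, Day) %>%
  summarise(Infections = sum(Infections)) %>%   # result remains grouped by Country
  filter(max(Infections) >= 20) %>%             # drop countries that never reach 20 infections
  ggplot(aes(x = Day, y = Infections)) +
  geom_line() +
  facet_wrap(~ Country, scales = "free_y")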
current_infections <- sum(cv_data$Infections[cv_data$Day == 45])
cat("The latest count of confirmed infections is:", current_infections)
## The latest count of confirmed infections is: 101800
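The same latest count could also be obtained without hard-coding day 45. A minimal sketch:

# Sketch: filter on the maximum day instead of hard-coding 45
cv_data %>%
  filter(Day == max(Day)) %>%
  summarise(Latest_Infections = sum(Infections))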
wide <- read.csv('studentloans.csv')
wide
## Amount.Borrowed White Black_or_African_American Hispanic_or_Latino Asian
## 1 All Amounts 60% 17% 14% 4%
## 2 < $10000 56% 18% 18% 3%
## 3 $10001 to $20000 65% 13% 13% 4%
## 4 $20001 to $40000 62% 17% 12% 4%
## 5 $40001 to $60000 57% 22% 11% 3%
## 6 $60001 to $80000 61% 22% 8% 4%
## 7 $80001 to $100000 58% 26% 6% 5%
## 8 > $100000 59% 16% 8% 11%
## Other
## 1 5%
## 2 5%
## 3 5%
## 4 5%
## 5 6%
## 6 4%
## 7 5%
## 8 6%
This data frame gives, for each range of amount borrowed, the proportion of loan holders belonging to each race. I will start tidying the data by converting the percentage strings into numeric values.
wide %<>%
rename(Black = Black_or_African_American, Hispanic = Hispanic_or_Latino) %>%
mutate(White = str_remove_all(White, "%")) %>%
mutate(Black = str_remove_all(Black, "%")) %>%
mutate(Hispanic = str_remove_all(Hispanic, "%")) %>%
mutate(Asian = str_remove_all(Asian, "%")) %>%
mutate(Other = str_remove_all(Other, "%")) %>%
mutate(White = as.numeric(White)) %>%
mutate(Black = as.numeric(Black)) %>%
mutate(Hispanic = as.numeric(Hispanic)) %>%
mutate(Asian = as.numeric(Asian)) %>%
mutate(Other = as.numeric(Other))
wide
## Amount.Borrowed White Black Hispanic Asian Other
## 1 All Amounts 60 17 14 4 5
## 2 < $10000 56 18 18 3 5
## 3 $10001 to $20000 65 13 13 4 5
## 4 $20001 to $40000 62 17 12 4 5
## 5 $40001 to $60000 57 22 11 3 6
## 6 $60001 to $80000 61 22 8 4 4
## 7 $80001 to $100000 58 26 6 5 5
## 8 > $100000 59 16 8 11 6
The values are percentages, so we divide them by 100 to obtain proportions.
wide %<>%
mutate(White = White / 100) %>%
mutate(Black = Black / 100) %>%
mutate(Hispanic = Hispanic / 100) %>%
mutate(Asian = Asian / 100) %>%
mutate(Other = Other / 100)
wide
## Amount.Borrowed White Black Hispanic Asian Other
## 1 All Amounts 0.60 0.17 0.14 0.04 0.05
## 2 < $10000 0.56 0.18 0.18 0.03 0.05
## 3 $10001 to $20000 0.65 0.13 0.13 0.04 0.05
## 4 $20001 to $40000 0.62 0.17 0.12 0.04 0.05
## 5 $40001 to $60000 0.57 0.22 0.11 0.03 0.06
## 6 $60001 to $80000 0.61 0.22 0.08 0.04 0.04
## 7 $80001 to $100000 0.58 0.26 0.06 0.05 0.05
## 8 > $100000 0.59 0.16 0.08 0.11 0.06
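The two column-by-column mutate chains above could also be written as a single mutate with across(). A minimal sketch, assuming dplyr >= 1.0 for across() (wide_alt is just an illustrative name):

# Sketch: strip the percent signs and rescale all race columns in one step
wide_alt <- read.csv('studentloans.csv') %>%
  rename(Black = Black_or_African_American, Hispanic = Hispanic_or_Latino) %>%
  mutate(across(White:Other, ~ as.numeric(str_remove(.x, "%")) / 100))
wide_alt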
We can now transform the data from wide to long by gathering the race columns into a single column, Race.
sl <- gather(wide, 'Race', 'Proportion', 2:6)
head(sl)
## Amount.Borrowed Race Proportion
## 1 All Amounts White 0.60
## 2 < $10000 White 0.56
## 3 $10001 to $20000 White 0.65
## 4 $20001 to $40000 White 0.62
## 5 $40001 to $60000 White 0.57
## 6 $60001 to $80000 White 0.61
The analysis suggested for this dataset would require outside data, so I will instead use the data to find which races hold the highest and lowest proportions of loans in each borrowing range.
Races holding the highest proportion in each amount-borrowed range:
sl %>%
group_by(Amount.Borrowed) %>%
filter(Proportion == max(Proportion))
## # A tibble: 8 x 3
## # Groups: Amount.Borrowed [8]
## Amount.Borrowed Race Proportion
## <fct> <chr> <dbl>
## 1 All Amounts White 0.6
## 2 < $10000 White 0.56
## 3 $10001 to $20000 White 0.65
## 4 $20001 to $40000 White 0.62
## 5 $40001 to $60000 White 0.570
## 6 $60001 to $80000 White 0.61
## 7 $80001 to $100000 White 0.580
## 8 > $100000 White 0.59
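Equivalently, slice_max() (dplyr >= 1.0) returns the top row per group and keeps ties by default. A minimal sketch:

# Sketch: highest-proportion race per borrowing range via slice_max()
sl %>%
  group_by(Amount.Borrowed) %>%
  slice_max(Proportion, n = 1)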
Races holding the lowest proportion in each amount-borrowed range:
sl %>%
group_by(Amount.Borrowed) %>%
filter(Proportion == min(Proportion))
## # A tibble: 10 x 3
## # Groups: Amount.Borrowed [8]
## Amount.Borrowed Race Proportion
## <fct> <chr> <dbl>
## 1 All Amounts Asian 0.04
## 2 < $10000 Asian 0.03
## 3 $10001 to $20000 Asian 0.04
## 4 $20001 to $40000 Asian 0.04
## 5 $40001 to $60000 Asian 0.03
## 6 $60001 to $80000 Asian 0.04
## 7 $80001 to $100000 Asian 0.05
## 8 $60001 to $80000 Other 0.04
## 9 $80001 to $100000 Other 0.05
## 10 > $100000 Other 0.06
The Asian and Other groups are tied for the lowest proportion in the $60,001 to $80,000 and $80,001 to $100,000 ranges, while Other alone has the lowest proportion for loans above $100,000.