The solution for Task 1a and 1b must be programmatic and not resolved using copy/paste. You may use programs such as Access, SQL, SPSS or SAS. Save the program and logic you used to complete these tasks with your final files. You could also include your assumptions and explanations of your methodology if you would like.
library(kableExtra)# To create tables in R markdown files
library(tidyverse)# To manipulate data files
library(ggplot2)# To create Plots and Diagrams
library(cowplot)# To put tables together
library(stringi)# To search for patter midst characters string/text/natural language
library(stringr) # for data cleaning and correct implementation of common string manipulation
library(readxl)# To read excel files
library(tidyr)# To simplify the process of creating tidy data, which is defined as a stadard way of storing data that is used wherever possible throughout the tidyverse.
Table 1:
table1 <- data.frame(
Field_Name = c("SchoolCode","CourseCd","ClassroomCd","StudentCount"),
Type = c("String","String","String","Numeric"),
Length = c("6","6","25","3"),
Description = c("Code that uniquely identifies a school","Code that uniquely identifies a course","Code that uniquely identifies a specific classroom in the school. This code is unique by SchoolCode","Number of students in the classroom")
)
kbl(table1)%>%
kable_classic_2(full_width=F)%>%
column_spec(1, border_right=T)%>%
column_spec(2, border_right=T)%>%
column_spec(3, border_right=T)%>%
column_spec(4)
| Field_Name | Type | Length | Description |
|---|---|---|---|
| SchoolCode | String | 6 | Code that uniquely identifies a school |
| CourseCd | String | 6 | Code that uniquely identifies a course |
| ClassroomCd | String | 25 | Code that uniquely identifies a specific classroom in the school. This code is unique by SchoolCode |
| StudentCount | Numeric | 3 | Number of students in the classroom |
Task 1a: Create a three-year view of total reported enrollment in one report using the separate October files provided. This report will be used by individuals need to examine the trends.
# Uploading 2012 data in R
data_2012 <- read_excel("ERA Task_October_2012.xlsx", range = "C16:E141")
# Giving the Column Names
colnames(data_2012) <- c("LEA", "District_Agency_Name", "TRS")
# Checking 2012 data
names(data_2012)
[1] "LEA" "District_Agency_Name" "TRS"
# Uploading 2013 data in R
data_2013 <- read_excel("ERA Task_October_2013.xlsx", range = "A14:C142")
# Giving the Column Names
colnames(data_2013) <- c("LEA", "District_Agency_Name", "TRS")
# Subsetting the 2013 dataset to match it with 2012 & 2014 Data Files
#data_2013 <- select(data_2013, -Empty)
# Checking 2013 data
names(data_2013)
[1] "LEA" "District_Agency_Name" "TRS"
# Uploading 2014 data in R
data_2014 <- read_excel("ERA Task_October_2014.xlsx", range = "A15:C150")
colnames(data_2014) <- c("LEA", "District_Agency_Name", "TRS")
# Checking 2014 data
names(data_2014)
[1] "LEA" "District_Agency_Name" "TRS"
The datafiles have been uploaded to R. I have named the columns, and they are set appropriately. Dataset from the year 2013 had second and third columns in Excel Spreadsheet merged together. R separates them as two different columns. Thus, the third column was not useful to me because it did not have any data. I got rid of the column. Now, the data set for all of the years (i.e., 2012, 2013, and 2014) have same column names. How do the data look? Let’s have a quick look. I want to check 10-rows of data per year.
# Year 2012
head((data_2012),10)
# A tibble: 10 x 3
LEA District_Agency_Name TRS
<chr> <chr> <dbl>
1 002 Allen Parish 4350
2 003 Ascension Parish 21090
3 004 Assumption Parish 3830
4 005 Avoyelles Parish 6060
5 006 Beauregard Parish 6160
6 007 Bienville Parish 2340
7 008 Bossier Parish 21680
8 009 Caddo Parish 41640
9 010 Calcasieu Parish 32690
10 011 Caldwell Parish 1650
# Year 2013
head((data_2013),10)
# A tibble: 10 x 3
LEA District_Agency_Name TRS
<chr> <chr> <chr>
1 002 Allen Parish 4368
2 003 Ascension Parish 21525
3 004 Assumption Parish 3892
4 005 Avoyelles Parish 5952
5 006 Beauregard Parish 6092
6 007 Bienville Parish 2279
7 008 Bossier Parish 22018
8 009 Caddo Parish 41252
9 010 Calcasieu Parish 32872
10 011 Caldwell Parish 1657
# Year 2014
head((data_2014),10)
# A tibble: 10 x 3
LEA District_Agency_Name TRS
<chr> <chr> <dbl>
1 002 Allen Parish 4370
2 003 Ascension Parish 21866
3 004 Assumption Parish 3729
4 005 Avoyelles Parish 5968
5 006 Beauregard Parish 6013
6 007 Bienville Parish 2274
7 008 Bossier Parish 22206
8 009 Caddo Parish 41146
9 010 Calcasieu Parish 32595
10 011 Caldwell Parish 1763
2013 data was troublesome. I went ahead and created a user defined function (UDF) in Excel to get rid of spaces and special characters in the dataset.
Function RemoveSpecChar(sInput As String) As String
Dim sSpecChar As String
Dim i As Long
sSpecChar = "/:*?™""®<>|.&@# (_+`©~);-+=^$!,’"
For i = 1 To Len(sSpecChar)
sInput = Replace\((sInput, Mid\)(sSpecChar, i, 1), "")
Next
RemoveSpecChar = sInput
End Function
We are able to see how data look for each of the study year. Although the data have similar column names, their types are not what they have to be. The “TRS” column is denoted as either “character” or “double”, but they should be numeric. District or Agency name on the other hand is denoted as character. It looks fine. Let’s change the “TRS” columns to numeric types.
# Year 2012
data_2012$TRS <- as.numeric(data_2012$TRS)
# Checking
class(data_2012$TRS)
[1] "numeric"
# Year 2013
data_2013$TRS <- as.numeric(data_2013$TRS)
# Checking
class(data_2013$TRS)
[1] "numeric"
# Year 2014
data_2014$TRS <- as.numeric(data_2014$TRS)
# Checking
class(data_2014$TRS)
[1] "numeric"
Now that the columns have been straight forward, I want compare these data set. For trend analyses, I need to merge these data files. I want to check if they have equal number of data points, if there is any specific ID that would help me combine these files, and if the ids’ match in all of these data set. For this I am going to draw the summary:
# Year 2012
summary(data_2012)
LEA District_Agency_Name TRS
Length:125 Length:125 Min. : 30
Class :character Class :character 1st Qu.: 520
Mode :character Mode :character Median : 2010
Mean : 5626
3rd Qu.: 6060
Max. :45940
# Year 2013
summary(data_2013)
LEA District_Agency_Name TRS
Length:128 Length:128 Min. : 39
Class :character Class :character 1st Qu.: 548
Mode :character Mode :character Median : 1978
Mean : 5539
3rd Qu.: 5626
Max. :46661
# Year 2014
summary(data_2014)
LEA District_Agency_Name TRS
Length:135 Length:135 Min. : 38.0
Class :character Class :character 1st Qu.: 527.5
Mode :character Mode :character Median : 1723.0
Mean : 5286.3
3rd Qu.: 5249.0
Max. :48126.0
The results show that, there have been total of 125 rows in the year 2012, which is 128 in 2013, and 135 in the year 2014. They unequal data points. I can see differential amount of missing data in these data set. However, I don’t much worry about them now. I am now going to merge this files in phases. Merge 2212 and 2013 first and merge 2014 afterward.
m1_data <- merge(data_2012, data_2013, by = "LEA")
#m1_data
The data set have been merged but R gave unique column names. I want to fix them first before I combine the data to 2014 data.
m1_data <- select(m1_data, LEA, District_Agency_Name.y, TRS.x, TRS.y)
head(m1_data)
LEA District_Agency_Name.y TRS.x TRS.y
1 002 Allen Parish 4350 4368
2 003 Ascension Parish 21090 21525
3 004 Assumption Parish 3830 3892
4 005 Avoyelles Parish 6060 5952
5 006 Beauregard Parish 6160 6092
6 007 Bienville Parish 2340 2279
I got rid of one of the repeated columns. I now want to merge 2014 data.
complete_data <- merge(m1_data, data_2014, by = "LEA")
#complete_data
Data have been merged. It looks like the ID’s match as well. I am now going to get rid of the repeated column and rename the columns for easy read of the combined data file.
complete_data <- select(complete_data, LEA, District_Agency_Name, TRS.x, TRS.y, TRS)
#head(complete_data)
# Overwriting the Column Names
colnames(complete_data) <- c("LEA", "District_Agency_Name", "Year_2012", "Year_2013", "Year_2014")
Task 1a: Report in a Single Table
# Some of the names of the schools are too long. I want to truncate them so that I can fit all the column in a single table side by side
library(stringr)
complete_data$District_Agency_Name <- lapply(complete_data$District_Agency_Name, str_trunc, 40, ellipsis = "")
complete_data
LEA District_Agency_Name Year_2012 Year_2013 Year_2014
1 002 Allen Parish 4350 4368 4370
2 003 Ascension Parish 21090 21525 21866
3 004 Assumption Parish 3830 3892 3729
4 005 Avoyelles Parish 6060 5952 5968
5 006 Beauregard Parish 6160 6092 6013
6 007 Bienville Parish 2340 2279 2274
7 008 Bossier Parish 21680 22018 22206
8 009 Caddo Parish 41640 41252 41146
9 010 Calcasieu Parish 32690 32872 32595
10 011 Caldwell Parish 1650 1657 1763
11 012 Cameron Parish 1290 1289 1306
12 013 Catahoula Parish 1520 1477 1465
13 014 Claiborne Parish 1960 1746 1728
14 015 Concordia Parish 3820 3590 3557
15 016 DeSoto Parish 5240 5145 5122
16 017 East Baton Rouge Parish 42720 42375 42618
17 018 East Carroll Parish 1150 1114 1098
18 019 East Feliciana Parish 2010 1994 2004
19 020 Evangeline Parish 6150 6184 6160
20 021 Franklin Parish 3210 3154 3097
21 022 Grant Parish 3320 3173 3216
22 023 Iberia Parish 13960 14099 14048
23 024 Iberville Parish 4610 4697 4735
24 025 Jackson Parish 2300 2345 2359
25 026 Jefferson Parish 45940 46661 48126
26 027 Jefferson Davis Parish 5930 5946 5871
27 028 Lafayette Parish 30810 31171 30205
28 029 Lafourche Parish 14690 14787 14875
29 030 LaSalle Parish 2640 2673 2641
30 031 Lincoln Parish 6650 6615 6628
31 032 Livingston Parish 25550 25826 25992
32 033 Madison Parish 1880 1492 1417
33 034 Morehouse Parish 4430 4569 4449
34 035 Natchitoches Parish 6780 6774 6565
35 036 Orleans Parish 11360 12541 13352
36 037 Ouachita Parish 20170 20130 20166
37 038 Plaquemines Parish 3950 4066 4105
38 039 Pointe Coupee Parish 2940 2916 3114
39 040 Rapides Parish 24190 24288 23840
40 041 Red River Parish 1480 1538 1494
41 042 Richland Parish 3470 3338 3195
42 043 Sabine Parish 4360 4405 4382
43 044 St. Bernard Parish 6750 7189 7466
44 045 St. Charles Parish 9790 9727 9757
45 046 St. Helena Parish 790 805 1126
46 047 St. James Parish 3830 3865 3868
47 048 St. John the Baptist Parish 6100 6031 5951
48 049 St. Landry Parish 14880 14919 14580
49 050 St. Martin Parish 8420 8345 8419
50 051 St. Mary Parish 9500 9434 9401
51 052 St. Tammany Parish 37880 38044 38193
52 053 Tangipahoa Parish 19890 20093 20041
53 054 Tensas Parish 680 671 653
54 055 Terrebonne Parish 18920 18799 18657
55 056 Union Parish 2370 2269 2205
56 057 Vermilion Parish 9530 9599 9593
57 058 Vernon Parish 9870 9716 9372
58 059 Washington Parish 5470 5409 5376
59 060 Webster Parish 6710 6665 6539
60 061 West Baton Rouge Parish 3930 3968 3988
61 062 West Carroll Parish 2210 2210 2126
62 063 West Feliciana Parish 2160 2186 2144
63 064 Winn Parish 2510 2545 2480
64 065 City of Monroe School District 8680 8557 8438
65 066 City of Bogalusa School District 2140 1962 1757
66 067 Zachary Community School District 5360 5519 5393
67 068 City of Baker School District 1750 1673 1459
68 069 Central Community School District 4330 4418 4585
69 101 Special School District 510 478 425
70 300 RSD-New Beginnings Schools Foundation 2010 2019 2107
71 302 Louisiana School For Math Science & the 300 298 308
72 304 LA Schools for the Deaf and Visually Imp 210 207 193
73 306 Louisiana Special Education Center 30 39 38
74 318 LSU Laboratory School 1370 1398 1414
75 319 Southern University Lab School 400 555 1040
76 321 New Vision Learning Academy 370 376 383
77 328 Southwest Louisiana Charter Acad. Founda 560 679 967
78 329 V. B. Glencoe Charter School 370 367 377
79 331 International School of Louisiana 740 826 908
80 333 Avoyelles Public Charter School 690 722 721
81 334 New Orleans Center for Creative Arts 110 177 242
82 336 Delhi Charter School 670 924 832
83 337 Belle Chasse Academy, Inc. 970 946 933
84 339 Milestone Academy 410 336 428
85 340 The MAX Charter School 100 116 110
86 341 D'Arbonne Woods Charter School 560 724 806
87 343 Community School for Apprenticeship Lear 1590 2071 2246
88 344 Voices for International Business & Educ 410 476 546
89 345 Louisiana Connections Academy 1200 1200 1723
90 346 Lake Charles Charter Academy Foundation, 770 872 860
91 347 Lycee Francais de la Nouvelle-Orleans 340 400 467
92 348 New Orleans Military/Maritime Academy 210 360 540
93 349 Outreach Community Development Corporati 170 196 220
94 360 RSD-Educators for Quality Alternatives 150 159 164
95 361 RSD-Crescent Leadership Academy 150 227 162
96 363 RSD-Crescent City Schools 520 924 1015
97 364 RSD-Community Leaders Advocating Student 550 575 597
98 366 RSD-Lagniappe Academies of New Orleans 130 162 180
99 367 RSD-Spirit of Excellence Academy 370 355 387
100 368 RSD-Morris Jeff Community School 310 412 522
101 369 RSD-ReNEW-Reinventing Education, Inc. 2430 3332 3432
102 371 Shreveport Charter School, Inc. 520 527 533
103 373 RSD-Arise Academy 380 905 848
104 374 RSD-Success Preparatory Academy 410 457 506
105 381 RSD-Akili Academy of New Orleans 380 472 487
106 382 RSD-Collegiate Academies 580 854 1052
107 384 RSD-Miller-McCoy Academy for Math and Bu 380 325 365
108 385 RSD-New Orleans College Preparatory Acad 860 1173 1276
109 388 RSD-Broadmoor Charter School Board 610 647 628
110 389 RSD-Pelican Educational Foundation 520 566 564
111 390 RSD-Dryades YMCA 640 570 514
112 391 RSD-Friends of King 1140 1209 1229
113 392 RSD-New Orleans Charter Schools Foundati 420 447 448
114 393 RSD-Choice Foundation 1840 1874 1861
115 395 RSD-Algiers Charter Schools Association 3570 4129 4450
116 396 Recovery School District-LDE 6720 3465 194
117 397 RSD-Institute for Academic Excellence 480 470 399
118 398 RSD-Knowledge is Power Program (KIPP) N. 3210 3755 4022
119 399 RSD-FirstLine Schools, Inc. 2560 2802 3114
120 A02 Office of Juvenile Justice 190 288 284
As can be seen, there are 120 data points. They all belong to different school districts. It is a long list, thus, things are not as clear. I don’t need the summary of the first two columns because they simply represent the District number and district name. Three year summary would provide us with some crucial information. Let’s see the summary of the data:
summary(complete_data[,-(1:2)])
Year_2012 Year_2013 Year_2014
Min. : 30.0 Min. : 39 Min. : 38.0
1st Qu.: 557.5 1st Qu.: 665 1st Qu.: 646.8
Median : 2185.0 Median : 2198 Median : 2135.0
Mean : 5845.9 Mean : 5887 Mean : 5887.4
3rd Qu.: 6112.5 3rd Qu.: 5972 3rd Qu.: 5955.2
Max. :45940.0 Max. :46661 Max. :48126.0
Now, we see the a much better picture. The school with the minimum number of students have been changed over the years. In addition, the mean number of students per school has been increasing, as well. The change, however, is not a big one. Schools seem to have higher enrollment over the years. Now, lets order a density plot and compare:
ggplot()+
geom_freqpoly(data=complete_data, aes(Year_2012, ..density..), fill="blue", color="darkblue")+
geom_freqpoly(data=complete_data, aes(Year_2013, ..density..), fill="red", color="darkred")+
geom_freqpoly(data=complete_data, aes(Year_2014, ..density..), fill="purple", color="purple")+ xlab("Number of Students")
We can see that most of the schools have less than 10,000 students. Number of schools having lower than 10,000 students have been decreasing over time.
This is not the final thing I wanted my data to spell. I would thus, change the wide data into the long data to do a trend analysis. First of all, let’s change the data outlook. I want to create a new variable that will subsume 2012, 2013, and 2014 student numbers as subcategories. I will call it “Number_of_students” and my key will be the name of the school districts/agency.
long_data <- gather(data = complete_data, key = Year, value = Number_of_Students,
Year_2012:Year_2013:Year_2014, factor_key = TRUE)
head(arrange(long_data, LEA), 10)
LEA District_Agency_Name Year Number_of_Students
1 002 Allen Parish Year_2012 4350
2 002 Allen Parish Year_2013 4368
3 002 Allen Parish Year_2014 4370
4 003 Ascension Parish Year_2012 21090
5 003 Ascension Parish Year_2013 21525
6 003 Ascension Parish Year_2014 21866
7 004 Assumption Parish Year_2012 3830
8 004 Assumption Parish Year_2013 3892
9 004 Assumption Parish Year_2014 3729
10 005 Avoyelles Parish Year_2012 6060
I can see that my new dataset has only four columns that includes two new columns, i.e., Year and Number_of_Students. Same school district has three data points each representing year 2012 through 2014.
Now lets see the student number time-series:
library(plotly)
a <- ggplot()+
geom_line(data = long_data, aes(x = Number_of_Students, y = Year, group = Number_of_Students, colour = Number_of_Students))
ggplotly(a)
Task 1b: The ClassFile.txt contains a list of all courses taught in all schools along with the number of students in each course.
Using the CourseCd_with_Names.xlsx file, add the names of the courses to the Class File.
How many classes of Algebra II are being taught by school? The finished product should be in the following layout.
table2 <- data.frame(
SchoolCode = c("-","-"),
Numbe_of_Algebra_II_Classrooms = c("-", "-")
)
kbl(table2)%>%
kable_classic_2(full_width=F)%>%
column_spec(1, border_right=T)%>%
column_spec(2)
| SchoolCode | Numbe_of_Algebra_II_Classrooms |
|---|---|
|
|
|
|
|
|
# Third Table
table3 <- data.frame(
CourseName = c("-", "-"),
ALGEBRA_II = c("-", "-")
)
kbl(table3)%>%
kable_classic_2(full_width=F)%>%
column_spec(1, border_right=T)%>%
column_spec(2)
| CourseName | ALGEBRA_II |
|---|---|
|
|
|
|
|
|
table3 <- data.frame(
SchoolCode = c("-", "-"),
Average_Class_Size = c("-", "-")
)
kbl(table3)%>%
kable_classic_2(full_width=F)%>%
column_spec(1, border_right=T)%>%
column_spec(2)
| SchoolCode | Average_Class_Size |
|---|---|
|
|
|
|
|
|