#install.packages("Statamarkdown")
library(Statamarkdown)
<- "C:/Program Files/Stata18/StataSE-64.exe"
stataexe ::opts_chunk$set(engine.path=list(stata=stataexe)) knitr
Impact of Financial Support on Academic performance of Students in the Msc Epidemiology and Public health (Honours) Programs at Wits University
Group 8 Stata Assignment
Setup the Stata Engine in R
- this involves loading the package
Statamarkdown
- defining the Stata installation path
Load your dataset
Loading data for First form
- Loading the Instrument Student Information dataset from the first form
- this dataset also contains all demographics for the students
/* load the data file from csv */
"Datasets\Academic_Performance.csv"
import delimited
save "Dofiles\Academic_Performance.dta"
(encoding automatically selected: ISO-8859-1)
(11 vars, 51 obs)
file Dofiles\Academic_Performance.dta already exists
r(602);
r(602);
Load dataset for second form
- load the academic performance dataset from the second form
/* load the data file from csv */
clear
"Datasets\Student_Information.csv"
import delimited save "Dofiles\Student_Information.dta" , replace
(encoding automatically selected: ISO-8859-1)
(41 vars, 51 obs)
file Dofiles\Student_Information.dta saved
Merging data (1:1 and/or 1:m) and appending data sets.
Merging datasets together
- the two datasets from the two forms are merged using
inner join
or one to one merging using the `record_id`` Id column.
/*Merging the two datasets (student information and academic performance datasets)*/
use "Dofiles\Academic_Performance.dta"
merge 1:1 record_id using "Dofiles\Student_Information.dta"
save "Datasets\Combined_datasets.dta" , replace
> sets)*/
Result Number of obs
-----------------------------------------
Not matched 0
Matched 51 (_merge==3)
-----------------------------------------
file Datasets\Combined_datasets.dta saved
Appending an additional dataset to the merged data
- Append the merged dataset with 4 new observations (Students who responded late)
/*Merging the two datasets (student information and academic performance datasets)*/
use "Datasets\Combined_datasets.dta"
append using "Datasets\Tobeappended.dta"
save "Datasets\Combined_datasets.dta" , replace
(label _merge already defined)
file Datasets\Combined_datasets.dta saved
Converting dates from either string or different columns to Stata friendly dates.
Explore the date of birth (dob) variable
use "Datasets\Combined_datasets.dta"
list dob in 1/5
| dob |
|------------|
1. | 1999-01-13 |
2. | 2002-12-05 |
3. | 2000-04-27 |
4. | 1997-08-28 |
5. | 2002-08-16 |
+------------+
Comment + the variable is not in the format that stata understands hence we need to change it to a date format since it is a string
Changing to proper date format
/*Change the string format to date */
use "Datasets\Combined_datasets.dta",clear
generate _date_ = date(dob,"YMD")
format _date_ %dM_d,_CY
drop dob
rename _date_ dob
save "Datasets\Combined_datasets.dta" , replace
file Datasets\Combined_datasets.dta saved
- checking for consistency
/*Compare the new variable*/
use "Datasets\Combined_datasets.dta"
list record_id name dob in 1/5
| record~d name dob |
|----------------------------------------|
1. | 1 Pontso January 13, 1999 |
2. | 3 Thandeka December 5, 2002 |
3. | 4 Mojaki April 27, 2000 |
4. | 6 karabo August 28, 1997 |
5. | 7 Fezeka August 16, 2002 |
+----------------------------------------+
Demonstrate exporting and importing data back into Stata using any two data formats
of your choice.
*Exporting the dataset* use "Datasets\Combined_datasets.dta",clear
export delimited "Datasets\Exported_Combined.dta"
*Importing the dataset back into stata*"Datasets\Exported_Combined.dta", clear import delimited
file Datasets\Exported_Combined.dta already exists
r(602);
r(602);
Saving data sets, labelling variables and creating value labels and making notes on dataset.
- label the variables Gender , socioeconomic_background, residential location
**********************************DATA CLEANING**************************************************************
*Labeling variables Gender, socioeconomic_background, residential location*
*Labeling gender*use "Datasets\Combined_datasets.dta"
label define gender 1 "Female" 2 "Male"
label value gender gender
background*
*Label socioeconomic
"Low-income household ( R0 - R100 000 per annum)" 2 "Middle-incomehousehold (R100 000 - R500 000 per annum)" 3 "High-income household (>R500 000 per annum)"
lab def socioeconomic_background 1
lab val socioeconomic_background socioeconomic_background
*Label residential location*
define residential_location 1"Eastern Cape" 2"Free State" 3"Gauteng" 4"KwaZulu-Natal" 5"Limpopo" 6"Mpumalanga" 7"Northern Cape" 8"North West" 9"Western Cape"
lab value residential_location residential_location
lab
save "Datasets\Combined_datasets.dta" , replace
file Datasets\Combined_datasets.dta saved
Create tables to check for consistency
use "Datasets\Combined_datasets.dta"
tab residential_location
tab gender
tab socioeconomic_background
residential_l |
ocation | Freq. Percent Cum.
--------------+-----------------------------------
Eastern Cape | 7 12.73 12.73
Free State | 2 3.64 16.36
Gauteng | 32 58.18 74.55
KwaZulu-Natal | 6 10.91 85.45
Limpopo | 4 7.27 92.73
Mpumalanga | 2 3.64 96.36
North West | 1 1.82 98.18
Western Cape | 1 1.82 100.00
--------------+-----------------------------------
Total | 55 100.00
gender | Freq. Percent Cum.
------------+-----------------------------------
Female | 39 70.91 70.91
Male | 16 29.09 100.00
------------+-----------------------------------
Total | 55 100.00
socioeconomic_background | Freq. Percent Cum.
----------------------------------------+-----------------------------------
Low-income household ( R0 - R100 000 pe | 29 52.73 52.73
Middle-incomehousehold (R100 000 - R500 | 21 38.18 90.91
High-income household (>R500 000 per an | 5 9.09 100.00
----------------------------------------+-----------------------------------
Total | 55 100.00
Renaming variables in stata
- renaming variables in stata using the code
rename oldname newname
use "Datasets\Combined_datasets.dta"
*Renaming variables*rename languages___1 Afrikaans
rename languages___2 English
rename languages___3 Isixhosa
rename languages___4 Isizulu
rename languages___5 IsiNdebele
rename languages___6 Sesotho
rename languages___7 Setswana
rename languages___8 Siswati
rename languages___9 Tshivenda
rename languages___10 Xitsonga
rename languages___11 Sepedi
rename languages___12 SignLanguage
*labeling variables*label define Afrikaans 1"yes" 0"No", replace
label value Afrikaans Afrikaans
label define English 1"yes" 0"No", replace
label value English English
label define Isixhosa 1"yes" 0"No", replace
label value Isixhosa
label define Isizulu 1"yes" 0"No", replace
label value Isizulu Isizulu
label define IsiNdebele 1"yes" 0"No", replace
label value IsiNdebele IsiNdebele
label define Sesotho 1"yes" 0"No", replace
label value Sesotho Sesotho
label define Setswana 1"yes" 0"No", replace
label value Setswana Setswana
label define Siswati 1"yes" 0"No", replace
label value Siswati Siswati
label define Tshivenda 1"yes" 0"No", replace
label value Tshivenda Tshivenda
label define Xitsonga 1"yes" 0"No", replace
label value Xitsonga Xitsonga
label define Sepedi 1"yes" 0"No", replace
label value Sepedi Sepedi
label define SignLanguage 1"yes" 0"No", replace
label value SignLanguage SignLanguage
save "Datasets\Combined_datasets.dta" , replace
file Datasets\Combined_datasets.dta saved
Checking the language variables for consistency
use "Datasets\Combined_datasets.dta"
list name Afrikaans English Isizulu IsiNdebele IsiNdebele in 1/5
| name Afrika~s English Isizulu IsiNde~e IsiNde~e |
|---------------------------------------------------------------|
1. | Pontso No yes No No No |
2. | Thandeka No yes yes No No |
3. | Mojaki No yes No No No |
4. | karabo No yes No No No |
5. | Fezeka No yes yes No No |
+---------------------------------------------------------------+
Generate a new variable called language
use "Datasets\Combined_datasets.dta"
gen language=.
replace language=1 if Afrikaans==1
replace language=3 if Isixhosa==1
replace language=4 if Isizulu ==1
replace language=5 if IsiNdebele==1
replace language=6 if Sesotho==1
replace language=7 if Setswana ==1
replace language=8 if Siswati==1
replace language=9 if Tshivenda==1
replace language=10 if Xitsonga ==1
replace language=11 if Sepedi ==1
replace language=12 if SignLanguage ==1
label define language 1 "Afrikaans" 2 "English" 3 " Isixhosa " 4 " Isizulu " 5 " IsiNdebele " 6 " Sesotho " 7"Setswana " 8 "Siswati" 9 "Tshivenda" 10 "Xitsonga" 11 "Sepedi" 12 "SignLanguage", replace
label value language language
save "Datasets\Combined_datasets.dta" , replace
(55 missing values generated)
(5 real changes made)
(15 real changes made)
(24 real changes made)
(2 real changes made)
(13 real changes made)
(7 real changes made)
(4 real changes made)
(2 real changes made)
(1 real change made)
(8 real changes made)
(0 real changes made)
file Datasets\Combined_datasets.dta saved
Checking the language variable for consistency
use "Datasets\Combined_datasets.dta"
list record_id language in 1/5
| record~d language |
|---------------------|
1. | 1 Sesotho |
2. | 3 Isizulu |
3. | 4 Sesotho |
4. | 6 Sepedi |
5. | 7 Isizulu |
+---------------------+
Label the Program variable
variable program*
*labeling the use "Datasets\Combined_datasets.dta"
label define program 1 "MSc Epidemiology" 2 "BHSc (Honours) in Public Health"
label value program program
save "Datasets\Combined_datasets.dta" , replace
file Datasets\Combined_datasets.dta saved
Checking the Program variable for consistency
use "Datasets\Combined_datasets.dta"
list record_id name program in 1/10
| record~d name program |
|--------------------------------------------------------|
1. | 1 Pontso MSc Epidemiology |
2. | 3 Thandeka MSc Epidemiology |
3. | 4 Mojaki MSc Epidemiology |
4. | 6 karabo MSc Epidemiology |
5. | 7 Fezeka MSc Epidemiology |
|--------------------------------------------------------|
6. | 9 Akuphelo MSc Epidemiology |
7. | 10 Zandile MSc Epidemiology |
8. | 11 John MSc Epidemiology |
9. | 12 Nelisiwe MSc Epidemiology |
10. | 14 Isikelele BHSc (Honours) in Public Health |
+--------------------------------------------------------+
Generating and manipulating data using the commands/functions:
- gen, encode, egen, recode (with labels on the categories) to create one categorical variable in these three ways and show the results using different text and graphic summaries.
variable age*
*recoding the use "Datasets\Combined_datasets.dta"
recode age (min/18=1 "18 years and below") (19/25=2 "19 to 25 years") (26/32=3 "26 to 32 years") (33/40=4 "33 to 40 years") (41/max=5 "41 and above"), gen(Age_cat) label (agelabel)
using the egen command to generate a variable academic_score*
*egen academic_score_cat= cut( academic_score ), at(0, 49,65,75,100) label
label define academic_score_cat 0 "Fail" 1 "Third Class" 2 "Second Class" 3 "First Class", replace
save "Datasets\Combined_datasets.dta" , replace
(55 differences between age and Age_cat)
(2 missing values generated)
file Datasets\Combined_datasets.dta saved
Check missing data using two or three approaches and explain the outputs or
missingness.
use "Datasets\Combined_datasets.dta"
missing values within the dataset using 3 methods*
*Checking the summarize
misstable of missing values for each variable in the dataset.*
*Provides a summary
misstable patternpatterns of missing values across multiple variables.*
*Displays
misstable treemissing values are related across variables.* *Creates a decision tree that shows how
Obs<.
+------------------------------
| | Unique
Variable | Obs=. Obs>. Obs<. | values Min Max
-------------+--------------------------------+------------------------------
redcap_sur~r | 55 0 | 0 . .
academic_s~e | 2 53 | 20 5 95
attendence~s | 2 53 | 17 5 50
attendance~e | 2 53 | 17 10 100
research_a~s | 2 53 | 2 0 1
workshops_~d | 6 49 | 10 0 9
financial_~s | 2 53 | 2 0 1
other_resp~s | 55 0 | 0 . .
enrolment_~r | 3 52 | 8 2 2028
other_resp~e | 55 0 | 0 . .
sufficienc~e | 27 28 | 8 2 10
financial_~t | 29 26 | 12 1000 15000
income_range | 26 29 | 6 1 6
travel_mode | 32 23 | 5 1 5
other_repl~s | 55 0 | 0 . .
commute_le~h | 3 52 | 15 5 90
language | 6 49 | 10 1 11
academic_s~t | 2 53 | 4 0 3
-----------------------------------------------------------------------------
Missing-value patterns
(1 means complete)
| Pattern
Percent | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
------------+-------------------------------------------------------
<1% | 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
| 1 1
|
38 | 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0
| 0 0
|
33 | 1 1 1 1 1 1 1 1 1 1 0 0 0 1 0 0
| 0 0
|
5 | 1 1 1 1 1 1 1 1 1 0 1 1 1 0 0 0
| 0 0
|
4 | 0 0 0 0 0 0 1 1 0 1 0 0 0 1 0 0
| 0 0
|
4 | 1 1 1 1 1 1 0 1 0 0 0 0 0 0 0 0
| 0 0
|
4 | 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0
| 0 0
|
2 | 1 1 1 1 1 1 0 1 1 1 0 0 0 1 0 0
| 0 0
|
2 | 1 1 1 1 1 1 1 0 0 1 0 0 0 1 0 0
| 0 0
|
2 | 1 1 1 1 1 1 1 0 1 1 0 0 0 1 0 0
| 0 0
|
2 | 1 1 1 1 1 1 1 0 1 1 1 0 1 0 0 0
| 0 0
|
2 | 1 1 1 1 1 1 1 1 0 1 1 1 0 0 0 0
| 0 0
|
2 | 1 1 1 1 1 1 1 1 1 0 1 0 1 0 0 0
| 0 0
|
2 | 1 1 1 1 1 1 1 1 1 1 0 1 0 0 0 0
| 0 0
|
------------+-------------------------------------------------------
100% |
Variables are
Row 1: (1) academic_score (2) academic_score_cat
(3) attendance_rate (4) attendence_days
(5) financial_constraints (6) research_activities
(7) commute_length (8) enrolment_year (9) workshops_attended
(10) language (11) income_range (12) sufficiency_scale
(13) financial_support (14) travel_mode (15) other_replies
(16) other_response
Row 2: (1) other_responses (2) redcap_survey_identifier
(only 7 variables shown)
Nested pattern of missing values
other_~ies other_re~e other_~ses redcap_s~r travel_m~e financia~t sufficie~e
----------------------------------------------------------------------------
100% 100% 100% 100% 58% 11% 4%
7
47 4
44
42 42 42
0
0 0
0
0 0 0 0
0
0 0
0
0 0 0
0
0 0
0
0 0 0 0 0
0
0 0
0
0 0 0
0
0 0
0
0 0 0 0
0
0 0
0
0 0 0
0
0 0
0
0 0 0 0 0 0
0
0 0
0
0 0 0
0
0 0
0
0 0 0 0
0
0 0
0
0 0 0
0
0 0
0
0 0 0 0 0
0
0 0
0
0 0 0
0
0 0
0
0 0 0 0
0
0 0
0
0 0 0
0
0 0
0
0 0 0 0 0 0 0
0
0 0
0
0 0 0
0
0 0
0
0 0 0 0
0
0 0
0
0 0 0
0
0 0
0
0 0 0 0 0
0
0 0
0
0 0 0
0
0 0
0
0 0 0 0
0
0 0
0
0 0 0
0
0 0
0
0 0 0 0 0 0
0
0 0
0
0 0 0
0
0 0
0
0 0 0 0
0
0 0
0
0 0 0
0
0 0
0
0 0 0 0 0
0
0 0
0
0 0 0
0
0 0
0
0 0 0 0
0
0 0
0
0 0 0
0
0 0
0
----------------------------------------------------------------------------
(percent missing listed first)
Introduce duplicates and errors, then check the data for duplicates and clean data of
duplicates and any errors.
clean data of duplicates and any errors*
*Introducing a duplicate and use "Datasets\Combined_datasets.dta"
preserve
keep if student_id == 2751133
tempfile duplicate_copy
save "duplicate_copy"
restore
append using "duplicate_copy"
for duplicates*
*checking
duplicates list student_id record_id name surname
missing value error*
*Introduction a replace student_id = . if record_id == 4
replace student_id = . if record_id == 10
for a missing value error*
*checking summarize student_id
misstable
data for introduced duplicates*
*Cleaning duplicates drop student_id record_id, force
data for missing value error*
*Cleaning replace student_id = 2751133 if record_id == 4
replace student_id = 2331203 if record_id == 10
save "Datasets\Combined_datasets.dta" , replace
(54 observations deleted)
file duplicate_copy.dta already exists
r(602);
r(602);
Use of the dtable or other appropriate commands for Bivariate (Table 1) results and/or
even regression (without any explanation), save the word based table as a separate attachment
for Bivariate (Table 1) results and/or even regression*
* dtable
use "Datasets\Combined_datasets.dta"
by(program, tests testnotes) continuous(age attendence_days attendance_rate ) factor(gender residential_location income_range) dtable,
file duplicate_copy.dta already exists
r(602);
note: using test regress across levels of program for age, attendence_days,
and attendance_rate.
note: using test pearson across levels of program for gender,
residential_location, and income_range.
-------------------------------------------------------------------------------------------
program
MSc Epidemiology BHSc (Honours) in Public Health Total Test
-------------------------------------------------------------------------------------------
N 31 (56.4%) 24 (43.6%) 55 (100.0%)
age 26.613 (8.011) 23.083 (2.535) 25.073 (6.443) 0.043
attendence_days 37.933 (13.419) 39.957 (8.031) 38.811 (11.346) 0.525
attendance_rate 75.867 (26.838) 79.913 (16.062) 77.623 (22.692) 0.525
gender
Female 22 (71.0%) 17 (70.8%) 39 (70.9%) 0.991
Male 9 (29.0%) 7 (29.2%) 16 (29.1%)
residential_location
Eastern Cape 3 (9.7%) 4 (16.7%) 7 (12.7%) 0.097
Free State 2 (6.5%) 0 (0.0%) 2 (3.6%)
Gauteng 20 (64.5%) 12 (50.0%) 32 (58.2%)
KwaZulu-Natal 2 (6.5%) 4 (16.7%) 6 (10.9%)
Limpopo 4 (12.9%) 0 (0.0%) 4 (7.3%)
Mpumalanga 0 (0.0%) 2 (8.3%) 2 (3.6%)
North West 0 (0.0%) 1 (4.2%) 1 (1.8%)
Western Cape 0 (0.0%) 1 (4.2%) 1 (1.8%)
income_range
1 4 (28.6%) 8 (53.3%) 12 (41.4%) 0.582
2 3 (21.4%) 4 (26.7%) 7 (24.1%)
3 1 (7.1%) 1 (6.7%) 2 (6.9%)
4 3 (21.4%) 1 (6.7%) 4 (13.8%)
5 2 (14.3%) 1 (6.7%) 3 (10.3%)
6 1 (7.1%) 0 (0.0%) 1 (3.4%)
-------------------------------------------------------------------------------------------
Drawing four different types of graphs and place them on one graph.
use "Datasets\Combined_datasets.dta"
*Plotting Graphs*
histogram age ,title("Histogram of Age") note( "Source : Data source unknown") ytitle("Fraction") normal ytitle("Density") name(graph1 , replace)
graph hbar, over(residential_location) name(graph2 , replace)
graph pie, over(program) title(Proportion By Program) name(graph3 , replace)
graph box age, over(gender) name(graph4 ,replace)
graph combine graph1 graph2 graph3 graph4,title("Four combined graphs") saving("all_graphs.gph" ,replace)
quietly graph export all_graphs.svg, replace
file duplicate_copy.dta already exists
r(602);
(bin=7, start=18, width=5)
file all_graphs.gph saved
Question 3: Drawing a map in Stata to display data
stats SA population by districts for 2018*****
****Load "Datasets\statssa_population_districts_2018.csv_", clear
import delimited rename adm2_id ADM2_ID
save "Datasets\pop_districts_final.dta", replace
*****loading the shapefiles*****using "Datasets\zaf_admbnda_adm2_sadb_ocha_20201109", database(SA_counties) coordinates(coord_SA) genid(county_id) replace shp2dta
file duplicate_copy.dta already exists
r(602);
(encoding automatically selected: UTF-8)
(30 vars, 52 obs)
file Datasets\pop_districts_final.dta saved
type: 5
*****Merge the datasets*********use "SA_counties.dta", clear
merge 1:1 ADM2_ID using "Datasets\pop_districts_final"
drop _merge
cap
graph bar (asis) p_total ,over(ADM1_EN) asyvars over(ADM2_EN , sort(p_total) lab(angle(90))) ysize(10) nofill legend(pos(3) col(1)) xsize(22)
quietly graph export graph_hbar.svg, replace
***Draw the map****
using "coord_SA.dta", id(county_id) legend(on) fcolor(Greens2) legend(label(2 "74247-502821") label(3 "502821-811400") label(4 "811400,1164473.5" ) label(5 "31164473.5,4949347" ) )
spmap p_total /*specify base map (kencoord) and variable identifying relevant geographic units (county_id)*/
/*change default labels (just cosmetics really) */
quietly graph export graph_map1.svg, replace
file duplicate_copy.dta already exists
r(602);
Result Number of obs
-----------------------------------------
Not matched 0
Matched 52 (_merge==3)
-----------------------------------------