SIMULATION OF SURVEY DATASETS

INTRODUCTION

SURVEY DATA

Survey data is the data collected from a sample of respondents who took a survey. It is comprehensive information gathered from a target audience about a specific topic for research purposes. Many methods exist for collecting survey data and analyzing it statistically.

Researchers use various channels to collect feedback and opinions from the desired sample of individuals. When conducting survey research, they often draw on multiple sources, such as online surveys, telephone surveys, and face-to-face surveys.

However, the medium used to collect survey data determines which sample of people can be reached in order to obtain the required number of responses.

Factors such as how the interviewer contacts the respondents (online or offline) and how information is communicated to them determine the effectiveness of the gathered data.

SURVEY DATA COLLECTION METHODS

The methods used to collect survey data have evolved with technology. From face-to-face and telephone surveys to today's online and email surveys, the world of survey data collection has changed with time. Each method has its pros and cons, and every researcher has a preferred way of gathering accurate information from the target sample.

The survey response rates for these data collection methods differ because their reach and impact differ. A method is chosen according to the characteristics of the target population and the intent to examine human nature under various situations.

There are four main survey data collection methods: online surveys, face-to-face surveys, telephone surveys, and paper surveys.

ONLINE SURVEYS

Online surveys are typically the most cost-effective method and can reach more people than the other media. Their use is also far more widespread than that of the other data collection methods. When several questions must be put to the target sample, some researchers prefer online surveys to traditional face-to-face or telephone surveys.

Online surveys can use computational logic and question branching, which can make survey data collection more accurate than traditional means of surveying. They are straightforward to implement and take little of the respondents' time.

The investment required for online survey data collection is also small compared with the other methods. Results are collected in real time, so researchers can analyze them and decide on corrective measures. A good example is a hotel chain using an online survey to collect guest satisfaction metrics after a stay or an event at the property.

FACE-TO-FACE SURVEYS

Gathering information face to face is often more effective than the other media because respondents tend to trust the surveyor and give honest, clear feedback on the subject at hand. Researchers can readily tell whether respondents are uncomfortable with the questions asked, which makes this method especially productive when sensitive topics are involved.

This data collection method costs more than the other methods. Depending on the geographic or psychographic segmentation, researchers must be trained to gather accurate information.

For example, a job evaluation survey is conducted in person between an HR representative or a manager and the employee. This survey works best face to face because it allows the most accurate information to be collected.

TELEPHONE SURVEYS

Telephone surveys require much less investment than face-to-face surveys. Depending on the required reach, they cost as much as or a little more than online surveys. Contacting respondents by telephone requires less effort and manpower than face-to-face surveying. If interviewers are located in the same place, they can cross-check their questions to ensure that error-free questions are put to the target audience.

The main drawback of telephone surveys is that establishing a rapport with the respondent is harder because of the distance the medium imposes. Respondents are also highly likely to prefer to remain anonymous when giving feedback over the phone, since they may question the reliability of the researcher.

For example, if a retail giant would like to understand purchasing decisions, it can conduct a telephone survey on motivation and buying experience to collect data about the entire purchasing experience.

PAPER SURVEYS

The other commonly used survey method is the paper survey. Paper surveys can be used where laptops, computers, and tablets cannot go, and hence they rely on the age-old method of data collection: pen and paper. This method helps collect survey data in field research and can increase both the number of responses collected and their validity.

A popular use case for a paper survey is a fast-food restaurant survey, where the chain would like to collect feedback on the dining experience of its patrons.

TYPES OF SURVEY DATA BASED ON THE FREQUENCY AT WHICH THEY ARE ADMINISTERED

Surveys can be divided into three distinct types on the basis of how frequently they are administered. They are:

Cross-Sectional Surveys

Cross-sectional surveys are an observational research method that analyzes data on variables collected at a single point in time across a sample population or a predefined subset. The survey data from this method helps the researcher understand what respondents are feeling at that point in time.

It helps measure opinions in a particular situation. For example, if the researcher would like to understand movie rental habits, a survey can be conducted across demographics and geographical locations.

Such a cross-sectional survey might show, for example, that males aged 21 to 28 rent action movies and females aged 35 to 45 rent romantic comedies.

Longitudinal Surveys

Longitudinal surveys help researchers make observations and collect data over an extended period of time. The survey data can be qualitative or quantitative, and the survey creator does not interfere with the respondents.

For example, a longitudinal study can be carried out over several years to help understand whether mine workers are more prone to lung diseases. Such a study discounts any pre-existing conditions.

Retrospective Surveys

In retrospective surveys, researchers ask respondents to report events from the past. This survey method offers in-depth survey data but does not take as long to complete as a longitudinal survey. By deploying this kind of survey, researchers can gather data based on people's past experiences and beliefs.

For example, if hikers are asked about a certain hike – the conditions of the hiking trail, ease of hike, weather conditions, trekking conditions, etc. after they have completed the trek, it is a retrospective study.

SURVEY DATA ANALYSIS USING THE R LANGUAGE

After the survey data has been collected, it has to be analyzed to ensure that it serves the research objective. There are different ways of conducting this analysis, but four main steps apply:

Understand the most popular survey research questions: The survey questions should align with the overall purpose of the survey. Only then will the collected data be effective in helping researchers. For example, if a seminar has been conducted, the researchers will send out a post-seminar feedback survey. The primary goal of this survey will be to understand whether the attendees are interested in attending future seminars. The question will be: “How likely are you to attend future seminars?”. Data collected for this question will indicate the likelihood of success of future seminars.

Filter obtained results using the cross-tabulation technique: Understand the various categories in the target audience and their thoughts by using a cross-tabulation. For example, if business owners, administrators, students, and others attend the seminar, the data on whether each group would prefer to attend future seminars can be represented in a cross-tabulation.

Evaluate the derived numbers: Analyzing the gathered information is critical. How many of the attendees are of the opinion that they will be attending future seminars and how many will not – these facts need to be evaluated according to the results obtained from the sample.

Draw conclusions: Weave a story with the collected and analyzed data. Ask what the intention of the survey research was and how well the survey data meets that objective, then develop accurate, conclusive results.
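
The cross-tabulation step described above can be sketched in base R. The responses below are invented purely for illustration (the attendee_type and likely_to_attend vectors are hypothetical, not real survey results):

```r
# Hypothetical post-seminar responses (illustrative values only)
attendee_type <- c("Business owner", "Student", "Administrator", "Student",
                   "Business owner", "Student", "Administrator", "Business owner")
likely_to_attend <- c("Likely", "Likely", "Unlikely", "Likely",
                      "Unlikely", "Unlikely", "Likely", "Likely")

# Cross-tabulate attendee category against the stated likelihood
seminar_tab <- table(Type = attendee_type, Response = likely_to_attend)
print(seminar_tab)

# Share of each attendee category giving each response (each row sums to 1)
seminar_row_share <- prop.table(seminar_tab, margin = 1)
print(round(seminar_row_share, 2))
```

Reading the row shares side by side is one simple way to carry out the "evaluate the derived numbers" step for each category of attendee.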

SURVEY DATA ANALYSIS METHODS

Conducting a survey without access to the resulting data, or without the ability to draw conclusions from it, is pointless. When you conduct a survey, it is imperative to have access to its analytics. Analysis is difficult with traditional survey methods such as pen and paper, and it also requires additional manpower. It becomes much easier when the data is collected online through a survey platform, such as market research survey software or customer survey software, and analyzed with tools such as R and Python.

Statistical analysis can be conducted on the survey data to make sense of what has been collected. There are multiple analysis methods for quantitative data. Some of the commonly used types are:

Cross-tabulation analysis
Trend analysis
MaxDiff analysis
Conjoint analysis
Total Unduplicated Reach and Frequency (TURF) analysis
Gap analysis
SWOT analysis: SWOT analysis, another widely used method, organizes survey data into the strengths, weaknesses, opportunities, and threats of an organization, product, or service, giving a holistic picture of the competition. This method helps to create effective business strategies.
Text analysis: Text analysis is an advanced method in which software tools make sense of qualitative, open-ended responses and quantify them into easily understandable data. This method is used when the survey data is unstructured.

APPLICATION OF R PROGRAMMING LANGUAGE IN SURVEY DATA ANALYSIS

We want to describe and implement functions (commands) that aid the exploration of survey data via simple tabulations of respondent counts and proportions, including the ability to specify:

either a frequency count or a row/column/joint/total table proportion;
multiple row and column variables;
all margins, grand margins only, or no margins; and
retention of the data in a format that is amenable to further analysis in R.
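
As a rough base R sketch of these capabilities (not the package functions used later in this text), table(), prop.table(), and addmargins() cover counts, proportions, and margins. The small demo data frame is a made-up example, and all object names are our own:

```r
# Made-up respondent data, for illustration only
demo <- data.frame(
  Gender = c("Male", "Female", "Female", "Male", "Female", "Male"),
  Health = c("Poor", "Fair", "Okay", "Fair", "Fair", "Okay")
)

# Frequency counts for one row variable and one column variable
counts <- table(demo$Gender, demo$Health, dnn = c("Gender", "Health"))

row_prop   <- prop.table(counts, margin = 1)  # row proportions
col_prop   <- prop.table(counts, margin = 2)  # column proportions
joint_prop <- prop.table(counts)              # joint (table total) proportions

# Add a "Sum" row and column holding the margins
with_margins <- addmargins(counts)
print(with_margins)

# as.data.frame() keeps the counts in a long format for further analysis
counts_long <- as.data.frame(counts, responseName = "n")
print(counts_long)
```

Keeping the result as a table or data frame, rather than only printing it, is what makes further analysis in R possible.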

NOW, LET’S START THE CALCULATIONS

PRACTICAL ILLUSTRATION

We will simulate a dataset that approximates a real-life survey dataset (to roughly 90% accuracy). It will serve as our case study.

Install the required packages as follows:

install.packages("tidyverse")

## Warning: package 'tidyverse' is in use and will not be installed

install.packages("stringr")

## Warning: package 'stringr' is in use and will not be installed

install.packages("gmodels")

## Warning: package 'gmodels' is in use and will not be installed

#install.packages("ggplots")

install.packages("descr")

## Warning: package 'descr' is in use and will not be installed

After installation, load the packages one by one as follows. Run each line in turn; this makes the functions in each package available for use:

library("tidyverse")

library("stringr")

library("gmodels")

#library("ggplots")

library("descr")
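
Calling install.packages() for a package that is already loaded produces the warnings shown above. One way to avoid them is to install only what is missing. The helper name missing_packages below is our own and not part of any package; this is a sketch, and the install line is deliberately left commented out:

```r
# Hypothetical helper: return the packages that are not yet installed
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

needed <- c("tidyverse", "stringr", "gmodels", "descr")
to_install <- missing_packages(needed)
print(to_install)

# Uncomment to install only the missing packages:
# if (length(to_install) > 0) install.packages(to_install)
```

Note also that gmodels and descr both provide a function called CrossTable. Because descr is attached last here, an unqualified CrossTable() call will normally use the descr version; writing descr::CrossTable() or gmodels::CrossTable() makes the choice explicit.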

EXAMPLE 1:

TO CREATE SOME DEMOGRAPHIC DATA SETS

Run the following R code and observe the results:

ID = seq_len(3000)  # respondent IDs 1 to 3000 (not used in the data frame below)


  set.seed(234)
  Age = sample(c("0 - 5", "6 - 14", "15 - 24", "25 - 50", "51 - 64", "65+ "), 3000, replace = TRUE)
  #View(Age)


  set.seed(234)
  Gender = sample(c("Male", "Female"), 3000, replace = TRUE)
  #View(Gender)


  set.seed(234)
  Country = sample(c("Nigeria", "Ghana", "South Africa", "Botswana", "United Kingdom", "Austria"), 3000, replace = TRUE)
  #View(Country)


  set.seed(234)
  Health_Status = sample(c("Poor", "Fair", "Okay"), 3000, replace = TRUE) 
  #View(Health_Status)


  Survey = data.frame(Age, Gender, Country, Health_Status)
  #View(Survey)

  head(Survey)  # head() shows the first 6 rows of the generated data set

##       Age Gender        Country Health_Status
## 1   0 - 5   Male        Nigeria          Poor
## 2  6 - 14   Male          Ghana          Okay
## 3    65+  Female        Austria          Fair
## 4  6 - 14 Female          Ghana          Fair
## 5  6 - 14 Female          Ghana          Fair
## 6 51 - 64 Female United Kingdom          Fair

  # head() also accepts the number of rows to show:
  
  head(Survey, 15)  # shows the first 15 rows of Survey

##        Age Gender        Country Health_Status
## 1    0 - 5   Male        Nigeria          Poor
## 2   6 - 14   Male          Ghana          Okay
## 3     65+  Female        Austria          Fair
## 4   6 - 14 Female          Ghana          Fair
## 5   6 - 14 Female          Ghana          Fair
## 6  51 - 64 Female United Kingdom          Fair
## 7    0 - 5   Male        Nigeria          Poor
## 8  25 - 50 Female       Botswana          Poor
## 9  25 - 50   Male       Botswana          Okay
## 10    65+  Female        Austria          Fair
## 11    65+  Female        Austria          Okay
## 12 15 - 24   Male   South Africa          Okay
## 13  6 - 14 Female          Ghana          Fair
## 14  6 - 14   Male          Ghana          Okay
## 15 51 - 64   Male United Kingdom          Fair

  # tail() works the same way from the end of the data set:

  tail(Survey)  # shows the last 6 rows of the generated data set

##          Age Gender        Country Health_Status
## 2995 15 - 24   Male   South Africa          Okay
## 2996 15 - 24   Male   South Africa          Poor
## 2997 51 - 64 Female United Kingdom          Fair
## 2998  6 - 14 Female          Ghana          Poor
## 2999   0 - 5 Female        Nigeria          Okay
## 3000 25 - 50 Female       Botswana          Poor

  tail(Survey, 20)  # shows the last 20 rows of the generated data set

##          Age Gender        Country Health_Status
## 2981   0 - 5 Female        Nigeria          Fair
## 2982  6 - 14 Female          Ghana          Okay
## 2983 15 - 24 Female   South Africa          Poor
## 2984  6 - 14   Male          Ghana          Poor
## 2985    65+    Male        Austria          Poor
## 2986   0 - 5 Female        Nigeria          Fair
## 2987 51 - 64 Female United Kingdom          Fair
## 2988 51 - 64 Female United Kingdom          Poor
## 2989  6 - 14   Male          Ghana          Okay
## 2990  6 - 14 Female          Ghana          Okay
## 2991 51 - 64   Male United Kingdom          Okay
## 2992 15 - 24   Male   South Africa          Poor
## 2993 25 - 50 Female       Botswana          Okay
## 2994   0 - 5 Female        Nigeria          Okay
## 2995 15 - 24   Male   South Africa          Okay
## 2996 15 - 24   Male   South Africa          Poor
## 2997 51 - 64 Female United Kingdom          Fair
## 2998  6 - 14 Female          Ghana          Poor
## 2999   0 - 5 Female        Nigeria          Okay
## 3000 25 - 50 Female       Botswana          Poor

Now, let’s start working with our generated “Survey” data set:

  result_1 = CrossTable(Survey$Age, Survey$Gender, expected = TRUE, chisq = TRUE, prop.chisq = TRUE, dnn = c("Age", "Gender"))

  print(result_1)

##    Cell Contents 
## |-------------------------|
## |                       N | 
## |              Expected N | 
## | Chi-square contribution | 
## |           N / Row Total | 
## |           N / Col Total | 
## |         N / Table Total | 
## |-------------------------|
## 
## =================================
##            Gender
## Age        Female    Male   Total
## ---------------------------------
## 0 - 5         253     227     480
##             246.7   233.3        
##             0.160   0.169        
##             0.527   0.473   0.160
##             0.164   0.156        
##             0.084   0.076        
## ---------------------------------
## 15 - 24       246     259     505
##             259.6   245.4        
##             0.709   0.750        
##             0.487   0.513   0.168
##             0.160   0.178        
##             0.082   0.086        
## ---------------------------------
## 25 - 50       235     263     498
##             256.0   242.0        
##             1.718   1.817        
##             0.472   0.528   0.166
##             0.152   0.180        
##             0.078   0.088        
## ---------------------------------
## 51 - 64       269     228     497
##             255.5   241.5        
##             0.718   0.759        
##             0.541   0.459   0.166
##             0.174   0.156        
##             0.090   0.076        
## ---------------------------------
## 6 - 14        268     232     500
##             257.0   243.0        
##             0.471   0.498        
##             0.536   0.464   0.167
##             0.174   0.159        
##             0.089   0.077        
## ---------------------------------
## 65+           271     249     520
##             267.3   252.7        
##             0.052   0.055        
##             0.521   0.479   0.173
##             0.176   0.171        
##             0.090   0.083        
## ---------------------------------
## Total        1542    1458    3000
##             0.514   0.486        
## =================================
## 
## Statistics for All Table Factors
## 
## Pearson's Chi-squared test 
## ------------------------------------------------------------
## Chi^2 = 7.876522      d.f. = 5      p = 0.163

To generate the Age by Gender table in SPSS format, use the command below:

  result_1B = CrossTable(Survey$Age, Survey$Gender, prop.r = FALSE, prop.t = FALSE, prop.chisq = FALSE, format = "SPSS")
  print(result_1B)

##    Cell Contents 
## |-------------------------|
## |                   Count | 
## |          Column Percent | 
## |-------------------------|
## 
## ====================================
##               Survey$Gender
## Survey$Age    Female    Male   Total
## ------------------------------------
## 0 - 5           253     227     480 
##                16.4%   15.6%        
## ------------------------------------
## 15 - 24         246     259     505 
##                16.0%   17.8%        
## ------------------------------------
## 25 - 50         235     263     498 
##                15.2%   18.0%        
## ------------------------------------
## 51 - 64         269     228     497 
##                17.4%   15.6%        
## ------------------------------------
## 6 - 14          268     232     500 
##                17.4%   15.9%        
## ------------------------------------
## 65+             271     249     520 
##                17.6%   17.1%        
## ------------------------------------
## Total          1542    1458    3000 
##                51.4%   48.6%        
## ====================================
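
In both Age by Gender tables the age bands are sorted as text, so "6 - 14" appears after "51 - 64". A possible remedy, sketched here on a small made-up vector rather than on the Survey data itself, is to convert the variable to a factor with explicitly ordered levels (the names age_levels, age_raw, and age_f are ours):

```r
# Age bands in their natural order
age_levels <- c("0 - 5", "6 - 14", "15 - 24", "25 - 50", "51 - 64", "65+")

# Small made-up sample of age bands, for illustration only
age_raw <- c("6 - 14", "65+", "0 - 5", "51 - 64", "15 - 24", "6 - 14", "25 - 50")

# Default tabulation sorts the labels as text
print(names(table(age_raw)))

# A factor with explicit levels keeps the intended order in tables
age_f <- factor(age_raw, levels = age_levels)
print(table(age_f))
```

Note that the simulated Age variable uses the label "65+ " with a trailing space, so the levels would have to match that label exactly, or the labels would need trimming first (for instance with trimws()).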

Now, let’s consider Age and Health Status:

  result_2 = CrossTable(Survey$Age, Survey$Health_Status, expected = TRUE, chisq = TRUE, prop.chisq = TRUE, dnn = c("Age", "Health_Status"))
  print(result_2)

##    Cell Contents 
## |-------------------------|
## |                       N | 
## |              Expected N | 
## | Chi-square contribution | 
## |           N / Row Total | 
## |           N / Col Total | 
## |         N / Table Total | 
## |-------------------------|
## 
## ========================================
##            Health_Status
## Age         Fair    Okay    Poor   Total
## ----------------------------------------
## 0 - 5        145     181     154     480
##            163.2   160.3   156.5        
##            2.030   2.668   0.039        
##            0.302   0.377   0.321   0.160
##            0.142   0.181   0.157        
##            0.048   0.060   0.051        
## ----------------------------------------
## 15 - 24      166     179     160     505
##            171.7   168.7   164.6        
##            0.189   0.633   0.130        
##            0.329   0.354   0.317   0.168
##            0.163   0.179   0.164        
##            0.055   0.060   0.053        
## ----------------------------------------
## 25 - 50      177     153     168     498
##            169.3   166.3   162.3        
##            0.348   1.069   0.197        
##            0.355   0.307   0.337   0.166
##            0.174   0.153   0.172        
##            0.059   0.051   0.056        
## ----------------------------------------
## 51 - 64      164     153     180     497
##            169.0   166.0   162.0        
##            0.147   1.018   1.995        
##            0.330   0.308   0.362   0.166
##            0.161   0.153   0.184        
##            0.055   0.051   0.060        
## ----------------------------------------
## 6 - 14       179     176     145     500
##            170.0   167.0   163.0        
##            0.476   0.485   1.988        
##            0.358   0.352   0.290   0.167
##            0.175   0.176   0.148        
##            0.060   0.059   0.048        
## ----------------------------------------
## 65+          189     160     171     520
##            176.8   173.7   169.5        
##            0.842   1.078   0.013        
##            0.363   0.308   0.329   0.173
##            0.185   0.160   0.175        
##            0.063   0.053   0.057        
## ----------------------------------------
## Total       1020    1002     978    3000
##            0.340   0.334   0.326        
## ========================================
## 
## Statistics for All Table Factors
## 
## Pearson's Chi-squared test 
## ------------------------------------------------------------
## Chi^2 = 15.34322      d.f. = 10      p = 0.12

Now let’s display the Age by Health Status results in SPSS format:

  result_2B = CrossTable(Survey$Age, Survey$Health_Status, prop.r = FALSE, prop.t = FALSE, prop.chisq = FALSE, format = "SPSS")
  print(result_2B)

##    Cell Contents 
## |-------------------------|
## |                   Count | 
## |          Column Percent | 
## |-------------------------|
## 
## ===========================================
##               Survey$Health_Status
## Survey$Age     Fair    Okay    Poor   Total
## -------------------------------------------
## 0 - 5          145     181     154     480 
##               14.2%   18.1%   15.7%        
## -------------------------------------------
## 15 - 24        166     179     160     505 
##               16.3%   17.9%   16.4%        
## -------------------------------------------
## 25 - 50        177     153     168     498 
##               17.4%   15.3%   17.2%        
## -------------------------------------------
## 51 - 64        164     153     180     497 
##               16.1%   15.3%   18.4%        
## -------------------------------------------
## 6 - 14         179     176     145     500 
##               17.5%   17.6%   14.8%        
## -------------------------------------------
## 65+            189     160     171     520 
##               18.5%   16.0%   17.5%        
## -------------------------------------------
## Total         1020    1002     978    3000 
##               34.0%   33.4%   32.6%        
## ===========================================

Next, consider Age and Country:

  result_3 = CrossTable(Survey$Age, Survey$Country, expected = TRUE, chisq = TRUE, prop.chisq = TRUE, dnn = c("Age", "Country"))
  print(result_3)

##    Cell Contents 
## |-------------------------|
## |                       N | 
## |              Expected N | 
## | Chi-square contribution | 
## |           N / Row Total | 
## |           N / Col Total | 
## |         N / Table Total | 
## |-------------------------|
## 
## =====================================================================================
##            Country
## Age         Austria   Botswana      Ghana    Nigeria   Sth Afrc   Untd Kng      Total
## -------------------------------------------------------------------------------------
## 0 - 5             0          0          0        480          0          0        480
##                83.2       79.7       80.0       76.8       80.8       79.5           
##              83.200     79.680     80.000   2116.800     80.800     79.520           
##               0.000      0.000      0.000      1.000      0.000      0.000      0.160
##                   0          0          0          1          0          0           
##               0.000      0.000      0.000      0.160      0.000      0.000           
## -------------------------------------------------------------------------------------
## 15 - 24           0          0          0          0        505          0        505
##                87.5       83.8       84.2       80.8       85.0       83.7           
##              87.533     83.830     84.167     80.800   2075.008     83.662           
##               0.000      0.000      0.000      0.000      1.000      0.000      0.168
##                   0          0          0          0          1          0           
##               0.000      0.000      0.000      0.000      0.168      0.000           
## -------------------------------------------------------------------------------------
## 25 - 50           0        498          0          0          0          0        498
##                86.3       82.7       83.0       79.7       83.8       82.5           
##              86.320   2086.668     83.000     79.680     83.830     82.502           
##               0.000      1.000      0.000      0.000      0.000      0.000      0.166
##                   0          1          0          0          0          0           
##               0.000      0.166      0.000      0.000      0.000      0.000           
## -------------------------------------------------------------------------------------
## 51 - 64           0          0          0          0          0        497        497
##                86.1       82.5       82.8       79.5       83.7       82.3           
##              86.147     82.502     82.833     79.520     83.662   2088.336           
##               0.000      0.000      0.000      0.000      0.000      1.000      0.166
##                   0          0          0          0          0          1           
##               0.000      0.000      0.000      0.000      0.000      0.166           
## -------------------------------------------------------------------------------------
## 6 - 14            0          0        500          0          0          0        500
##                86.7       83.0       83.3       80.0       84.2       82.8           
##              86.667     83.000   2083.333     80.000     84.167     82.833           
##               0.000      0.000      1.000      0.000      0.000      0.000      0.167
##                   0          0          1          0          0          0           
##               0.000      0.000      0.167      0.000      0.000      0.000           
## -------------------------------------------------------------------------------------
## 65+             520          0          0          0          0          0        520
##                90.1       86.3       86.7       83.2       87.5       86.1           
##            2050.133     86.320     86.667     83.200     87.533     86.147           
##               1.000      0.000      0.000      0.000      0.000      0.000      0.173
##                   1          0          0          0          0          0           
##               0.173      0.000      0.000      0.000      0.000      0.000           
## -------------------------------------------------------------------------------------
## Total           520        498        500        480        505        497       3000
##               0.173      0.166      0.167      0.160      0.168      0.166           
## =====================================================================================
## 
## Statistics for All Table Factors
## 
## Pearson's Chi-squared test 
## ------------------------------------------------------------
## Chi^2 = 15000      d.f. = 25      p <2e-16
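
In this table every age band falls in exactly one country, and the test reports a perfect association (Chi^2 = 15000). This looks like a side effect of the simulation rather than a feature of the data: set.seed(234) is called before each sample() call, and Age and Country are each drawn from six labels, so both follow the same sequence of random positions. The sketch below reproduces the effect and shows one alternative, seeding once so that the draws are independent. The object names are ours, and whether independent variables are wanted is an assumption about the intended simulation.

```r
ages <- c("0 - 5", "6 - 14", "15 - 24", "25 - 50", "51 - 64", "65+")
countries <- c("Nigeria", "Ghana", "South Africa", "Botswana",
               "United Kingdom", "Austria")

# Re-seeding before each draw: both vectors use the same random positions
set.seed(234)
age_same <- sample(ages, 3000, replace = TRUE)
set.seed(234)
country_same <- sample(countries, 3000, replace = TRUE)
tab_same <- table(age_same, country_same)
print(sum(tab_same > 0))   # number of non-empty cells out of 36

# Seeding once: the second draw continues the random stream
set.seed(234)
age_indep <- sample(ages, 3000, replace = TRUE)
country_indep <- sample(countries, 3000, replace = TRUE)
tab_indep <- table(age_indep, country_indep)
print(sum(tab_indep > 0))
```

Because Country mirrors Age in the simulated Survey data, a Gender by Country table contains the same counts as the Age by Gender table with rows and columns swapped, and therefore the same chi-squared statistic.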

Here is the same Age by Country table in SPSS format:

  result_3B = CrossTable(Survey$Age, Survey$Country, prop.r = FALSE, prop.t = FALSE, prop.chisq = FALSE, format = "SPSS")
  print(result_3B)

##    Cell Contents 
## |-------------------------|
## |                   Count | 
## |          Column Percent | 
## |-------------------------|
## 
## ==============================================================================
##             Survey$Country
## Survy$Ag    Austria   Botswana   Ghana   Nigeria   Sth Afrc   Untd Kng   Total
## ------------------------------------------------------------------------------
## 0 - 5            0          0       0       480          0          0     480 
##                  0%         0%      0%      100%         0%         0%        
## ------------------------------------------------------------------------------
## 15 - 24          0          0       0         0        505          0     505 
##                  0%         0%      0%        0%       100%         0%        
## ------------------------------------------------------------------------------
## 25 - 50          0        498       0         0          0          0     498 
##                  0%       100%      0%        0%         0%         0%        
## ------------------------------------------------------------------------------
## 51 - 64          0          0       0         0          0        497     497 
##                  0%         0%      0%        0%         0%       100%        
## ------------------------------------------------------------------------------
## 6 - 14           0          0     500         0          0          0     500 
##                  0%         0%    100%        0%         0%         0%        
## ------------------------------------------------------------------------------
## 65+            520          0       0         0          0          0     520 
##                100%         0%      0%        0%         0%         0%        
## ------------------------------------------------------------------------------
## Total          520        498     500       480        505        497    3000 
##               17.3%      16.6%   16.7%     16.0%      16.8%      16.6%        
## ==============================================================================

Next, consider Gender and Country:

  result_4 = CrossTable(Survey$Gender, Survey$Country, expected = TRUE, chisq = TRUE, prop.chisq = TRUE, dnn = c("Gender", "Country"))
  print(result_4)

##    Cell Contents 
## |-------------------------|
## |                       N | 
## |              Expected N | 
## | Chi-square contribution | 
## |           N / Row Total | 
## |           N / Col Total | 
## |         N / Table Total | 
## |-------------------------|
## 
## ================================================================================
##           Country
## Gender    Austria   Botswana   Ghana   Nigeria   South Afrc   Untd Kngdm   Total
## --------------------------------------------------------------------------------
## Female        271        235     268       253          246          269    1542
##             267.3      256.0   257.0     246.7        259.6        255.5        
##             0.052      1.718   0.471     0.160        0.709        0.718        
##             0.176      0.152   0.174     0.164        0.160        0.174   0.514
##             0.521      0.472   0.536     0.527        0.487        0.541        
##             0.090      0.078   0.089     0.084        0.082        0.090        
## --------------------------------------------------------------------------------
## Male          249        263     232       227          259          228    1458
##             252.7      242.0   243.0     233.3        245.4        241.5        
##             0.055      1.817   0.498     0.169        0.750        0.759        
##             0.171      0.180   0.159     0.156        0.178        0.156   0.486
##             0.479      0.528   0.464     0.473        0.513        0.459        
##             0.083      0.088   0.077     0.076        0.086        0.076        
## --------------------------------------------------------------------------------
## Total         520        498     500       480          505          497    3000
##             0.173      0.166   0.167     0.160        0.168        0.166        
## ================================================================================
## 
## Statistics for All Table Factors
## 
## Pearson's Chi-squared test 
## ------------------------------------------------------------
## Chi^2 = 7.876522      d.f. = 5      p = 0.163
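
To see where the "Expected N" and "Chi-square contribution" rows come from, the statistics can be recomputed by hand in base R. This is a minimal sketch, assuming the observed counts are typed in from the table above; the `observed` matrix and the other variable names are illustrative and are not part of the original script.

```r
# Observed counts copied from the Gender x Country table above
observed <- matrix(
  c(271, 235, 268, 253, 246, 269,
    249, 263, 232, 227, 259, 228),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    Gender  = c("Female", "Male"),
    Country = c("Austria", "Botswana", "Ghana", "Nigeria",
                "South Africa", "United Kingdom")
  )
)

# Expected count under independence: row total * column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Contribution of each cell to the Pearson chi-square statistic
contribution <- (observed - expected)^2 / expected

chi_sq  <- sum(contribution)
df      <- (nrow(observed) - 1) * (ncol(observed) - 1)
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

round(expected, 1)
round(contribution, 3)
c(chi_sq = chi_sq, df = df, p_value = p_value)
```

Since p is about 0.163, well above the usual 0.05 threshold, the test gives no evidence of an association between Gender and Country in this simulated dataset.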

The same table can also be printed in SPSS format. Here only the counts and the column percentages are kept:

  result_4B = CrossTable(Survey$Gender, Survey$Country, prop.r = FALSE, prop.t = FALSE, prop.chisq = FALSE, format = "SPSS")
  print(result_4B)

##    Cell Contents 
## |-------------------------|
## |                   Count | 
## |          Column Percent | 
## |-------------------------|
## 
## ==============================================================================
##             Survey$Country
## Srvy$Gnd    Austria   Botswana   Ghana   Nigeria   Sth Afrc   Untd Kng   Total
## ------------------------------------------------------------------------------
## Female         271        235     268       253        246        269    1542 
##               52.1%      47.2%   53.6%     52.7%      48.7%      54.1%        
## ------------------------------------------------------------------------------
## Male           249        263     232       227        259        228    1458 
##               47.9%      52.8%   46.4%     47.3%      51.3%      45.9%        
## ------------------------------------------------------------------------------
## Total          520        498     500       480        505        497    3000 
##               17.3%      16.6%   16.7%     16.0%      16.8%      16.6%        
## ==============================================================================

The next example cross-tabulates Gender against Health_Status in the same way:

  result_5 = CrossTable(Survey$Gender, Survey$Health_Status, expected = TRUE, chisq = TRUE, prop.chisq = TRUE, dnn = c("Gender", "Health_Status"))
  print(result_5)

##    Cell Contents 
## |-------------------------|
## |                       N | 
## |              Expected N | 
## | Chi-square contribution | 
## |           N / Row Total | 
## |           N / Col Total | 
## |         N / Table Total | 
## |-------------------------|
## 
## =======================================
##           Health_Status
## Gender     Fair    Okay    Poor   Total
## ---------------------------------------
## Female      538     493     511    1542
##           524.3   515.0   502.7        
##           0.359   0.942   0.137        
##           0.349   0.320   0.331   0.514
##           0.527   0.492   0.522        
##           0.179   0.164   0.170        
## ---------------------------------------
## Male        482     509     467    1458
##           495.7   487.0   475.3        
##           0.380   0.996   0.145        
##           0.331   0.349   0.320   0.486
##           0.473   0.508   0.478        
##           0.161   0.170   0.156        
## ---------------------------------------
## Total      1020    1002     978    3000
##           0.340   0.334   0.326        
## =======================================
## 
## Statistics for All Table Factors
## 
## Pearson's Chi-squared test 
## ------------------------------------------------------------
## Chi^2 = 2.959869      d.f. = 2      p = 0.228

This table can also be printed in SPSS format:

  result_5B = CrossTable(Survey$Gender, Survey$Health_Status, prop.r = FALSE, prop.t = FALSE, prop.chisq = FALSE, format = "SPSS")
  print(result_5B)

##    Cell Contents 
## |-------------------------|
## |                   Count | 
## |          Column Percent | 
## |-------------------------|
## 
## ==============================================
##                  Survey$Health_Status
## Survey$Gender     Fair    Okay    Poor   Total
## ----------------------------------------------
## Female            538     493     511    1542 
##                  52.7%   49.2%   52.2%        
## ----------------------------------------------
## Male              482     509     467    1458 
##                  47.3%   50.8%   47.8%        
## ----------------------------------------------
## Total            1020    1002     978    3000 
##                  34.0%   33.4%   32.6%        
## ==============================================
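
The column percentages in the SPSS-style table are each cell count divided by its column total. A minimal base R sketch, assuming the counts are typed in from the table above (the `health` matrix is an illustrative name, not an object from the original script):

```r
# Counts copied from the Gender x Health_Status table above
health <- matrix(
  c(538, 493, 511,
    482, 509, 467),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    Gender        = c("Female", "Male"),
    Health_Status = c("Fair", "Okay", "Poor")
  )
)

# margin = 2 divides every cell by its column total
col_pct <- round(100 * prop.table(health, margin = 2), 1)

# Share of each Health_Status category in the whole sample (the "Total" row)
total_pct <- round(100 * colSums(health) / sum(health), 1)

col_pct
total_pct
```

Using `margin = 1` instead would give row percentages, which is what `prop.r = TRUE` requests from `CrossTable()`.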

#### EXAMPLE 2: ####

  data(esoph, package = "datasets")
  #View(esoph)
  names(esoph)

## [1] "agegp"     "alcgp"     "tobgp"     "ncases"    "ncontrols"

  result_6 = CrossTable(esoph$alcgp, esoph$agegp, expected = TRUE, chisq = TRUE, prop.chisq = TRUE, dnn = c("Alcohol consumption", "Age group"))

## Warning in chisq.test(tab, correct = FALSE, ...): Chi-squared approximation may
## be incorrect

  print(result_6)

##    Cell Contents 
## |-------------------------|
## |                       N | 
## |              Expected N | 
## | Chi-square contribution | 
## |           N / Row Total | 
## |           N / Col Total | 
## |         N / Table Total | 
## |-------------------------|
## 
## ============================================================================
##                        Age group
## Alcohol consumption    25-34   35-44   45-54   55-64   65-74     75+   Total
## ----------------------------------------------------------------------------
## 0-39g/day                  4       4       4       4       4       3      23
##                          3.9     3.9     4.2     4.2     3.9     2.9        
##                        0.002   0.002   0.008   0.008   0.002   0.005        
##                        0.174   0.174   0.174   0.174   0.174   0.130   0.261
##                        0.267   0.267   0.250   0.250   0.267   0.273        
##                        0.045   0.045   0.045   0.045   0.045   0.034        
## ----------------------------------------------------------------------------
## 40-79                      4       4       4       4       3       4      23
##                          3.9     3.9     4.2     4.2     3.9     2.9        
##                        0.002   0.002   0.008   0.008   0.216   0.440        
##                        0.174   0.174   0.174   0.174   0.130   0.174   0.261
##                        0.267   0.267   0.250   0.250   0.200   0.364        
##                        0.045   0.045   0.045   0.045   0.034   0.045        
## ----------------------------------------------------------------------------
## 80-119                     3       4       4       4       4       2      21
##                          3.6     3.6     3.8     3.8     3.6     2.6        
##                        0.094   0.049   0.009   0.009   0.049   0.149        
##                        0.143   0.190   0.190   0.190   0.190   0.095   0.239
##                        0.200   0.267   0.250   0.250   0.267   0.182        
##                        0.034   0.045   0.045   0.045   0.045   0.023        
## ----------------------------------------------------------------------------
## 120+                       4       3       4       4       4       2      21
##                          3.6     3.6     3.8     3.8     3.6     2.6        
##                        0.049   0.094   0.009   0.009   0.049   0.149        
##                        0.190   0.143   0.190   0.190   0.190   0.095   0.239
##                        0.267   0.200   0.250   0.250   0.267   0.182        
##                        0.045   0.034   0.045   0.045   0.045   0.023        
## ----------------------------------------------------------------------------
## Total                     15      15      16      16      15      11      88
##                        0.170   0.170   0.182   0.182   0.170   0.125        
## ============================================================================
## 
## Statistics for All Table Factors
## 
## Pearson's Chi-squared test 
## ------------------------------------------------------------
## Chi^2 = 1.41891      d.f. = 15      p = 1
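
The warning printed for this example comes from `chisq.test()`, which complains when any expected count is below 5. The sketch below, written in base R with illustrative variable names, checks the expected counts for the `alcgp` by `agegp` table. It also shows one possible alternative: each row of `esoph` is an age/alcohol/tobacco combination rather than a person, so the table above counts combinations, and weighting by `ncases` is one way to tabulate cases instead.

```r
data(esoph, package = "datasets")

# Same cross-tabulation as in CrossTable(), built with base R
tab <- table(esoph$alcgp, esoph$agegp)

# Expected counts under independence; suppressWarnings() hides the
# "Chi-squared approximation may be incorrect" warning seen above
expected <- suppressWarnings(chisq.test(tab, correct = FALSE))$expected
round(expected, 1)

# Number of cells whose expected count is below 5, out of all cells
sum(expected < 5)
length(expected)

# Alternative (an assumption about what one may want to analyse):
# number of cancer cases per alcohol group and age group
case_tab <- xtabs(ncases ~ alcgp + agegp, data = esoph)
case_tab
```

With every expected count below 5, the chi-square p-value for the first table should not be relied on.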

#### EXAMPLE 3: ####

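  set.seed(234)
  # Two groups of 900 respondents each
  sex = factor(c(rep("F", 900), rep("M", 900)))
  # Simulated income: normal with mean 500 and standard deviation 100
  income = 100 * (rnorm(1800) + 5)
  # Survey weights: 1 by default, 3 for women with income above 500
  weight = rep(1, 1800)
  weight[sex == "F" & income > 500] = 3
  #View(weight)
  # Labels that compmeans() uses in its title
  attr(income, "label") = "Income"
  attr(sex, "label") = "Sex"
  # Unweighted comparison of mean income by sex (also draws a plot)
  compmeans(income, sex, col = "lightgray", ylab = "income", xlab = "sex")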

## Mean value of "Income" according to "Sex"
##           Mean    N Std. Dev.
## F     497.6180  900  96.95414
## M     503.7893  900  98.07035
## Total 500.7036 1800  97.53558

  # Weighted comparison: build the table without plotting, then plot it separately
  comp = compmeans(income, sex, weight, plot = FALSE)
  plot(comp, col = c("orange", "lightblue"), ylab = "income", xlab = "sex")
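
The effect of the weights on the group means can be checked with base R alone. This is a minimal sketch that rebuilds the simulated data and assumes `compmeans()` applies the weights as an ordinary weighted mean; `unweighted` and `weighted` are illustrative names.

```r
set.seed(234)
sex    <- factor(c(rep("F", 900), rep("M", 900)))
income <- 100 * (rnorm(1800) + 5)
weight <- rep(1, 1800)
weight[sex == "F" & income > 500] <- 3

# Unweighted mean income for each sex
unweighted <- tapply(income, sex, mean)

# Weighted mean income for each sex, using the survey weights
weighted <- sapply(levels(sex), function(g) {
  weighted.mean(income[sex == g], weight[sex == g])
})

rbind(unweighted, weighted)
```

All men have weight 1, so their mean is unchanged. Women with income above 500 count three times, which pulls the weighted mean for women above the unweighted one.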

SIMULATION OF SURVEY DATASETS - PART 1

Developed by Timothy A. OGUNLEYE

June - August, 2022

STA 308: LAB FIELDWORK FOR SURVEY METHODS & SAMPLING THEORY
