Tidy Thinking in R¶

Professor Dr. Md. Kamrul Hasan¶

  • Set working directory: RStudio > Session > Set Working Directory > Choose Directory...
  • Loot at the Files tab and make sure it is not an alien folder
  • Find your script to open or open a new R script.
  • Do not remove the .R from the script name.
  • Bring the data set to this folder to read properly.
  • Click here to download the data.
  • data download

Piping operator %>% (Ctrl+Shift+M)¶

  • input %>% output
  • step 1 %>% step 2 %>% step 3 %>% .....
  • You cannot use %>% operator until you use some extra package.
  • R package is R library that is loaded to add extra functionalities to base R.
  • For the first time you need to install using install.packages('pacman') or from Tools > Install Packages
  • Before starting to use any function, you need to load the package by pacman::p_load(package_name)
In [42]:
pacman::p_load(readxl, openxlsx, kableExtra, tidyverse, tidymodels)
  • Load the dataset and give a name to for further use.
  • If you do not have the csv file, first download the excel file and save it as csv (comma separated value) file.
In [43]:
mydata_csv = read.csv("mydata.csv")
mydata_csv
A data.frame: 20 × 6
Xserialagefamilyincomeland
<int><int><dbl><int><dbl><chr>
1 143.3185722759.327urban
2 244.3094754017.901rural
3 349.6761213516.885rural
4 445.2115313840.730urban
5 545.3878621002.499rural
6 650.1451932901.266rural
7 746.3827541880.476urban
8 841.2048252519.266urban
9 942.9394453451.084rural
101043.6630132407.192urban
111148.6722511444.542urban
121246.0794421974.478urban
131346.2023153672.222rural
141445.3320552670.587rural
151543.3324844152.783urban
161650.3607451411.459rural
171746.4935522739.571rural
181839.1001514939.828urban
191947.1040714572.204urban
202043.5816334545.876urban
In [44]:
# Read this dataset excluding the first column
mydata_csv = read.csv("mydata.csv")[, -1]
mydata_csv
A data.frame: 20 × 5
serialagefamilyincomeland
<int><dbl><int><dbl><chr>
143.3185722759.327urban
244.3094754017.901rural
349.6761213516.885rural
445.2115313840.730urban
545.3878621002.499rural
650.1451932901.266rural
746.3827541880.476urban
841.2048252519.266urban
942.9394453451.084rural
1043.6630132407.192urban
1148.6722511444.542urban
1246.0794421974.478urban
1346.2023153672.222rural
1445.3320552670.587rural
1543.3324844152.783urban
1650.3607451411.459rural
1746.4935522739.571rural
1839.1001514939.828urban
1947.1040714572.204urban
2043.5816334545.876urban
In [45]:
# Read the dataset from the .txt file
mydata_txt = read.table("mydata.txt", header = TRUE, sep = " ")
mydata_txt
A data.frame: 20 × 5
serialagefamilyincomeland
<int><dbl><int><dbl><chr>
1 143.3185722759.327urban
2 244.3094754017.901rural
3 349.6761213516.885rural
4 445.2115313840.730urban
5 545.3878621002.499rural
6 650.1451932901.266rural
7 746.3827541880.476urban
8 841.2048252519.266urban
9 942.9394453451.084rural
101043.6630132407.192urban
111148.6722511444.542urban
121246.0794421974.478urban
131346.2023153672.222rural
141445.3320552670.587rural
151543.3324844152.783urban
161650.3607451411.459rural
171746.4935522739.571rural
181839.1001514939.828urban
191947.1040714572.204urban
202043.5816334545.876urban
In [46]:
# Read the dataset from the .xlsx file
mydata_xlsx = read.xlsx("mydata.xlsx")
mydata_xlsx
A data.frame: 20 × 5
serialagefamilyincomeland
<dbl><dbl><dbl><dbl><chr>
1 143.3185722759.327urban
2 244.3094754017.901rural
3 349.6761213516.885rural
4 445.2115313840.730urban
5 545.3878621002.499rural
6 650.1451932901.266rural
7 746.3827541880.476urban
8 841.2048252519.266urban
9 942.9394453451.084rural
101043.6630132407.192urban
111148.6722511444.542urban
121246.0794421974.478urban
131346.2023153672.222rural
141445.3320552670.587rural
151543.3324844152.783urban
161650.3607451411.459rural
171746.4935522739.571rural
181839.1001514939.828urban
191947.1040714572.204urban
202043.5816334545.876urban
  • You can specify the sheet name and data range in the read_xlsx() function.
  • read_xlsx() function is available in the readxl package.
  • Now read the excel file with data range and sheet name
In [47]:
mydata_xlsx = read_xlsx("mydata.xlsx", sheet = "Sheet 1", range = "A1:E20")
mydata_xlsx
A tibble: 19 × 5
serialagefamilyincomeland
<dbl><dbl><dbl><dbl><chr>
143.3185722759.327urban
244.3094754017.901rural
349.6761213516.885rural
445.2115313840.730urban
545.3878621002.499rural
650.1451932901.266rural
746.3827541880.476urban
841.2048252519.266urban
942.9394453451.084rural
1043.6630132407.192urban
1148.6722511444.542urban
1246.0794421974.478urban
1346.2023153672.222rural
1445.3320552670.587rural
1543.3324844152.783urban
1650.3607451411.459rural
1746.4935522739.571rural
1839.1001514939.828urban
1947.1040714572.204urban
In [48]:
# ``, '', "", [ row , column ], $, ~ are used in R.
# ``, '', "" are used for quoting.
# [ row , column ] is used for indexing.
mydata_xlsx[1:2, 2] # First and second row, second column
A tibble: 2 × 1
age
<dbl>
43.31857
44.30947
In [49]:
# $ is used to access the columns of a dataset.
mydata_xlsx$age # Access the age column
  1. 43.3185730603434
  2. 44.3094675315502
  3. 49.6761249424474
  4. 45.2115251742737
  5. 45.3878632054828
  6. 50.1451949606498
  7. 46.3827486179676
  8. 41.2048162961804
  9. 42.9394414443194
  10. 43.6630140897001
  11. 48.6722453923184
  12. 46.0794414811721
  13. 46.2023143517822
  14. 45.3320481478354
  15. 43.3324765957378
  16. 50.3607394104092
  17. 46.4935514346877
  18. 39.1001485301111
  19. 47.1040677046911
In [50]:
mydata_xlsx$family[3:5] # Access the family column from the 3rd to 5th row
  1. 1
  2. 1
  3. 2
In [51]:
mydata_xlsx[-c(2, 6), -2] # Access dataset except 2nd and 6th rows, and 2nd column
A tibble: 17 × 4
serialfamilyincomeland
<dbl><dbl><dbl><chr>
122759.327urban
313516.885rural
413840.730urban
521002.499rural
741880.476urban
852519.266urban
953451.084rural
1032407.192urban
1111444.542urban
1221974.478urban
1353672.222rural
1452670.587rural
1544152.783urban
1651411.459rural
1722739.571rural
1814939.828urban
1914572.204urban
In [52]:
# ~ is used in formulas.
# For example, y ~ x means y is modeled as a function of x.
boxplot(age ~ family, data = mydata_xlsx) # Boxplot of age by family
No description has been provided for this image
In [53]:
pacman::p_load(car) # qqPlot() function is in the car package, you can load this at the beginning.
qqPlot(age ~ family, data = mydata_xlsx)
No description has been provided for this image

Most common error source¶

  • Spellings: R is case-sensitive, so make sure to use the correct spellings.
  • Brackets: Make sure to open and close the brackets properly.
  • Quotes: Use the correct quotes (' or ").
  • Directory: Make sure to set the correct working directory.
  • Packages: Make sure to install and load the required packages.
  • Functions: Make sure to use the correct functions and arguments.
  • Run the previous related commands that define the objects in the current line.

Data processing with tidy thinking¶

Select columns and filter rows¶

  • We will use tidyverse package for data manipulation.
  • Use select to select variables and filter to filter rows.
  • We will use pipe operator %>% (Ctrl + Shift + M) to combine multiple operations.
  • Left side of the %>% will be the input to the right side function.
In [54]:
mydata_xlsx %>% 
  select(age, family, land) %>% 
  filter(age >40, family %in% c(1, 2)) # Select serial, age, family columns and filter age > 45 and family 1 or 2
A tibble: 8 × 3
agefamilyland
<dbl><dbl><chr>
43.318572urban
49.676121rural
45.211531urban
45.387862rural
48.672251urban
46.079442urban
46.493552rural
47.104071urban

Long table (stacking) and Short table (unstacking)¶

  • Stacking: Combining multiple columns into a single column.
  • Unstacking: Splitting a single column into multiple columns.
In [55]:
# Stack the age and family columns into one column named merged
long = mydata_xlsx %>% 
  pivot_longer(cols = c(age, family) , names_to = 'merged', values_to = 'values') # Combine age and family columns into merged column
long
A tibble: 38 × 5
serialincomelandmergedvalues
<dbl><dbl><chr><chr><dbl>
12759.327urbanage 43.31857
12759.327urbanfamily 2.00000
24017.901ruralage 44.30947
24017.901ruralfamily 5.00000
33516.885ruralage 49.67612
33516.885ruralfamily 1.00000
43840.730urbanage 45.21153
43840.730urbanfamily 1.00000
51002.499ruralage 45.38786
51002.499ruralfamily 2.00000
62901.266ruralage 50.14519
62901.266ruralfamily 3.00000
71880.476urbanage 46.38275
71880.476urbanfamily 4.00000
82519.266urbanage 41.20482
82519.266urbanfamily 5.00000
93451.084ruralage 42.93944
93451.084ruralfamily 5.00000
102407.192urbanage 43.66301
102407.192urbanfamily 3.00000
111444.542urbanage 48.67225
111444.542urbanfamily 1.00000
121974.478urbanage 46.07944
121974.478urbanfamily 2.00000
133672.222ruralage 46.20231
133672.222ruralfamily 5.00000
142670.587ruralage 45.33205
142670.587ruralfamily 5.00000
154152.783urbanage 43.33248
154152.783urbanfamily 4.00000
161411.459ruralage 50.36074
161411.459ruralfamily 5.00000
172739.571ruralage 46.49355
172739.571ruralfamily 2.00000
184939.828urbanage 39.10015
184939.828urbanfamily 1.00000
194572.204urbanage 47.10407
194572.204urbanfamily 1.00000
In [56]:
# Unstack the merged column into age and family columns
short = long %>% 
  pivot_wider(names_from = merged, values_from = values) # Split merged column into age and family columns
short
A tibble: 19 × 5
serialincomelandagefamily
<dbl><dbl><chr><dbl><dbl>
12759.327urban43.318572
24017.901rural44.309475
33516.885rural49.676121
43840.730urban45.211531
51002.499rural45.387862
62901.266rural50.145193
71880.476urban46.382754
82519.266urban41.204825
93451.084rural42.939445
102407.192urban43.663013
111444.542urban48.672251
121974.478urban46.079442
133672.222rural46.202315
142670.587rural45.332055
154152.783urban43.332484
161411.459rural50.360745
172739.571rural46.493552
184939.828urban39.100151
194572.204urban47.104071
In [57]:
# Create new variables (income_per_member) using mutate() function
short = short %>% 
  mutate(income_per_member = income/family) # Create new variable income_per_member
head(short)
A tibble: 6 × 6
serialincomelandagefamilyincome_per_member
<dbl><dbl><chr><dbl><dbl><dbl>
12759.327urban43.3185721379.6634
24017.901rural44.309475 803.5801
33516.885rural49.6761213516.8845
43840.730urban45.2115313840.7296
51002.499rural45.387862 501.2495
62901.266rural50.145193 967.0888
In [58]:
# Create new variable (status: poor < 1000 income_per_member) using conditional statement
short = short %>% 
  mutate(status = ifelse(income_per_member < 1000, "poor", "rich")) # Create new variable status
head(short)
A tibble: 6 × 7
serialincomelandagefamilyincome_per_memberstatus
<dbl><dbl><chr><dbl><dbl><dbl><chr>
12759.327urban43.3185721379.6634rich
24017.901rural44.309475 803.5801poor
33516.885rural49.6761213516.8845rich
43840.730urban45.2115313840.7296rich
51002.499rural45.387862 501.2495poor
62901.266rural50.145193 967.0888poor
In [59]:
# Create new variable (status1: poor < 1000 income_per_member) using case_when() function
short = short %>% 
  mutate(status1 = case_when(
    income_per_member < 1000 ~ "poor",
    income_per_member >= 1000 ~ "rich"
  )) # Create new variable status1
head(short)
A tibble: 6 × 8
serialincomelandagefamilyincome_per_memberstatusstatus1
<dbl><dbl><chr><dbl><dbl><dbl><chr><chr>
12759.327urban43.3185721379.6634richrich
24017.901rural44.309475 803.5801poorpoor
33516.885rural49.6761213516.8845richrich
43840.730urban45.2115313840.7296richrich
51002.499rural45.387862 501.2495poorpoor
62901.266rural50.145193 967.0888poorpoor
In [60]:
# Create new variable (status2: poor < 1000, middle 1000-2000, rich > 2000 income_per_member) using ifelse() function and dollar sign
# This is not tidy thinking
# Tidy thinking uses  %>% 
short$status2 = ifelse(short$income_per_member < 1000, "poor", 
    ifelse(short$income_per_member <= 2000, "middle", "rich")) # Create new variable status2
head(short)
A tibble: 6 × 9
serialincomelandagefamilyincome_per_memberstatusstatus1status2
<dbl><dbl><chr><dbl><dbl><dbl><chr><chr><chr>
12759.327urban43.3185721379.6634richrichmiddle
24017.901rural44.309475 803.5801poorpoorpoor
33516.885rural49.6761213516.8845richrichrich
43840.730urban45.2115313840.7296richrichrich
51002.499rural45.387862 501.2495poorpoorpoor
62901.266rural50.145193 967.0888poorpoorpoor
In [61]:
# Cut the variable (income_per_member) into intervals (poor < 1000, middle 1000-2000, rich > 2000) using cut() function and save as status3
# right = FALSE means the intervals are left-closed
short$status3 = cut(short$income_per_member, right = FALSE,
                    breaks = c(-Inf, 1000, 1999.9999, Inf), 
                    labels = c("poor", "middle", "rich")) # Create new variable status3
head(short)
A tibble: 6 × 10
serialincomelandagefamilyincome_per_memberstatusstatus1status2status3
<dbl><dbl><chr><dbl><dbl><dbl><chr><chr><chr><fct>
12759.327urban43.3185721379.6634richrichmiddlemiddle
24017.901rural44.309475 803.5801poorpoorpoor poor
33516.885rural49.6761213516.8845richrichrich rich
43840.730urban45.2115313840.7296richrichrich rich
51002.499rural45.387862 501.2495poorpoorpoor poor
62901.266rural50.145193 967.0888poorpoorpoor poor
In [62]:
# Create a frequency table of status3
short %>% select(status3) %>% table() # tidy way
table(short$status3) # traditional way
status3
  poor middle   rich 
    11      4      4 
  poor middle   rich 
    11      4      4 
In [63]:
# Create a cross-tabulation of status3 and land
table(short$status3, short$land) # Cross-tabulation of status3 and land
# Tidy way
short %>% select(status3, land) %>% table()
        
         rural urban
  poor       7     4
  middle     1     3
  rich       1     3
        land
status3  rural urban
  poor       7     4
  middle     1     3
  rich       1     3

Use mydata_xlsx dataset for descriptive statistics¶

In [64]:
# Mean, SD, Median, Min, Max, Range, IQR, Summary functions
# Select a variable using $ sign and apply function
mean(mydata_xlsx$income) # Mean of income_per_member
sd(mydata_xlsx$income) # Standard deviation of income_per_member
median(mydata_xlsx$income) # Median of income_per_member
min(mydata_xlsx$income) # Minimum of income_per_member
max(mydata_xlsx$income) # Maximum of income_per_member
range(mydata_xlsx$income) # Range of income_per_member
IQR(mydata_xlsx$income) # Interquartile range of income_per_member
summary(mydata_xlsx$income) # Summary of income_per_member
2940.75255586128
1111.22906687394
2759.32675041258
1002.49909330159
4939.82791993767
  1. 1002.49909330159
  2. 4939.82791993767
1565.6412136741
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1002    2191    2759    2941    3756    4940 
In [65]:
# Tidy way
# se and cv: Standard error and coefficient of variation
# se = sd/sqrt(n), cv(%) = sd/mean*100
mydata_xlsx %>% summarise(mean = mean(income),
                         sd = sd(income),
                         iqr = IQR(income),
                         cv = sd(income)*100/mean(income),
                         se = sd(income)/sqrt(n()),
                         frequency = n())
A tibble: 1 × 6
meansdiqrcvsefrequency
<dbl><dbl><dbl><dbl><dbl><int>
2940.7531111.2291565.64137.78723254.933419
In [66]:
# Tidy way: summary by another factor (land)
# se and cv: Standard error and coefficient of variation
# se = sd/sqrt(n), cv(%) = sd/mean*100
mydata_xlsx %>% group_by(land) %>% summarise(mean = mean(income),
                         sd = sd(income),
                         iqr = IQR(income),
                         cv = sd(income)*100/mean(income),
                         se = sd(income)/sqrt(n()),
                         frequency = n())
A tibble: 2 × 7
landmeansdiqrcvsefrequency
<chr><dbl><dbl><dbl><dbl><dbl><int>
rural2820.3861022.733 846.297436.26216340.9109 9
urban3049.0821229.7851992.113640.33295388.892110
In [67]:
# Multiple variable, multiple functions
# transpose using t()
# split column names and sort according to variable
# rename() the columns
# separate() into multiple or unite() into one column
# format using kble() from kableExtra package
mydata_xlsx %>% 
    select(age, income, family) %>% 
    summarise_all(list(mean = mean, max = max)) %>% 
    t() %>% 
    as.data.frame() %>% 
    round(2) %>% 
    rownames_to_column('variable') %>% 
    separate(variable, c('variable', 'statistics'), sep = '_') %>% 
    arrange(variable) %>% 
    pivot_wider(names_from = statistics, values_from = V1)
A tibble: 3 × 3
variablemeanmax
<chr><dbl><dbl>
age 45.52 50.36
family 3.00 5.00
income2940.754939.83
In [68]:
# Custom function: Create a function to calculate the standard error
# Function name is followed by parentheses and curly brackets.
se = function(x) {
  return(sd(x)/sqrt(length(x)))
}

se(mydata_xlsx$income) %>% round(2) # Standard error of income_per_member
254.93
In [69]:
# IQR=Q3−Q1, Q2 = Median
# Boxplot: Q1, Q2, Q3, Whiskers, Outliers
# The lower whisker extends from Q1 to the minimum value within 1.5 × IQR below Q1.
# The upper whisker extends from Q3 to the maximum value within 1.5 × IQR above Q3.
# Outliers are the values beyond the whiskers.
boxplot(mydata_xlsx$income) # Boxplot of income_per_member
No description has been provided for this image

Changing variable types¶

In [70]:
head(mydata_xlsx)
A tibble: 6 × 5
serialagefamilyincomeland
<dbl><dbl><dbl><dbl><chr>
143.3185722759.327urban
244.3094754017.901rural
349.6761213516.885rural
445.2115313840.730urban
545.3878621002.499rural
650.1451932901.266rural
In [71]:
# We want to change the variable types and save (overwrite) in the original name of the dataset
mydata_xlsx = mydata_xlsx %>% mutate_if(is.character, as.factor)
head(mydata_xlsx)
A tibble: 6 × 5
serialagefamilyincomeland
<dbl><dbl><dbl><dbl><fct>
143.3185722759.327urban
244.3094754017.901rural
349.6761213516.885rural
445.2115313840.730urban
545.3878621002.499rural
650.1451932901.266rural
In [72]:
mydata_xlsx = mydata_xlsx %>% mutate_at(c('serial', 'family'), as.integer)
head(mydata_xlsx)
A tibble: 6 × 5
serialagefamilyincomeland
<int><dbl><int><dbl><fct>
143.3185722759.327urban
244.3094754017.901rural
349.6761213516.885rural
445.2115313840.730urban
545.3878621002.499rural
650.1451932901.266rural
In [73]:
mydata_xlsx = mydata_xlsx %>% mutate_at('family', as.numeric) %>% mutate_at(c('age', 'income'), round, 2)
head(mydata_xlsx)
A tibble: 6 × 5
serialagefamilyincomeland
<int><dbl><dbl><dbl><fct>
143.3222759.33urban
244.3154017.90rural
349.6813516.88rural
445.2113840.73urban
545.3921002.50rural
650.1532901.27rural

Demonstrate CLT (Center Limit Theorem) using this dataset¶

  • Central Limit Theorem (CLT) states that the sampling distribution of the sample mean approaches a normal distribution as the sample size increases (n>=30).
  • The mean of the sampling distribution is equal to the population mean.
  • 68% of the sample means fall within one standard error of the population mean.
  • 95% of the sample means fall within 1.96 standard errors of the population mean.
  • 99% of the sample means fall within 2.58 standard errors of the population mean.
  • Create a sampling distribution of the sample mean
  • Set the seed for reproducibility
In [74]:
set.seed(123)
n = 30 # Sample size
B = 1000 # Number of samples
sample_means = replicate(B, mean(sample(mydata_xlsx$income, n, replace = TRUE))) # Generate B sample means

# Calculate the mean and standard deviation of the sample means
(mean_sample_means = mean(sample_means)) # Mean of the sample means
(sd_sample_means = sd(sample_means)) # Standard deviation of the sample means
(se_sample_means = sd_sample_means) # Standard error of the sample means
2948.66764966667
196.155998728114
196.155998728114
In [75]:
# Plot the sampling distribution of the sample mean
hist(sample_means, breaks = 30, col = "skyblue", main = "Sampling Distribution of the Sample Mean", xlab = "Sample Mean")
No description has been provided for this image
In [76]:
# use second bracket to combine multiline codes
{ 
  plot(density(sample_means), col = "blue", 
       main = "Sampling Distribution of the Sample Mean", 
       xlab = "x-axis => Sample Mean or z-values \n Probability = Area under the curve.", 
       ylab = "Density or relative frequency", lty = 1, lwd = 2)
  abline(v = mean_sample_means, col = "red", lty = 2, lwd = 2) # Add a vertical line for the mean of the sample means
  abline(v = mean_sample_means + c(-1, 1)*se_sample_means*1.96, col = "black", lty = 1, lwd = 2) # Add vertical lines for the mean ± 1.96 SE
  text(3000, 0.0005, paste('mean =', round(mean_sample_means)), pos = 4, col = "red", srt = 90) # Add a text for the mean of the sample means
  text(mean_sample_means, 0, '0', pos = 3, col = "black") 
  text(3400, 0.0005, 'mean+1.96se', pos = 4, col = "black", srt=90)
  text(3200, 0.0000, '+1.96', pos = 4, col = "black", srt=0)
  text(2400, 0.0005, 'mean-1.96se', pos = 4, col = "black", srt=90)
  text(2400, 0.0000, '-1.96', pos = 4, col = "black", srt=0)
  text(2900, 0.0020, '95% CI [2557, 3341]', pos = 1, col = "black", srt=0)
}
No description has been provided for this image
In [77]:
# Now sort the mydata_xlsx$income dataset by income in ascending order
income_ascending = mydata_xlsx %>% select(income) %>% arrange()
income_ascending
A tibble: 19 × 1
income
<dbl>
2759.33
4017.90
3516.88
3840.73
1002.50
2901.27
1880.48
2519.27
3451.08
2407.19
1444.54
1974.48
3672.22
2670.59
4152.78
1411.46
2739.57
4939.83
4572.20
In [78]:
# Now sort the mydata_xlsx$income dataset by income in ascending order
income_ascending = sort(mydata_xlsx$income, decreasing = FALSE)
# Or tidy thinking
income_ascending = mydata_xlsx %>% arrange(income, decreasing = FALSE)
head(income_ascending)
A tibble: 6 × 5
serialagefamilyincomeland
<int><dbl><dbl><dbl><fct>
545.3921002.50rural
1650.3651411.46rural
1148.6711444.54urban
746.3841880.48urban
1246.0821974.48urban
1043.6632407.19urban
In [79]:
# Find the 2.5th and 97.5th percentiles of the income dataset
income_ascending %>% select(income) %>% summarise_all(quantile, c(0.025, 0.975)) %>% round(2)
# 95% CI [1186.531, 4774.397]
Warning message:
"Returning more (or less) than 1 row per `summarise()` group was deprecated in
dplyr 1.1.0.
ℹ Please use `reframe()` instead.
ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
  always returns an ungrouped data frame and adjust accordingly.
ℹ The deprecated feature was likely used in the dplyr package.
  Please report the issue at <https://github.com/tidyverse/dplyr/issues>."
A tibble: 2 × 1
income
<dbl>
1186.53
4774.40
In [80]:
# Calculate CI from the sample mean
mean_income = mean(mydata_xlsx$income)
sd_income = sd(mydata_xlsx$income)
se_income = se(mydata_xlsx$income)
In [81]:
c((mean_income - 1.96*se_income),(mean_income + 1.96*se_income)) %>% round(2)
# 95% CI [2441.083, 3440.422]
  1. 2441.08
  2. 3440.42
  • From the plot, 95% CI [2557, 3341] and 95% CI [2441.083, 3440.422] are close to each other. These are based on distribution based.
  • If the sample is not normally distributed, the CI based on the distribution may not be accurate.
  • Therefore, we need to check the normality of a variable or residuals when it is necessary, otherwise the test statistics will not be reliable.
In [82]:
# Conditional statement
# if, else if, else
# if (condition) {statement} else {statement}
# if (condition) {statement} else if (condition) {statement} else {statement}

if (mean(mydata_xlsx$income) > 3000) {
  print("The mean income is greater than 3000.")
} else {
  print("The mean income is less than or equal to 3000.")
}
[1] "The mean income is less than or equal to 3000."
In [83]:
if (mean(mydata_xlsx$income) > 4000) {
  print("The mean income is greater than 4000.")
} else if (mean(mydata_xlsx$income) > 3000) {
  print("The mean income is greater than 3000 but less than or equal to 4000.")
} else {
  print("The mean income is less than or equal to 3000.")
}
[1] "The mean income is less than or equal to 3000."
In [84]:
# Loop function
# for loop, while loop
# for (variable in sequence) {statement}
for (i in 1:5) {
  print(i)
}
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
In [85]:
# Iterate the mean function over three columns of the dataset and same the means in means object
means = c() # blank vector for storing calculated means
for (column in c("age", "family", "income")) {
  means = c(means, mean(mydata_xlsx[[column]]))
}
means %>% round(2)
  1. 45.52
  2. 3
  3. 2940.75
In [86]:
# same output using apply function
apply(mydata_xlsx[c("age", "family", "income")], 2, mean) %>% round(2)
age
45.52
family
3
income
2940.75
In [87]:
# same output in tidy thinking
mydata_xlsx %>% select(age, family, income) %>% summarise_all(mean) %>% round(2)
A tibble: 1 × 3
agefamilyincome
<dbl><dbl><dbl>
45.5232940.75
In [88]:
# Look the format of the column indexing.
# mydata_xlsx[[column]] is used for indexing the columns of the dataset.
# Here, column = 1, in the first iteration to calculate the mean of age
# column = 2, in the second iteration to calculate the mean of family, and so on.
# Just for your checking
mydata_xlsx[[2]] # First column of the dataset (value only)
mydata_xlsx['age'] # This contains variable name and values
mydata_xlsx[['age']] # This contains only values
  1. 43.32
  2. 44.31
  3. 49.68
  4. 45.21
  5. 45.39
  6. 50.15
  7. 46.38
  8. 41.2
  9. 42.94
  10. 43.66
  11. 48.67
  12. 46.08
  13. 46.2
  14. 45.33
  15. 43.33
  16. 50.36
  17. 46.49
  18. 39.1
  19. 47.1
A tibble: 19 × 1
age
<dbl>
43.32
44.31
49.68
45.21
45.39
50.15
46.38
41.20
42.94
43.66
48.67
46.08
46.20
45.33
43.33
50.36
46.49
39.10
47.10
  1. 43.32
  2. 44.31
  3. 49.68
  4. 45.21
  5. 45.39
  6. 50.15
  7. 46.38
  8. 41.2
  9. 42.94
  10. 43.66
  11. 48.67
  12. 46.08
  13. 46.2
  14. 45.33
  15. 43.33
  16. 50.36
  17. 46.49
  18. 39.1
  19. 47.1

Data visualization: ggplot2 package¶

In [89]:
# ggplot2 is a powerful package for creating graphics in R.
# It is based on the grammar of graphics, which is a structured way (layer by layer) of creating graphics.
# There are different ways to put the dataset name and aes-thetics as well as geom-etric shapes
ggplot(mtcars) + 
    geom_violin(aes(cyl, mpg, group = cyl))
No description has been provided for this image
In [90]:
ggplot(mtcars, aes(cyl, mpg, group = cyl)) + 
    geom_violin()
No description has been provided for this image
In [91]:
ggplot(mtcars) + 
    aes(cyl, mpg, group = cyl) + 
    geom_violin()
No description has been provided for this image
In [92]:
mtcars %>% ggplot() + 
    aes(cyl, mpg, group = cyl) + 
    geom_violin()
No description has been provided for this image
In [93]:
mtcars %>% ggplot(aes(cyl, mpg, group = cyl) ) + 
    geom_violin()
No description has been provided for this image
In [94]:
p = mtcars %>% ggplot(aes(cyl, mpg, group = cyl))
p = p + geom_violin()
p = p + theme_bw()
p
No description has been provided for this image
In [95]:
# Scattered plot of x=age and y=income
pacman::p_load(jtools) # for theme_apa()
ggplot(mydata_xlsx, aes(x = age, y = income)) + 
  geom_point() + # Scatter plot of age and income
  labs(title = "Scatter plot of Age and Income", x = "Age", y = "Income") + # Add title and axis labels
  theme_bw()
No description has been provided for this image
In [96]:
# Boxplot of income by land
ggplot(mydata_xlsx, aes(x = land, y = income)) + 
  geom_boxplot() + # Boxplot of income by land
  labs(title = "Boxplot of Income by Land", x = "Land", y = "Income") + # Add title and axis labels
  theme_apa()
No description has been provided for this image
In [97]:
# Histogram of income
ggplot(mydata_xlsx, aes(x = income)) + 
  geom_histogram(binwidth = 500, fill = "skyblue", color = "white") + # Histogram of income
  labs(title = "Histogram of Income", x = "Income", y = "Frequency") + # Add title and axis labels
  theme_test()
No description has been provided for this image
In [98]:
# Density plot of income by land
ggplot(mydata_xlsx, aes(x = income, fill = land)) + 
  geom_density(alpha = 0.5) + 
  labs(title = "Fig. 1.1 Density Plot of Income by Land", x = "Income", y = "Density") +

  theme_apa()+
  theme(plot.title = element_text(hjust = 0.5, vjust = -150),
        plot.margin = margin(10, 10, 50, 10),
        legend.position = 'top')
No description has been provided for this image
In [99]:
# Bar plot of family by land
ggplot(mydata_xlsx, aes(x = family, fill = land)) + 
  geom_bar(position = "dodge") + # Bar plot of family by land
  labs(title = "Bar Plot of Family by Land", x = "Family", y = "Count") + 
  theme_apa()
No description has been provided for this image
In [100]:
# Scatter plot of income by age
ggplot(mydata_xlsx, aes(x = age, y = income, group = 1)) + 
  geom_point(color = "tomato") + 
  labs(title = "Line Plot of Income by Age", x = "Age", y = "Income") + 
  theme_apa() +
  geom_smooth(method = "lm", se = TRUE, formula = y ~ x)
No description has been provided for this image
In [101]:
# Pie chart of family labelled by percentage
# Calculate counts and proportions for each family
pie_data = mydata_xlsx %>%
  group_by(family) %>%
  summarise(count = n()) %>%
  mutate(percentage = 100 * count / sum(count))
In [102]:
# Create the pie chart
ggplot(pie_data, aes(x = "", y = count, fill = factor(family))) +
  geom_bar(stat = "identity", width = 1) + 
  coord_polar("y", start = 0) +  
  geom_text(aes(label = paste0(round(percentage, 1), '%')   ), 
            position = position_stack(vjust = 0.5), size = 6) + 
  labs(title = "Pie Chart of Family", fill = "Family") + 
  scale_fill_discrete('Family size') +
  theme_void() +  
  theme(legend.position = "right",
        plot.title = element_text(hjust = 0.5, size = rel(2)))
No description has been provided for this image
In [103]:
# Using base pie()
x = pie_data$count
percent = paste0(round(pie_data$percentage, 2), '%')
categories = c('One member', 'Two member', 'Three member', 'Four member', 'Five member')
labels = paste0(categories, '\n [', percent, ']')

pie(x, labels, col = rainbow(5))
title(main = "Fig. Pie Chart of Family", adj = 0.5, line = -27)
No description has been provided for this image

Bar chart with calculated mean and 95% confidence interval¶

In [104]:
# Read specific range (B7:N507) from a sheet (wrangling) in excel file (DataSets.xlsx)
df = read_excel('DataSets.xlsx', sheet = 'wrangling', range = 'B7:N507')
head(df)
A tibble: 6 × 13
sr.janfebmaraprmayjunjulaugsepoctnovdec
<dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl>
1011000000000
2011100000000
3101010000000
4001101001000
5100000000100
6010000000000
In [105]:
# We will exclude sr. column
df %>% select(-sr.) %>% head()
A tibble: 6 × 12
janfebmaraprmayjunjulaugsepoctnovdec
<dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl>
011000000000
011100000000
101010000000
001101001000
100000000100
010000000000
In [106]:
# We will calculate means, se, upper and lower limits from this dataset and plot in ggplot
# Define the lower and upper function
# Reorder factors levels: # plot_df$Months = factor(plot_df$Months, 
# levels = c('jan','feb','mar','apr','may','jun','jul','aug','sep','oct','nov','dec'), 
# labels = c('Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec'))

se = function(x){sd(x)/sqrt(length(x))}
lower95 = function(x){mean(x) - 1.96*se(x)}
upper95 = function(x){mean(x) + 1.96*se(x)}

plot_df = df %>% 
    select(-sr.) %>% 
    summarise_all(list(mean = mean, lower = lower95, upper = upper95)) %>% 
    t() %>% as.data.frame() %>% 
    rownames_to_column('variable') %>% 
    separate(variable, c('Months', 'Statistics')) %>% 
    pivot_wider(names_from = Statistics, values_from = V1) %>% 
    mutate(Months = factor(Months, levels = c('jan','feb','mar','apr','may','jun','jul','aug','sep','oct','nov','dec'),
                          labels = c('Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec')))
plot_df
A tibble: 12 × 4
Monthsmeanlowerupper
<fct><dbl><dbl><dbl>
Jan0.2860.246350480.32564952
Feb0.3540.312041140.39595886
Mar0.3920.349164820.43483518
Apr0.3580.315935590.40006441
May0.2240.187418570.26058143
Jun0.2180.181772580.25422742
Jul0.1800.146290760.21370924
Aug0.0900.064889940.11511006
Sep0.0820.057926790.10607321
Oct0.0800.056196290.10380371
Nov0.0760.052748630.09925137
Dec0.0420.024399980.05960002
In [107]:
ggplot(plot_df, aes(x = Months, y = mean, group = 1, fill = Months)) +
    geom_col() +
    geom_errorbar(aes(ymin = lower, ymax = upper), width = 0.25) +
    geom_text(aes(y = upper + 0.01, label = formatC(mean, format = 'f', digits = 2))) +
    theme_bw() +
    theme(legend.position = 'none') +
    labs(x = '', y = 'Average proportion of water scarcity')
No description has been provided for this image
In [108]:
# More complex, try to learn on your own
# Spider plot or radar chart
# devtools::install_github("ricardo-bion/ggradar")
pacman::p_load(ggradar, scales, patchwork)
In [109]:
# Create a dataset for the spider plot
set.seed(1)
spider_data = as.data.frame(
  matrix(sample(1:10 , 96 , replace = TRUE),
         ncol=8, byrow = TRUE))
spider_data
A data.frame: 12 × 8
V1V2V3V4V5V6V7V8
<int><int><int><int><int><int><int><int>
9 4 7 1 2 7 2 3
1 5 510 610 7 9
5 5 9 9 5 5 210
9 1 4 3 61010 6
4 410 9 7 6 9 8
9 7 8 610 7 310
6 8 2 2 6 6 1 3
3 8 6 7 6 8 7 1
4 8 9 9 7 4 7 6
1 5 6 1 9 7 7 3
6 21010 7 3 210
11010 810 5 7 8
In [110]:
# Change the column names
colnames(spider_data) = c(  "Dhaka", "Khulna", "Barishal", "Chittagong", "Rangpur", 
                            "Rajshahi", "Sylhet", "Cumilla")
In [111]:
# Scale the data between 0 and 1 and keep two decimal places
scaled_data <- round(apply(spider_data, 2, scales::rescale), 2)
plot_data <- as.data.frame(scaled_data)
plot_data
A data.frame: 12 × 8
DhakaKhulnaBarishalChittagongRangpurRajshahiSylhetCumilla
<dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl>
1.000.330.620.000.000.570.110.22
0.000.440.381.000.501.000.670.89
0.500.440.880.890.380.290.111.00
1.000.000.250.220.501.001.000.56
0.380.331.000.890.620.430.890.78
1.000.670.750.561.000.570.221.00
0.620.780.000.110.500.430.000.22
0.250.780.500.670.500.710.670.00
0.380.780.880.890.620.140.670.56
0.000.440.500.000.880.570.670.22
0.620.111.001.000.620.000.111.00
0.001.001.000.781.000.290.670.78
In [112]:
# Create the radar plot for the first row of the dataset
ggradar(plot_data[1, ], 
        values.radar = c('0%', '50%', '100%'),
        grid.min = 0, grid.mid = 0.5, grid.max = 1,
        legend.position = "top")
No description has been provided for this image
In [113]:
# Create the radar plot for two rows of the dataset 
ggradar(plot_data[2:3,], 
        values.radar = c('0%', '50%', '100%'),
        grid.min = 0, grid.mid = 0.5, grid.max = 1,
        legend.position = "top")
No description has been provided for this image
In [114]:
# Create the radar plot for all rows of the dataset using a loop
# Create an empty list to store the plots
options(repr.plot.height=18)
{radar_plots = list()
  
  for (i in 1:nrow(plot_data)) {
    radar_plot = ggradar(plot_data[i,], 
                         values.radar = c('0%', '50%', '100%'),
                         grid.min = 0, grid.mid = 0.5, grid.max = 1,
                         plot.title = row.names(plot_data)[i],
                         legend.position = "top",
                         group.point.size = 3,
                         group.line.width = 1,
                         axis.label.size = 4,
                         grid.label.size = 4,
                         base.size = 15) +
      theme(plot.title = element_text(size = 15, hjust = 0.5))
    
    radar_plots[[i]] = radar_plot # Store the radar plot in the list
  }
  
  wrap_plots(radar_plots, ncol = 3) # Combine the radar plots into a single plot
}
No description has been provided for this image
In [115]:
# Save the plot as a .png file
ggsave("spider_plot.png", dpi = 300, width = 8, height = 9, units = 'in')

Click here for additional bivariate and multivariate plots commands.¶