Data import to R studio

P K Parida, CRFM

Look at your file

  • Look at your file location

  • Look at file extension / type ( .csv, .xlsx)

  • Right click on you file and go to the properties and see the location, copy the location and paste in the source editor pane of R studio

  • Change the back slash'\' format to forward slash format “/”

  • "C:\ Users\ PK Parida\ Documents" to "C:/Users/PK Parida/Documents" and then a forward slash and then the file name with extension

read your file by read.csv or read_csv or read_excel

Read the file

  • w1 <- read.csv(“~/mc3.csv”)

  • Library (Tidyverse)

  • w1 <- read_csv (“~/mc3.csv”)

  • if you want to read the excel file

  • library(readxl)

  • dataset <- read_excel(“~/table6_3.xlsx”)

read_csv() will always read variables containing text as character variables. In contrast, the base R function read.csv() will, by default, convert any character variable to a factor

Direct import the file from the Environment pane

  • click on import sign in the environment pane

  • select the type of file you are interested

  • then browse the file from your location, select the file and click on open

  • the details of file will open in environment pane

  • select the sheet number or name if it is an excel file

for csv or TSV file select the “From Text (base)” in the import sign

saving a file in R studio

  • CSVs a little unreliable for caching interim results—you need to recreate the column specification every time you load in.

  • `write_csv(table6, “table6.csv”)`

  • where student is the name of the analysis in R

  • write_rds() and read_rds() are uniform wrappers around the base functions readRDS() and saveRDS().

  • `write_rds(table6, “table6.rds”)`

  • reading a rds file in R : `read_rds(“table6.rds”)`

Reading the semicolon - separated file and TSV files

  • read_csv2() reads semicolon-separated files. These use ; instead of , to separate fields and are common in countries that use , as the decimal marker.

  • read_tsv() reads tab-delimited files.Reading file in R studio

Reading package in R studio

library(tidyverse)

Read Table from excel file

library(readxl)

table2 <- read_excel("table6_3.xlsx", sheet = "Sheet2")

glimpse(table2)
Rows: 12
Columns: 4
$ Country <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", "B…
$ year    <dbl> 1999, 1999, 2000, 2000, 1999, 1999, 2000, 2000, 1999, 1999, 20…
$ type    <chr> "cases", "population", "cases", "population", "cases", "popula…
$ count   <dbl> 745, 19987071, 2666, 20595360, 37737, 172006362, 80488, 174504…

Cases per country per year

table2 |>
  group_by(Country, year) |>
  filter(type ==  "cases")
# A tibble: 6 × 4
# Groups:   Country, year [6]
  Country      year type   count
  <chr>       <dbl> <chr>  <dbl>
1 Afghanistan  1999 cases    745
2 Afghanistan  2000 cases   2666
3 Brazil       1999 cases  37737
4 Brazil       2000 cases  80488
5 China        1999 cases 212258
6 China        2000 cases 213766

Cases per country per year - 2

table2 |>
  group_by(Country, year) |>
  filter(type %in%  "cases")
# A tibble: 6 × 4
# Groups:   Country, year [6]
  Country      year type   count
  <chr>       <dbl> <chr>  <dbl>
1 Afghanistan  1999 cases    745
2 Afghanistan  2000 cases   2666
3 Brazil       1999 cases  37737
4 Brazil       2000 cases  80488
5 China        1999 cases 212258
6 China        2000 cases 213766

Wide the Table

table2_1 = pivot_wider(table2, names_from = type, values_from = count)

glimpse(table2_1)
Rows: 6
Columns: 4
$ Country    <chr> "Afghanistan", "Afghanistan", "Brazil", "Brazil", "China", …
$ year       <dbl> 1999, 2000, 1999, 2000, 1999, 2000
$ cases      <dbl> 745, 2666, 37737, 80488, 212258, 213766
$ population <dbl> 19987071, 20595360, 172006362, 174504898, 1272915272, 12804…

Extract the TB cases per year

table2_1 |> 
  group_by(Country, year, cases) |> 
  summarize(total_cases = sum(cases))
# A tibble: 6 × 4
# Groups:   Country, year [6]
  Country      year  cases total_cases
  <chr>       <dbl>  <dbl>       <dbl>
1 Afghanistan  1999    745         745
2 Afghanistan  2000   2666        2666
3 Brazil       1999  37737       37737
4 Brazil       2000  80488       80488
5 China        1999 212258      212258
6 China        2000 213766      213766

Alternatively

# as the cases and sum of case per year is same, so alternatively

table2_1 |> 
  select(Country, year, cases)
# A tibble: 6 × 3
  Country      year  cases
  <chr>       <dbl>  <dbl>
1 Afghanistan  1999    745
2 Afghanistan  2000   2666
3 Brazil       1999  37737
4 Brazil       2000  80488
5 China        1999 212258
6 China        2000 213766

Rate of cases

# caculation of rate of cases

#Rate: Divide cases by population, and multiply by 10000

table2_1 |>
  mutate(rate = cases / population * 10000)
# A tibble: 6 × 5
  Country      year  cases population  rate
  <chr>       <dbl>  <dbl>      <dbl> <dbl>
1 Afghanistan  1999    745   19987071 0.373
2 Afghanistan  2000   2666   20595360 1.29 
3 Brazil       1999  37737  172006362 2.19 
4 Brazil       2000  80488  174504898 4.61 
5 China        1999 212258 1272915272 1.67 
6 China        2000 213766 1280428583 1.67 

Numbers in Different format

table3 <- read_excel("table6_3.xlsx")

glimpse(table3)
Rows: 6
Columns: 3
$ Country <chr> "Afghanistan", "Afghanistan", "Brazil", "Brazil", "China", "Ch…
$ year    <dbl> 1999, 2000, 1999, 2000, 1999, 2000
$ rate    <chr> "745/19987071", "2666/20595360", "37737/172006362", "80488/174…

Separte the numbers from the above Table

table3_1 = table3 |> 
  separate(rate, c("cases", "population")) 

glimpse(table3_1)
Rows: 6
Columns: 4
$ Country    <chr> "Afghanistan", "Afghanistan", "Brazil", "Brazil", "China", …
$ year       <dbl> 1999, 2000, 1999, 2000, 1999, 2000
$ cases      <chr> "745", "2666", "37737", "80488", "212258", "213766"
$ population <chr> "19987071", "20595360", "172006362", "174504898", "12729152…