Working with Data

Recitation 2

Rithika Kumar

September 12, 2019

Goals for today

  1. Review Week 1
  2. Importing Data
  3. Exploring Data/Functions
  4. Save/Load data

Review

Revisiting RStudio

getwd() # Shows the current working directory
# Spaces in a path need no backslash inside an R string
setwd("/Users/rithika/Google Drive/Penn/TA/Intro to DS/Rk_Recitation/Data")
# Insert the path of the directory of your choice
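
A quick way to confirm that the working directory is the one you expect is to list its contents and check for the file you need. The sketch below uses only base R. The folder name `recitation_demo` is made up for illustration, and `tempdir()` stands in for your own course folder.

```r
# Build a path piece by piece; file.path() inserts the separators.
# "recitation_demo" is a hypothetical folder name.
demo_dir <- file.path(tempdir(), "recitation_demo")
dir.create(demo_dir, showWarnings = FALSE)

old_wd <- setwd(demo_dir)   # setwd() invisibly returns the previous directory
getwd()                     # confirm where we are now
list.files()                # see which files R can find here
file.exists("country_profile_variables.csv")  # TRUE only once the file is in this folder

setwd(old_wd)               # go back to where we started
```

If `file.exists()` returns FALSE for a file you expect, the working directory is usually the first thing to check.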

Working with Data

Importing Data

  1. Download country_profile_variables.csv from Canvas and place it in your working directory

  2. Use read.csv() to load the dataset

#Loading the dataset

country.profile <- read.csv("country_profile_variables.csv")
  3. Don’t forget to assign a name to the dataset. (Remember what we discussed about R objects in the previous recitation?)
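
As a hedged illustration of what read.csv() does, the sketch below writes a tiny made-up CSV to a temporary file and reads it back. The data and the object name `toy` are invented for the example. Whether text columns arrive as factors or as characters depends on the `stringsAsFactors` argument, whose default changed from TRUE to FALSE in R 4.0.0.

```r
# A tiny, made-up CSV written to a temporary file
csv_path <- tempfile(fileext = ".csv")
writeLines(c("country,Region,Sex ratio (m per 100 f)",
             "A,North,101.5",
             "B,South,-99",
             "C,North,98.2"),
           csv_path)

# stringsAsFactors = TRUE reproduces the pre-4.0.0 default
toy <- read.csv(csv_path, stringsAsFactors = TRUE)

colnames(toy)        # spaces and brackets in the header become dots
class(toy$country)   # "factor", because of stringsAsFactors = TRUE
```

This is why the column names in our dataset contain runs of dots: read.csv() turns characters that are not valid in R names into ".".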

Now we have a data set to explore!

Exploring our dataset

Properties of the dataset

The first things we want to know when we start working with a new data frame (DF) are:

  1. How many rows and columns does it have?
dim(country.profile) 
## [1] 229  50
  2. What are the column names?
colnames(country.profile) 
## [1] "country" "Region" "Surface.area..km2."
## ...and so on

Each column name is the name of a variable in the dataset.

  3. What does our data actually look like?
# Displaying the top 2 rows of our dataset
head(country.profile, 2)
##    country         Region Surface.area..km2. Population.in.thousands..2017.
## 1 Afghanistan   SouthernAsia             652864          35530
## ... (second row and remaining columns not shown)

TIP: Use ? to get help on a function, e.g. ?class
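
A few other base R functions answer the same first questions. The sketch below runs on a made-up data frame; `toy` and its columns are hypothetical names, not the course dataset.

```r
# Made-up data frame for illustration
toy <- data.frame(country = c("A", "B", "C", "D"),
                  Region  = c("North", "South", "North", "East"),
                  ratio   = c(101.5, -99, 98.2, 100.0))

nrow(toy)      # number of rows: 4
ncol(toy)      # number of columns: 3
str(toy)       # one line per column: name, class, first few values
tail(toy, 2)   # the last 2 rows, the counterpart of head()
```

str() is often the fastest overview because it shows the class of every column at once.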

Data structures

  1. Class of a column ($ sign lets you call a column from the df)
class(country.profile$country) 
## [1] "factor"       

class(country.profile$Sex.ratio..m.per.100.f..2017.)
## [1] "numeric"

Class of Objects (Source: adapted from Evelyne Brie)

  2. Getting a better idea of the values in the dataset
summary(country.profile) 
  3. Similarly, you can get the summary of a single column
summary(country.profile$country) 
summary(country.profile$Sex.ratio..m.per.100.f..2017.) 
#  Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
# -99.0    96.4    99.0   100.2   101.7   301.2 

If you call country.profile$country on its own, without wrapping it in a function, R displays the whole column. Give it a try.
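
What summary() returns depends on the class of the column. A minimal sketch on made-up data (`toy`, `country` and `ratio` are hypothetical names):

```r
# Made-up data: one factor column and one numeric column
toy <- data.frame(country = factor(c("A", "B", "C", "D")),
                  ratio   = c(101.5, -99, 98.2, 100.0))

class(toy$country)     # "factor"
class(toy$ratio)       # "numeric"

summary(toy$country)   # for a factor: a count per level
summary(toy$ratio)     # for a numeric: min, quartiles, median, mean, max
levels(toy$country)    # the distinct categories of a factor
```

Checking the class first tells you which kind of summary to expect.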

Subsetting data

Strangely, the minimum sex ratio in the summary is -99. A sex ratio cannot be negative, so this is most likely a placeholder for missing data. We don’t want such rows in our df, so let’s try to get rid of them.

  1. Identify which rows have this minimum value
which(country.profile$
        Sex.ratio..m.per.100.f..2017. == -99)
# This asks: which rows have the value -99
# in the sex ratio column?

# Displaying the result

# [1] 26 170

# This also tells us that there are
# two countries with a sex ratio of -99.
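
To see how which() relates to counting, here is a small sketch on a made-up vector (`ratio` and `is_missing` are hypothetical names). The comparison produces a logical vector; which() reports the positions of the TRUE values and sum() counts them.

```r
ratio <- c(101.5, -99, 98.2, -99, 100.0)   # made-up values

is_missing <- ratio == -99   # TRUE wherever the value is -99
which(is_missing)            # positions of the TRUE values: 2 4
sum(is_missing)              # TRUE counts as 1, so this is the count: 2
length(which(is_missing))    # same count, via the positions
```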

Indexing

Before going ahead, let’s remind ourselves of indexing, i.e. the square brackets [row, column]

country.profile[22,2] - get data from cell

country.profile[22,] - if you want all columns from a specific row

Try: country.profile[,3] - what do you get?
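
The same bracket patterns can be tried safely on a small made-up data frame first (`toy` and its columns are hypothetical names):

```r
# Made-up data frame for illustration
toy <- data.frame(country = c("A", "B", "C", "D"),
                  Region  = c("North", "South", "North", "East"),
                  ratio   = c(101.5, -99, 98.2, 100.0))

toy[2, 3]          # one cell: row 2, column 3
toy[2, ]           # all columns of row 2
toy[, 3]           # all rows of column 3, returned as a vector
toy[, "ratio"]     # columns can also be picked by name
toy[c(1, 3), c("country", "ratio")]   # several rows and columns at once
```

Leaving the position before or after the comma empty means "everything" along that dimension.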

  2. Create a new indicator variable called “SR_updated”
country.profile$SR_updated <- NA # Creating an empty variable

country.profile$SR_updated[country.profile$
      Sex.ratio..m.per.100.f..2017. == -99] <- 0
# All countries with SR = -99 are assigned a value of 0
country.profile$SR_updated[country.profile$
         Sex.ratio..m.per.100.f..2017. > -99] <- 1
# All countries with SR > -99 are assigned a value of 1

  3. Create a new dataset called “country2” using subset()
country2 <- subset(country.profile, SR_updated==1)

# Alternatively: creating the same "country2" dataset using %in%
# (same output as using subset())
country2 <- country.profile[country.profile$SR_updated %in% 1,]

  4. Now look at the dimensions of the dataset using dim()
dim(country2)
## [1] 227  51

# How many rows were deleted from the original dataset?
dim(country.profile)[1] - dim(country2)[1]
## [1] 2
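
The indicator variable is one route. A logical condition can also be passed straight to subset() or to the square brackets. A minimal sketch on made-up data (`toy`, `ratio`, `kept_a` and `kept_b` are hypothetical names, not part of the course dataset):

```r
# Made-up data frame for illustration
toy <- data.frame(country = c("A", "B", "C", "D"),
                  ratio   = c(101.5, -99, 98.2, 100.0))

# Option 1: subset() with the condition written directly
kept_a <- subset(toy, ratio != -99)

# Option 2: logical indexing with square brackets
kept_b <- toy[toy$ratio != -99, ]

nrow(toy) - nrow(kept_a)   # rows dropped by option 1: 1
nrow(toy) - nrow(kept_b)   # rows dropped by option 2: 1
```

One difference to keep in mind: when the condition evaluates to NA for some row, subset() drops that row, while bracket indexing returns a row filled with NA.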

Saving our data

Saving our New Dataset as an .RData File Using save()

Let’s save our new “country2” dataset as an .RData file, a format designed for use with R.

save(country2, file="country2.RData")

Now go to your working directory and check that the file has been saved there.
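
The counterpart of save() is load(), which restores the saved objects under their original names. A minimal round-trip sketch, using a made-up data frame and a temporary file rather than the course data (`toy` and `rdata_path` are hypothetical names):

```r
# Made-up data frame for illustration
toy <- data.frame(country = c("A", "B", "C", "D"),
                  ratio   = c(101.5, -99, 98.2, 100.0))

rdata_path <- file.path(tempdir(), "toy.RData")
save(toy, file = rdata_path)      # write the object to disk
file.exists(rdata_path)           # check that the file was created

rm(toy)                           # remove the object from the workspace

loaded_names <- load(rdata_path)  # load() returns the names of the restored objects
loaded_names
dim(toy)                          # the data frame is back under its original name
```

Note that load() does not need an assignment: the object reappears with the name it had when it was saved.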

Exercise:

  1. Using the country2 dataset you created, identify the class of the column called Region

Relevant function: class()

  2. Get the names of the regions that the countries in the df belong to.

Relevant function: table()

  3. Looking at the table you just created, identify the region that contains the most countries.

  4. Now create a subset called w.asia of the countries that lie in Western Asia

Relevant function: subset()

  5. Find the maximum sex ratio in this new dataset

Relevant function: max() or summary()