Working with Data

Recitation 2

Rithika Kumar

September 12, 2019

Goal for today

Review Week 1
Importing Data
Exploring Data/Functions
Save/Load data

Revisiting RStudio

Open up RStudio
Create and save a script
Create a new object
Set your working directory

getwd()
setwd("/Users/rithika/Google Drive/Penn/TA/Intro to DS/Rk_Recitation/Data")
# Insert the path of the directory of your choice.
# Spaces in an R path do not need a backslash, and the path must stay on one line.
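
A slightly more defensive version of the same step, as a minimal sketch. The folder used here is a hypothetical temporary directory created only for illustration, not the course folder. It builds the path with file.path(), changes directory only if the folder exists, and switches back afterwards.

```r
# Hypothetical folder for illustration: a temporary directory, not the course path.
data_dir <- file.path(tempdir(), "recitation_data")
dir.create(data_dir, showWarnings = FALSE)

old_wd <- getwd()              # remember where we started
if (dir.exists(data_dir)) {
  setwd(data_dir)              # switch only if the folder is really there
}
new_wd <- getwd()              # confirm the change
list.files()                   # files visible from the new working directory
setwd(old_wd)                  # go back so the rest of the session is unaffected
```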

Importing Data

Download country_profile_variables.csv from Canvas and place it in your working directory
Use read.csv() to load the dataset

#Loading the dataset

country.profile <- read.csv("country_profile_variables.csv")

Don’t forget to assign a name to the dataset (Remember what we discussed about R Objects in the previous recitation?)

Now we have a data set to explore!
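
If you do not have the Canvas file at hand, the import step can be tried on a toy file. This is a sketch only: the file is written to a temporary location and its three columns and values are invented for illustration.

```r
# Toy CSV written to a temporary file (values invented for illustration).
toy_path <- tempfile(fileext = ".csv")
writeLines(c("country,Region,Population",
             "A,North,10",
             "B,South,20"), toy_path)

# read.csv() treats the first line as column names by default.
toy <- read.csv(toy_path)
toy
```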

Properties of the dataset

The first thing we want to know when we start working with a new data frame (df) is:

How many rows and columns does it have?

dim(country.profile) 
## [1] 229  50

What are the column names?

colnames(country.profile) 
## [1] "country"  "Region"  "Surface.area..km2."
## ...and so on

Each column name that we have here is the name of the variable in the dataset
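
The same property checks on a small made-up data frame, as a sketch. The object toy and its values are invented for illustration. nrow() and ncol() return the two parts of dim() separately.

```r
# Toy data frame (invented values).
toy <- data.frame(country = c("A", "B", "C"),
                  Region  = c("North", "South", "North"),
                  Pop     = c(10, 20, 30))

dim(toy)        # number of rows, then number of columns
nrow(toy)       # rows only
ncol(toy)       # columns only
colnames(toy)   # variable names
```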

What does our data really look like?

# Displaying the top 2 rows of our dataset
head(country.profile, 2)
##       country       Region Surface.area..km2. Population.in.thousands..2017.
## 1 Afghanistan SouthernAsia             652864                          35530
## ...and so on (output truncated)

TIP: Use ? to get help on a function, e.g. ?class

Data structures

Class of a column ($ sign lets you call a column from the df)

class(country.profile$country) 
## [1] "factor"       

class(country.profile$Sex.ratio..m.per.100.f..2017.)
## [1] "numeric"
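
Whether a text column comes back as "factor" or "character" depends on how the file was read. Before R 4.0.0, read.csv() converted strings to factors by default; from 4.0.0 on, the default is stringsAsFactors = FALSE. A small sketch with invented inline data shows both cases side by side.

```r
# Inline toy data (invented values), read twice with different settings.
txt <- "country,ratio\nA,101.5\nB,98.2"

as_factor <- read.csv(text = txt, stringsAsFactors = TRUE)
as_char   <- read.csv(text = txt, stringsAsFactors = FALSE)

class(as_factor$country)   # "factor"
class(as_char$country)     # "character"
class(as_char$ratio)       # "numeric"
```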

Class of Objects (Source: adapted from Evelyne Brie)

Getting a better idea of the values in the dataset

summary(country.profile)

Similarly, you can simply get the summary of one column

summary(country.profile$country) 
summary(country.profile$Sex.ratio..m.per.100.f..2017.) 
#  Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
# -99.0    96.4    99.0   100.2   101.7   301.2

If you only call country.profile$country without any function, it will display the whole column. Give it a try.

Strangely, the minimum sex ratio is -99. A sex ratio cannot be negative, so this value is most likely a placeholder for missing data. We don’t want it in our df, so let’s try to get rid of it.
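
To see why such a value matters, here is a sketch on an invented vector in which -99 plays the same role. A single placeholder value drags both the minimum and the mean far away from the real data.

```r
# Invented values; -99 stands in for a missing-data code.
ratio <- c(96, 99, 101, -99, 104)

summary(ratio)
min(ratio)                  # -99
mean(ratio)                 # 60.2, pulled down by the placeholder
mean(ratio[ratio != -99])   # 100, using only the real values
```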

Identify which rows have this minimum value

which(country.profile$Sex.ratio..m.per.100.f..2017. == -99)
# This asks: which rows have the value -99
# in the sex ratio column?

# Displaying the result

# [1] 26 170

# The result has two entries, so we also know that
# two countries have a sex ratio of -99.
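
The same idea on a short invented vector, as a sketch: which() returns the positions where the condition is TRUE, and either length() of that result or sum() of the condition gives the count.

```r
# Invented values with two placeholder entries.
ratio <- c(96, -99, 101, -99, 104)

hits <- which(ratio == -99)   # positions of the matching elements
hits                          # 2 4
length(hits)                  # 2
sum(ratio == -99)             # same count, without storing the positions
```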

Indexing

Before going ahead, let’s remind ourselves of indexing, i.e. the square brackets. The row goes before the comma and the column after it.

country.profile[22,2] - get the data from a single cell (row 22, column 2)

country.profile[22,] - if you want all columns from a specific row

Try: country.profile[,3] - what do you get?
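
A sketch of the bracket rules on a small invented data frame (toy and its values are made up for illustration):

```r
# Toy data frame (invented values).
toy <- data.frame(country = c("A", "B", "C"),
                  Region  = c("North", "South", "North"),
                  Pop     = c(10, 20, 30))

toy[2, 3]        # single cell: row 2, column 3
toy[2, ]         # all columns of row 2 (still a data frame)
toy[, 3]         # all rows of column 3 (a plain vector)
toy[, "Pop"]     # the same column, selected by name
toy$Pop          # the same column again, using $
```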

Create a new indicator variable called “SR_updated”

country.profile$SR_updated <- NA # Creating an empty variable

country.profile$SR_updated[country.profile$
      Sex.ratio..m.per.100.f..2017. == -99] <- 0
# All countries with SR == -99 are assigned a value of 0
country.profile$SR_updated[country.profile$
         Sex.ratio..m.per.100.f..2017. > -99] <- 1
# All countries with SR > -99 are assigned a value of 1
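
As an alternative sketch, the same kind of 0/1 indicator can be built in one step with ifelse(). This is not the approach used above, just another way to get the same result; the data frame and values are invented.

```r
# Toy data frame (invented values); -99 marks the row to flag.
toy <- data.frame(country = c("A", "B", "C"),
                  ratio   = c(101.5, -99, 98.2))

# ifelse(test, value_if_true, value_if_false), applied element by element.
toy$SR_updated <- ifelse(toy$ratio == -99, 0, 1)
toy
```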

Creating a new dataset called “country2” using subset()

country2 <- subset(country.profile, SR_updated == 1)

# Alternatively: creating the same "country2" dataset using %in%
# (same output as using subset())
country2 <- country.profile[country.profile$SR_updated %in% 1, ]
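
One reason to prefer subset() or %in% over a bare == inside the brackets is how missing values are handled. The sketch below uses an invented data frame with one NA in the indicator: subset() and %in% drop that row, while == keeps it as a row of NAs.

```r
# Toy data frame (invented values) with one missing indicator.
toy <- data.frame(country    = c("A", "B", "C", "D"),
                  SR_updated = c(1, 0, 1, NA))

by_subset <- subset(toy, SR_updated == 1)       # NA counts as "not selected"
by_in     <- toy[toy$SR_updated %in% 1, ]       # NA %in% 1 is FALSE
by_eq     <- toy[toy$SR_updated == 1, ]         # NA == 1 is NA, giving an all-NA row

nrow(by_subset)   # 2
nrow(by_in)       # 2
nrow(by_eq)       # 3
```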

Now look at the dimensions of the dataset using dim()

dim(country2)
## [1] 227  51

# How many rows were deleted from the original dataset?
dim(country.profile)[1] - dim(country2)[1]
## [1] 2
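
A quick sanity check, sketched on invented data: the number of rows dropped by the subset should equal the number of rows that met the removal condition.

```r
# Toy data frame (invented values) with two placeholder rows.
original <- data.frame(id    = 1:5,
                       ratio = c(96, -99, 101, -99, 104))
cleaned  <- subset(original, ratio != -99)

nrow(original) - nrow(cleaned)   # rows dropped: 2
sum(original$ratio == -99)       # rows that should have been dropped: 2
```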

Saving our data

Saving our New Dataset as an .RData File Using save()

Let’s save our new “country2” dataset as an .RData file, a format designed for use with R.

save(country2, file="country2.RData")

Now go to your working directory and check that the file has been saved there.
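
To read the file back in later, use load(). The sketch below does a full save and load round trip on an invented data frame, writing to a temporary folder so that nothing in your working directory is touched. The file name toy.RData is hypothetical.

```r
# Toy data frame (invented values).
toy <- data.frame(country = c("A", "B"),
                  ratio   = c(101.5, 98.2))

rdata_path <- file.path(tempdir(), "toy.RData")   # hypothetical file name
save(toy, file = rdata_path)
file.exists(rdata_path)            # TRUE once the file is written

rm(toy)                            # remove the object from the workspace
loaded_names <- load(rdata_path)   # load() restores objects under their saved names
loaded_names                       # "toy"
toy                                # the data frame is back
```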

Exercise:

Using the country2 dataset you created, identify the class of the column called Region

Relevant function: class()

Get the names of Regions that the countries in the df belong to.

Relevant function: table()

Looking at the table you just created, identify the region that contains the most countries.
Now create a subset called w.asia containing the countries that lie in Western Asia

Relevant function: subset()

Find the maximum sex ratio in this new dataset

Relevant function: max() or summary()
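
If you are unsure how these functions behave, the sketch below runs them on an unrelated invented data frame. The column names item, group and score are hypothetical and deliberately different from the exercise data.

```r
# Toy data frame (invented values).
toy <- data.frame(item  = c("a", "b", "c", "d", "e"),
                  group = c("x", "y", "x", "z", "x"),
                  score = c(3, 8, 5, 1, 9))

class(toy$score)                    # "numeric"

counts <- table(toy$group)          # how many rows fall in each group
counts
names(counts)[which.max(counts)]    # "x", the most frequent group

x_only <- subset(toy, group == "x") # keep only the rows in group x
max(x_only$score)                   # 9
```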
