Cheat Sheet for Data Scientist- 10 tasks

Programming Cheat Sheets

This cheat sheet is intended to help with data analysis in the R and Python programming languages. It covers the analysis workflow from the very start, i.e. loading a data file, through to intermediate tasks such as drawing plots and graphs. Equivalent code is given in both languages for each task.

Task 1- Loading data files

The very first task of a data scientist is to load data files. The loaded data should sit in a dataframe variable so that further computations are easier. Here are code examples that can be used to load simple file types.

Python Code

import pandas as pd

#to load csv file
data = pd.read_csv("sydneybeaches.csv")

#to load an xls file
data= pd.read_excel("filename.xls")
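
To try read_csv without a file on disk, the text can be wrapped in a file-like object. This is a minimal sketch: the sample rows and raw column headers below are made up for illustration and are not taken from sydneybeaches.csv.

```python
import io

import pandas as pd

# Made-up sample standing in for the contents of a CSV file on disk.
csv_text = """BeachId,Region,Site,Enterococci (cfu/100ml)
25,Sydney City Ocean Beaches,Clovelly Beach,19
33,Sydney City Ocean Beaches,Coogee Beach,
"""

# read_csv accepts a path or any file-like object; the empty cell becomes NaN.
data = pd.read_csv(io.StringIO(csv_text))

print(type(data).__name__)  # DataFrame
print(data.shape)           # (2, 4)
```

The result is an ordinary dataframe, so everything in the later tasks applies to it unchanged.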

R Code

#to load csv file
data <- read.csv("sydneybeaches.csv")

#to load xls file
library(readxl)
data <- read_excel("filename.xls")

Task 2- Peeking into the loaded data

Now we have dataframes holding the data from the files. We need to take a peek at the data so that we can decide which actions to take on it. One way of doing so is to leave the IDE and open the files separately. The easier option is to type a few commands that describe the data that has been loaded.

Python Code

import pandas as pd

#to see the statistical description of the numeric columns, e.g. mean, median (50%) and standard deviation
data.describe()

#to peek the first 10 rows; default is 5
data.head(10)

#to peek the last 10 rows; default is 5
data.tail(10)

#to see the data types of the data
data.dtypes
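
The peeking calls can be tried on a small in-memory dataframe. The sketch below uses made-up readings and also shows shape and info(), which are not in the list above but belong to the same family of inspection calls.

```python
import pandas as pd

# Made-up readings; the column names mirror the ones used in this sheet.
data = pd.DataFrame({
    "site": ["Clovelly Beach", "Coogee Beach", "Bondi Beach"],
    "enterococci": [19.0, 3.0, 8.0],
})

print(data.shape)   # (number of rows, number of columns)
print(data.dtypes)  # one dtype per column
data.info()         # dtypes plus non-null counts in a single report

# describe() returns a dataframe, so a single statistic can be read by label.
stats = data.describe()
mean_enterococci = stats.loc["mean", "enterococci"]
print(mean_enterococci)  # 10.0
```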

R Code

#to see the statistical description of data
summary(data)

#to see the first 10 rows
head(data, 10)

#to see the last 10 rows
tail(data, 10)

#to see the data types of the data
sapply(data, class)

Task 3- Renaming Columns for Ease of use

It is very difficult to work with column names that contain special characters, numbers and spaces. So the wisest thing to do is to rename the columns in the dataframe so that they can be accessed easily.

Python Code

#edit column names
data.columns = ['beach_id', 'region', 'council_name', 'site', 'lon', 'lat', 'recorded_date', 'enterococci' ]
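
Assigning to data.columns replaces every name at once and relies on the order of the list matching the order of the columns. When only some columns need new names, a mapping passed to rename may be safer. The raw header names in this sketch are hypothetical and chosen only to illustrate the problem.

```python
import pandas as pd

# Hypothetical raw headers containing capitals, spaces and special characters.
data = pd.DataFrame({
    "BeachId": [25.0],
    "Site": ["Clovelly Beach"],
    "Enterococci (cfu/100ml)": [19.0],
})

# rename() maps old names to new ones and leaves unlisted columns untouched,
# so it does not depend on the position of each column.
data = data.rename(columns={
    "BeachId": "beach_id",
    "Site": "site",
    "Enterococci (cfu/100ml)": "enterococci",
})

print(list(data.columns))  # ['beach_id', 'site', 'enterococci']
```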

R Code

#edit column names
colnames(data) <- c('beach_id', 'region', 'council_name', 'site', 'lon', 'lat', 'recorded_date', 'enterococci' )

Task 4- Manipulating the data

After loading the data, as a data scientist, you always need to manipulate data in some way. These are some basic functions that can be used to manipulate dataframes.

Python Code

#load dfply library for manipulating dataframes
from dfply import *

# Filtering out all zeros
data >> mask(X.enterococci>0)

#creating a new column
data >> mutate(ent_per_litre = X.enterococci/10)

#calculate mean value
data >> summarize(mean = X.enterococci.mean())

#sort rows in descending order
data >> arrange(X.enterococci, ascending = False)

#group by
data >> group_by(X.site)

#select specific columns
data >> select(X.site, X.enterococci)
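
dfply is a third-party package. If it is not installed, the same six operations can be written with plain pandas. The following is a sketch of one possible set of equivalents, run on made-up data; the variable names are illustrative only.

```python
import pandas as pd

# Made-up readings for illustration.
data = pd.DataFrame({
    "site": ["Clovelly Beach", "Coogee Beach", "Coogee Beach", "Bondi Beach"],
    "enterococci": [19.0, 0.0, 8.0, 3.0],
})

# Filtering out all zeros (boolean indexing)
non_zero = data[data["enterococci"] > 0]

# Creating a new column (assign returns a copy with the extra column)
with_rate = data.assign(ent_per_litre=data["enterococci"] / 10)

# Calculating the mean value (missing values are skipped by default)
mean_value = data["enterococci"].mean()

# Sorting rows in descending order
ranked = data.sort_values("enterococci", ascending=False)

# Grouping by site, here followed by a per-group mean
by_site = data.groupby("site")["enterococci"].mean()

# Selecting specific columns
subset = data[["site", "enterococci"]]

print(mean_value)               # 7.5
print(by_site["Coogee Beach"])  # 4.0
```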

R Code

#load dplyr library for manipulating dataframes
library(dplyr)

# Filtering out all zeros
data %>% filter(enterococci>0)

#creating a new column
data %>% mutate(ent_per_litre = enterococci/10)

#calculate mean value
data %>% summarize(mean = mean(enterococci, na.rm = TRUE))

#sort rows in descending order
data %>% arrange(desc(enterococci))

#group by
data %>% group_by(site)

#select specific columns
data %>% select(site, enterococci)

Task 5- Formatting data types

A data scientist must be able to manipulate strings and dates. These operations are useful whenever text arrives in different formats.

Python Code

# string concatenation in a dataframe
data >> mutate(council_address = X.council_name + ', '+ X.site)

# replace words
data >> mutate(new_col= X.region.str.replace("Sydney City Ocean Beaches", "SCOB"))

# change cases upper, lower
data >> mutate(new_col = X.region.str.upper(), newcol2= X.region.str.lower())

# format dates
data['recorded_date'] = pd.to_datetime(data['recorded_date'], format='%d/%m/%Y')
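
The same string and date operations can be sketched in plain pandas, which makes the results easy to check on a single row. The sample values, including the council name, are made up for illustration.

```python
import pandas as pd

# One made-up row using the column names from this sheet.
data = pd.DataFrame({
    "council_name": ["Randwick Council"],
    "site": ["Clovelly Beach"],
    "region": ["Sydney City Ocean Beaches"],
    "recorded_date": ["02/01/2013"],
})

# String concatenation
data["council_address"] = data["council_name"] + ", " + data["site"]

# Replacing words (regex=False treats the pattern as literal text)
data["region_short"] = data["region"].str.replace(
    "Sydney City Ocean Beaches", "SCOB", regex=False
)

# Changing case
data["region_upper"] = data["region"].str.upper()
data["region_lower"] = data["region"].str.lower()

# Formatting dates: %d/%m/%Y reads 02/01/2013 as 2 January 2013
data["recorded_date"] = pd.to_datetime(data["recorded_date"], format="%d/%m/%Y")

print(data.loc[0, "council_address"])  # Randwick Council, Clovelly Beach
print(data.loc[0, "recorded_date"].month)  # 1
```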

R Code

# string concatenation
data %>% mutate(council_address = paste(council_name, site, sep = ', '))

# replace words; gsub takes the pattern, then the replacement, then the text
data %>% mutate(new_col= gsub("Sydney City Ocean Beaches", "SCOB", region))

# change cases upper, lower
data %>% mutate(new_col= toupper(region), newcol2 = tolower(region))

# format dates
data$recorded_date <- as.Date(data$recorded_date, "%d/%m/%Y")

Task 6- Creating functions

Often in programming, one needs to perform a series of tasks repeatedly. There are two approaches: either separate the code into a function that can be called to perform the task, or copy and paste the code and change it to suit each need. Creating functions is the better approach for repetitive tasks. It improves the manageability and readability of the code.

Python Code

# Creating a function with parameter as the date list
def change_to_date(my_date_list):
  return pd.to_datetime(my_date_list, format='%d/%m/%Y')


#calling the function to change the date
data['recorded_date'] = change_to_date(data['recorded_date'])
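
A function becomes more reusable when the parts that vary are turned into parameters. The sketch below is a hypothetical extension of change_to_date: the date_format parameter and the errors="coerce" setting are additions made for illustration and are not part of the function above.

```python
import pandas as pd


def change_to_date(my_date_list, date_format="%d/%m/%Y"):
    """Parse date strings; values that do not match the format become NaT."""
    return pd.to_datetime(my_date_list, format=date_format, errors="coerce")


# Default format: day/month/year
dates = change_to_date(pd.Series(["28/09/2019", "not a date"]))

# The same function reused for a different format
iso_dates = change_to_date(pd.Series(["2019-09-28"]), date_format="%Y-%m-%d")

print(dates[0] == iso_dates[0])  # True
print(pd.isna(dates[1]))         # True
```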

R Code

# Creating a function with parameter as the date list
change_to_date <- function(my_date_list) {
  return(as.Date(my_date_list, "%d/%m/%Y"))
}

#calling the function to change the date
data$recorded_date <- change_to_date(data$recorded_date)

Task 7- Drawing Plots

Data scientists should be able to create visualizations of their data and build observations based on them. Hence, creating plots is one of the most important tools for a data scientist. In Python, the matplotlib library is commonly used for drawing plots, whereas ggplot2 is a popular choice in R.

Python Code

#load library
import matplotlib.pyplot as plt

# Create a plot with x and y axes from dataframe
data.plot(x= 'recorded_date', y='enterococci')

#show the plot
plt.show()

Output: a line plot of enterococci against recorded_date.

R Code

# Load the ggplot2 library for creating plots
library(ggplot2)

# Create a plot with x and y axes from dataframe
ggplot(data, aes(recorded_date, enterococci))+ geom_line()

Output: a line plot of enterococci against recorded_date.

Task 8- Joining dataframes

Joining dataframes is an important task for data scientists, who rarely work with only one dataset. Numerous datasets often need to be combined in order to study the relationships between them. Hence, joining different datasets is a very common operation for a data scientist.

Python Code

#segregating the dataframe; 

# 1) selecting the distinct beach details from the dataset data
## select the columns first, then drop the duplicate rows
beach_info = data >> select(X.beach_id, X.region, X.council_name, X.site, X.lon, X.lat) >> distinct()

# 2) selecting remaining columns
bacteria_info = data >> select(X.beach_id, X.recorded_date, X.enterococci)

# Join Data Frames beach_info and bacteria_info
## The value of how can be changed to left, right, outer depending on the need
pd.merge(beach_info, bacteria_info, on = "beach_id", how = 'inner')
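
The effect of the how argument is easiest to see by counting rows. The sketch below uses two tiny made-up dataframes in which beach 3 has no readings and the readings for beach 4 have no matching beach record.

```python
import pandas as pd

# Made-up data: beach_id 3 has no readings, beach_id 4 has no beach record.
beach_info = pd.DataFrame({
    "beach_id": [1, 2, 3],
    "site": ["Clovelly Beach", "Coogee Beach", "Bondi Beach"],
})
bacteria_info = pd.DataFrame({
    "beach_id": [1, 1, 2, 4],
    "enterococci": [19.0, 3.0, 8.0, 5.0],
})

# Run the same join with each value of how and record the resulting row count.
row_counts = {}
for how in ["inner", "left", "right", "outer"]:
    joined = pd.merge(beach_info, bacteria_info, on="beach_id", how=how)
    row_counts[how] = len(joined)

print(row_counts)  # {'inner': 3, 'left': 4, 'right': 4, 'outer': 5}
```

The inner join keeps only the beach_id values present on both sides, while left, right and outer also keep the unmatched rows of one or both dataframes and fill the missing columns with NaN.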

R Code

#segregating the dataframe; 

# 1) selecting distinct from the dataset data
beach_info <- data %>% distinct(beach_id, region, council_name, site, lon, lat)

# 2) selecting remaining columns
bacteria_info <- data %>% select (beach_id, recorded_date, enterococci)

# Join Data Frames beach_info and bacteria_info (inner join)
## An additional parameter can be sent to the merge function to specify join type as:
## all.x= TRUE for left outer join
## all.y= TRUE for right outer join
## all= TRUE for full outer join
merge(x = beach_info, y = bacteria_info, by.x = "beach_id", by.y = "beach_id")

Task 9- Aggregating data in Dataframes

It is often necessary to aggregate a dataframe based on the values of one or more columns. The aggregated data can be used for further calculations and to generate visualizations. Functions such as min, max and mean can be used in the same manner as sum.

Python Code

# Summing enterococci by site
data.groupby(['site']).agg({'enterococci': "sum"})
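
Several aggregation functions can be applied in one call by passing a list of function names to agg. This is a minimal sketch on made-up readings.

```python
import pandas as pd

# Made-up readings for illustration.
data = pd.DataFrame({
    "site": ["Coogee Beach", "Coogee Beach", "Bondi Beach"],
    "enterococci": [2.0, 8.0, 3.0],
})

# One row per site, one column per aggregation function.
summary = data.groupby("site")["enterococci"].agg(["min", "max", "mean", "sum"])

print(summary.loc["Coogee Beach", "mean"])  # 5.0
print(summary.loc["Coogee Beach", "sum"])   # 10.0
```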

R Code

# Summing enterococci by site
aggregate(data$enterococci, by=list(Site=data$site), FUN=sum, na.rm = TRUE)

Task 10- Help

Finally, all data scientists need help, and it's better if they can get it with just a few keystrokes. The built-in documentation is a reliable source of information that is readily available from the console.

Python Code

# Opening help for the read_csv function in Python
help("pandas.read_csv")

R Code

# Opening help for the read.csv function in R
?read.csv


Equivalent codes in R and Python

Abhisek Gautam

28/09/2019
