Final portfolio assignment: Atypical chemokine receptor 1

Introduction

This code compiles summary information about the gene ACKR1 (Atypical chemokine receptor 1). The protein encoded by the ACKR1 gene is a glycosylated membrane protein and a non-specific receptor for several chemokines. The protein is the receptor for the human malarial parasites Plasmodium vivax and Plasmodium knowlesi. Polymorphisms in this gene are the basis of the Duffy blood group system

It also generates alignments and a phylogeneitc tree to indicating the evolutionary relationship betweeen the human version of the gene and its homologs in other species.

Resources / References

Key information use to make this script can be found here: * Refseq Gene: https://www.ncbi.nlm.nih.gov/gene/2532 * Refseq Homologene: https://www.ncbi.nlm.nih.gov/homologene/48067

Other resources consulted includes

Neanderthal genome: http://neandertal.ensemblgenomes.org/index.html

Other interesting resources and online tools include:

REPPER: https://toolkit.tuebingen.mpg.de/jobs/7803820
Sub-cellular locations prediction: https://wolfpsort.hgc.jp/

Preparation

# github packages
library(compbio4all)
library(ggmsa)

## Registered S3 methods overwritten by 'ggalt':
##   method                  from   
##   grid.draw.absoluteGrob  ggplot2
##   grobHeight.absoluteGrob ggplot2
##   grobWidth.absoluteGrob  ggplot2
##   grobX.absoluteGrob      ggplot2
##   grobY.absoluteGrob      ggplot2

# CRAN packages
library(rentrez)
library(seqinr)
library(ape)

## 
## Attaching package: 'ape'

## The following objects are masked from 'package:seqinr':
## 
##     as.alignment, consensus

library(pander)
library(ggplot2) 

# Bioconductor packages
library(msa)

## Loading required package: Biostrings

## Loading required package: BiocGenerics

## Loading required package: parallel

## 
## Attaching package: 'BiocGenerics'

## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB

## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs

## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, basename, cbind, colnames,
##     dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
##     grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
##     order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
##     rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
##     union, unique, unsplit, which.max, which.min

## Loading required package: S4Vectors

## Loading required package: stats4

## 
## Attaching package: 'S4Vectors'

## The following objects are masked from 'package:base':
## 
##     expand.grid, I, unname

## Loading required package: IRanges

## Loading required package: XVector

## Loading required package: GenomeInfoDb

## 
## Attaching package: 'Biostrings'

## The following object is masked from 'package:ape':
## 
##     complement

## The following object is masked from 'package:seqinr':
## 
##     translate

## The following object is masked from 'package:base':
## 
##     strsplit

library(drawProteins)
## Biostrings
library(Biostrings)
library(HGNChelper)

Accession numbers

Accession numbers were obtained from RefSeq, Refseq Homlogene, UniProt and PDB. UniProt accession numbers can be found by searching for the gene name. PDB accessions can be found by searching with a UniProt accession or a gene name, though many proteins are not in PDB. The the Neanderthal genome database was searched and did yield sequence information.

A protein BLAST search (https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastp&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome) was carried out excluding vertebrates to determine if it occurred outside of vertebreates. The gene does not appear in non-vertebrates and so a second search was conducted to exclude mammals.

Accession Table

ACKR1_table<-c("NP_002027.2",   "Q16570","4NUU","Homo sapiens" ,    "Human",      "ACKR1",
              "XP_016785546.1","A0A2J8K5T9",    "NA", "Pan troglodytes" , "Chimpanzee","ACKR1",
              "NP_034175.2",   "Q9QUI6","NA","Mus musculus",      "Mouse"    ,"ACKR1",
              "XP_002728079.3","NA","NA","Rattus norvegicus", "Rat",      "ACKR1",
              "NP_477500.1","O16868","NA","Drosophila melanogaster","Fruit fly",     "fy",
              "NP_001015634.1","Q9GLX0","NA","Bos taurus",       "Cattle",     "ACKR1", 
              "XP_001117200.1","NA",    "NA", "Macaca mulatta" , "Rheus monkey","ACKR1", 
              "XP_005641007.1","NA",    "NA", "Canis lupus" , "Dog","ACKR1", 
              "XP_003821027.1","A0A2R9AUQ0",    "NA", "Pan paniscus" , "Bonobo","ACKR1", 
              "ENSP00000352341","Q16570",    "4NUU", "Homo neanderthalensis" , "Neaderthal","ACKR1")
              
RefSeq<-c("NP_002027.2","XP_016785546.1", "NP_034175.2", "XP_002728079.3","NP_477500.1", "NP_001015634.1", "XP_001117200.1", "XP_005641007.1", "XP_003821027.1", "ENSP00000352341" )
UniProt<- c("Q16570","A0A2J8K5T9", "Q9QUI6", "NA", "O16868", "Q9GLX0", "NA", "NA","A0A2R9AUQ0", "Q16570")
PDB<-c("4NUU", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "4NUU")
Scientific_Name<-c("Homo sapiens","Pan troglodytes", "Mus musculus", "Rattus norvegicus", "Drosophila melanogaster", "Bos taurus","Macaca mulatta", "Canis lupus",  "Pan paniscus", "Homo neanderthalensis")
Common_Name<-c("Human","Chimpanzee", "Mouse", "Rat", "Fruit fly","Cattle", "Rheus monkey", "Dog", "Bonobo", "Neaderthal")
Gene_Name<- c("ACKR1", "ACKR1", "ACKR1", "ACKR1", "fy", "ACKR1", "ACKR1", "ACKR1", "ACKR1", "ACKR1" )
ACKR1.df<- data.frame(RefSeq=RefSeq,
                      UniProt=UniProt,
                      PDB=PDB,
                      Scientific_Name=Scientific_Name,
                      Common_Name=Common_Name,
                      Gene_Name=Gene_Name)
pander::pander(ACKR1.df)

Table continues below
RefSeq	UniProt	PDB	Scientific_Name	Common_Name
NP_002027.2	Q16570	4NUU	Homo sapiens	Human
XP_016785546.1	A0A2J8K5T9	NA	Pan troglodytes	Chimpanzee
NP_034175.2	Q9QUI6	NA	Mus musculus	Mouse
XP_002728079.3	NA	NA	Rattus norvegicus	Rat
NP_477500.1	O16868	NA	Drosophila melanogaster	Fruit fly
NP_001015634.1	Q9GLX0	NA	Bos taurus	Cattle
XP_001117200.1	NA	NA	Macaca mulatta	Rheus monkey
XP_005641007.1	NA	NA	Canis lupus	Dog
XP_003821027.1	A0A2R9AUQ0	NA	Pan paniscus	Bonobo
ENSP00000352341	Q16570	4NUU	Homo neanderthalensis	Neaderthal

Gene_Name
ACKR1
ACKR1
ACKR1
ACKR1
fy
ACKR1
ACKR1
ACKR1
ACKR1
ACKR1

Data prepartation

Download sequences

All sequences were downloaded using a wrapper compbio4all::entrez_fetch_list() which uses rentrez::entrez_fetch() to access NCBI databases.

ACKR1_human<- rentrez::entrez_fetch(id = "NP_002027.2",
                                      db = "protein", 
                                      rettype="fasta")
ACKR1_chimpanzee<- rentrez::entrez_fetch(id = "XP_016785546.1",
                                      db = "protein", 
                                      rettype="fasta")
ACKR1_mouse<- rentrez::entrez_fetch(id = "NP_034175.2",
                                      db = "protein", 
                                      rettype="fasta") 
ACKR1_rat<- rentrez::entrez_fetch(id = "XP_002728079.3",
                                      db = "protein", 
                                      rettype="fasta")
ACKR1_fruitfly<- rentrez::entrez_fetch(id = "NP_477500.1",
                                      db = "protein", 
                                      rettype="fasta")   
ACKR1_cattle<- rentrez::entrez_fetch(id = "NP_001015634.1",
                                      db = "protein", 
                                      rettype="fasta")
ACKR1_rheusmonkey<- rentrez::entrez_fetch(id = "XP_001117200.1",
                                      db = "protein", 
                                      rettype="fasta")
ACKR1_dog<- rentrez::entrez_fetch(id = "XP_005641007.1",
                                      db = "protein", 
                                      rettype="fasta") 
ACKR1_bonobo<- rentrez::entrez_fetch(id = "XP_003821027.1",
                                      db = "protein", 
                                      rettype="fasta") 
#ACKR1_neanderthal<- rentrez::entrez_fetch(id = "ENSP00000352341",   #Won't work 
                                      #db = "protein", 
                                      #rettype="fasta")

Clean and set up sequence as vector.

ACKR1_human_vector<- fasta_cleaner(ACKR1_human)
ACKR1_chimpanzee_vector<- fasta_cleaner(ACKR1_chimpanzee)
ACKR1_mouse_vector<- fasta_cleaner(ACKR1_mouse)
ACKR1_rat_vector<- fasta_cleaner(ACKR1_rat)
ACKR1_fruitfly_vector<- fasta_cleaner(ACKR1_fruitfly)
ACKR1_cattle_vector<- fasta_cleaner(ACKR1_cattle)
ACKR1_rheusmonkey_vector<- fasta_cleaner(ACKR1_rheusmonkey)
ACKR1_dog_vector<- fasta_cleaner(ACKR1_dog)
ACKR1_bonobo_vector<- fasta_cleaner(ACKR1_bonobo)
#ACKR1_neanderthal_vector<- fasta_cleaner(ACKR1_neanderthal)

Confirm set up.

str(ACKR1_human_vector)

##  chr [1:336] "M" "G" "N" "C" "L" "H" "R" "A" "E" "L" "S" "P" "S" "T" "E" ...

str(ACKR1_chimpanzee_vector)

##  chr [1:336] "M" "G" "N" "C" "L" "H" "R" "A" "E" "L" "S" "P" "S" "T" "E" ...

str(ACKR1_mouse_vector)

##  chr [1:334] "M" "G" "N" "C" "L" "Y" "P" "V" "E" "N" "L" "S" "L" "D" "K" ...

str(ACKR1_rat_vector)

##  chr [1:331] "M" "G" "N" "C" "L" "H" "P" "V" "E" "R" "F" "S" "V" "D" "N" ...

str(ACKR1_fruitfly_vector)

##  chr [1:416] "M" "S" "I" "Y" "L" "L" "C" "L" "T" "T" "N" "G" "G" "L" "P" ...

str(ACKR1_cattle_vector)

##  chr [1:330] "M" "G" "N" "C" "L" "Y" "P" "V" "A" "D" "D" "N" "S" "T" "K" ...

str(ACKR1_rheusmonkey_vector)

##  chr [1:335] "M" "G" "N" "C" "L" "H" "P" "A" "E" "L" "S" "P" "S" "T" "Q" ...

str(ACKR1_dog_vector)

##  chr [1:433] "M" "G" "N" "C" "L" "Q" "Q" "Q" "A" "A" "D" "E" "A" "S" "T" ...

str(ACKR1_bonobo_vector)

##  chr [1:336] "M" "G" "N" "C" "L" "H" "R" "A" "E" "L" "S" "P" "S" "T" "E" ...

# str(ACKR1_neanderthal_vector)

General Protein information

Protein diagram

Building a protein diagram

Q16570_json  <- drawProteins::get_features("Q16570 ")

## [1] "Download has worked"

is(Q16570_json)

## [1] "list"             "vector"           "list_OR_List"     "vector_OR_Vector"
## [5] "vector_OR_factor"

Then the raw data from the webpage is converted to a dataframe

my_prot_df <- drawProteins::feature_to_dataframe(Q16570_json)
is(my_prot_df)

## [1] "data.frame"       "list"             "oldClass"         "vector"          
## [5] "list_OR_List"     "vector_OR_Vector" "vector_OR_factor"

my_prot_df[,-2]

##                     type begin end length accession   entryName taxid order
## featuresTemp       CHAIN     1 336    335    Q16570 ACKR1_HUMAN  9606     1
## featuresTemp.1  TOPO_DOM     1  63     62    Q16570 ACKR1_HUMAN  9606     1
## featuresTemp.2  TRANSMEM    64  84     20    Q16570 ACKR1_HUMAN  9606     1
## featuresTemp.3  TOPO_DOM    85  95     10    Q16570 ACKR1_HUMAN  9606     1
## featuresTemp.4  TRANSMEM    96 116     20    Q16570 ACKR1_HUMAN  9606     1
## featuresTemp.5  TOPO_DOM   117 129     12    Q16570 ACKR1_HUMAN  9606     1
## featuresTemp.6  TRANSMEM   130 153     23    Q16570 ACKR1_HUMAN  9606     1
## featuresTemp.7  TOPO_DOM   154 166     12    Q16570 ACKR1_HUMAN  9606     1
## featuresTemp.8  TRANSMEM   167 187     20    Q16570 ACKR1_HUMAN  9606     1
## featuresTemp.9  TOPO_DOM   188 207     19    Q16570 ACKR1_HUMAN  9606     1
## featuresTemp.10 TRANSMEM   208 228     20    Q16570 ACKR1_HUMAN  9606     1
## featuresTemp.11 TOPO_DOM   229 244     15    Q16570 ACKR1_HUMAN  9606     1
## featuresTemp.12 TRANSMEM   245 265     20    Q16570 ACKR1_HUMAN  9606     1
## featuresTemp.13 TOPO_DOM   266 287     21    Q16570 ACKR1_HUMAN  9606     1
## featuresTemp.14 TRANSMEM   288 308     20    Q16570 ACKR1_HUMAN  9606     1
## featuresTemp.15 TOPO_DOM   309 336     27    Q16570 ACKR1_HUMAN  9606     1
## featuresTemp.16 CARBOHYD    16  16      0    Q16570 ACKR1_HUMAN  9606     1
## featuresTemp.17 CARBOHYD    33  33      0    Q16570 ACKR1_HUMAN  9606     1
## featuresTemp.18 DISULFID    51 276    225    Q16570 ACKR1_HUMAN  9606     1
## featuresTemp.19 DISULFID   129 195     66    Q16570 ACKR1_HUMAN  9606     1
## featuresTemp.20  VAR_SEQ     1   7      6    Q16570 ACKR1_HUMAN  9606     1
## featuresTemp.21  VARIANT    42  42      0    Q16570 ACKR1_HUMAN  9606     1
## featuresTemp.22  VARIANT    89  89      0    Q16570 ACKR1_HUMAN  9606     1
## featuresTemp.23  VARIANT   100 100      0    Q16570 ACKR1_HUMAN  9606     1
## featuresTemp.24  VARIANT   203 203      0    Q16570 ACKR1_HUMAN  9606     1
## featuresTemp.25  VARIANT   326 326      0    Q16570 ACKR1_HUMAN  9606     1
## featuresTemp.26    HELIX    22  28      6    Q16570 ACKR1_HUMAN  9606     1

Domains Present

Protein has no domains, motif, but has helical folds

my_canvas <- draw_canvas(my_prot_df)
my_canvas <- draw_chains(my_canvas, my_prot_df, label_size = 2.5)

my_canvas <- draw_regions(my_canvas, my_prot_df)
my_canvas <- draw_motif(my_canvas, my_prot_df)
my_canvas <- draw_phospho(my_canvas, my_prot_df)
my_canvas <- draw_repeat(my_canvas, my_prot_df)
my_canvas <- draw_recept_dom(my_canvas, my_prot_df)
my_canvas <- draw_folding(my_canvas, my_prot_df)
my_canvas

##Draw dotplot ### Prepare data

par(mfrow = c(2,2), 
    mar = c(0,0,2,1))

# plot 1: ACKR1_human - Defaults
dotPlot(ACKR1_human_vector, ACKR1_human_vector, 
        wsize = 1, 
        nmatch = 1, 
        main = "ACKR1_human Defaults")

# plot - size = 10, nmatch = 1
dotPlot(ACKR1_human_vector, ACKR1_human_vector, 
        wsize = 10, 
        nmatch = 1, 
        main = " ACKR1_human- size = 10, nmatch = 1")

# plot 3: ACKR1_human - size = 10, nmatch = 5
dotPlot(ACKR1_human_vector, ACKR1_human_vector, 
        wsize = 10, 
        nmatch = 5, 
        main = "ACKR1_human - size = 10, nmatch = 5")

# plot 4: size = 20, nmatch = 5
dotPlot(ACKR1_human_vector, ACKR1_human_vector, 
        wsize = 10,
        nmatch = 5,
        main = "ACKR1_human - size = 20, nmatch = 5")

# reset par() - run this or other plots will be small!
par(mfrow = c(1,1), 
    mar = c(4,4,4,4))

dotPlot(ACKR1_human_vector, ACKR1_human_vector, 
        wsize = 1, 
        nmatch = 1, 
        main = "ACKR1_human Defaults")

## Protein properties compiled from databases

Pfam<- c("Pfam shows 7 transmembrane domians within the 62-308 regions and 2 low complexity regions at 220-241 and 277-275 respectively.")
UniProt<- c("UniProt sub-cellular locations:Early Endosome and recycling endosome")
Alphafold<- c("Alphafold predicts with high confidence that the proetin encoded by ACKR1 is complosed of mostly aplha helices (>90)")
pp.df<-data.frame (Pfam=Pfam,
                  UniProt=UniProt,
                  Alphafold=Alphafold)
pander::pander(pp.df)

Table continues below
Pfam	UniProt
Pfam shows 7 transmembrane domians within the 62-308 regions and 2 low complexity regions at 220-241 and 277-275 respectively.	UniProt sub-cellular locations:Early Endosome and recycling endosome

Alphafold
Alphafold predicts with high confidence that the proetin encoded by ACKR1 is complosed of mostly aplha helices (>90)

Predict Protein Folf

Chou and Zhang Data

# enter once
aa.1.1 <- c("A","R","N","D","C","Q","E","G","H","I",
            "L","K","M","F","P","S","T","W","Y","V")

# enter again
aa.1.2 <- c("A","R","N","D","C","Q","E","G","H","I",
            "L","K","M","F","P","S","T","W","Y","V")

# are they the same length?
length(aa.1.1) == length(aa.1.2)

## [1] TRUE

aa.1.1 == aa.1.2

##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [16] TRUE TRUE TRUE TRUE TRUE

unique(aa.1.1)

##  [1] "A" "R" "N" "D" "C" "Q" "E" "G" "H" "I" "L" "K" "M" "F" "P" "S" "T" "W" "Y"
## [20] "V"

length(unique(aa.1.1)) == length(unique(aa.1.2))

## [1] TRUE

# alpha proteins
alpha <- c(285, 53, 97, 163, 22, 67, 134, 197, 111, 91, 
           221, 249, 48, 123, 82, 122, 119, 33, 63, 167)

# check against chou's total
sum(alpha) == 2447

## [1] TRUE

# beta proteins
beta <- c(203, 67, 139, 121, 75, 122, 86, 297, 49, 120, 
          177, 115, 16, 85, 127, 341, 253, 44, 110, 229)

# check against chou's total
sum(beta) == 2776

## [1] TRUE

# alpha + beta
a.plus.b <- c(175, 78, 120, 111, 74, 74, 86, 171, 33, 93,
              110, 112, 25, 52, 71, 126, 117, 30, 108, 123)
sum(a.plus.b) == 1889

## [1] TRUE

# alpha/beta
a.div.b <- c(361, 146, 183, 244, 63, 114, 257, 377, 107, 239, 
             339, 321, 91, 158, 188, 327, 238, 72, 130, 378)
sum(a.div.b) == 4333

## [1] TRUE

# Calculate proportions for each of the four protein fold types

alpha.prop <- alpha/sum(alpha)
beta.prop <- beta/sum(beta)
a.plus.b.prop <- a.plus.b/sum(a.plus.b)
a.div.b<- a.div.b/sum(a.div.b)

#Create a dataframe

#dataframe
aa.prop <- data.frame(alpha.prop,
                      beta.prop,
                      a.plus.b.prop,
                      a.div.b)
#row labels
row.names(aa.prop)<-aa.1.1

pander(data.frame(aa.1.1, alpha, beta, a.plus.b, a.div.b))

aa.1.1	alpha	beta	a.plus.b	a.div.b
A	285	203	175	0.08331
R	53	67	78	0.03369
N	97	139	120	0.04223
D	163	121	111	0.05631
C	22	75	74	0.01454
Q	67	122	74	0.02631
E	134	86	86	0.05931
G	197	297	171	0.08701
H	111	49	33	0.02469
I	91	120	93	0.05516
L	221	177	110	0.07824
K	249	115	112	0.07408
M	48	16	25	0.021
F	123	85	52	0.03646
P	82	127	71	0.04339
S	122	341	126	0.07547
T	119	253	117	0.05493
W	33	44	30	0.01662
Y	63	110	108	0.03
V	167	229	123	0.08724

pander::pander(aa.prop)

	alpha.prop	beta.prop	a.plus.b.prop	a.div.b
A	0.1165	0.07313	0.09264	0.08331
R	0.02166	0.02414	0.04129	0.03369
N	0.03964	0.05007	0.06353	0.04223
D	0.06661	0.04359	0.05876	0.05631
C	0.008991	0.02702	0.03917	0.01454
Q	0.02738	0.04395	0.03917	0.02631
E	0.05476	0.03098	0.04553	0.05931
G	0.08051	0.107	0.09052	0.08701
H	0.04536	0.01765	0.01747	0.02469
I	0.03719	0.04323	0.04923	0.05516
L	0.09031	0.06376	0.05823	0.07824
K	0.1018	0.04143	0.05929	0.07408
M	0.01962	0.005764	0.01323	0.021
F	0.05027	0.03062	0.02753	0.03646
P	0.03351	0.04575	0.03759	0.04339
S	0.04986	0.1228	0.0667	0.07547
T	0.04863	0.09114	0.06194	0.05493
W	0.01349	0.01585	0.01588	0.01662
Y	0.02575	0.03963	0.05717	0.03
V	0.06825	0.08249	0.06511	0.08724

Functions to calculate similarities Two custom functions are needed: one to calculate correlates between two columns of our table, and one to calculate correlation similarities.

}

# Corrleation used in Chou adn Zhange 1992.
chou_cor <- function(x,y){
  numerator <- sum(x*y)
denominator <- sqrt((sum(x^2))*(sum(y^2)))
result <- numerator/denominator
return(result)
}

# Cosine similarity used in Higgs and Attwood (2005). 
chou_cosine <- function(z.1, z.2){
  z.1.abs <- sqrt(sum(z.1^2))
  z.2.abs <- sqrt(sum(z.2^2))
  my.cosine <- sum(z.1*z.2)/(z.1.abs*z.2.abs)
  return(my.cosine)}
  
#cor(chou_cor$result,chou_cosine$my.cosine)

Percent Identity Comparisons (PID)

Multiple sequence alignment

MSA data preparation

For use with R bioinformatics tools we need to convert our named vector to a string set using Biostrings::AAStringSet(). Note the _ss tag at the end of the object we’re assigning the output to, which designates this as a string set.

ACKR1_vector<-c(ACKR1_bonobo_vector, ACKR1_cattle_vector, ACKR1_chimpanzee_vector, ACKR1_dog_vector, ACKR1_fruitfly_vector, ACKR1_human_vector, ACKR1_mouse_vector, ACKR1_rat_vector, ACKR1_rheusmonkey_vector)
ACKR1_vector_ss <- Biostrings::AAStringSet(ACKR1_vector)

Building Multiple Sequence Alignment (MSA)

ACKR1_matrixmatrix <- matrix(ACKR1_vector)

#Cleaning / setting up an MSA
#msa produces a species MSA objects

#ACKR1_align <- msa(ACKR1_vector_ss,
                     #method = "ClustalW")
#class(ACKR1_align)