Exercise
============

.. role:: underline
     :class: underline

.. role:: emphasis
     :class: emphasis


.. rubric:: Introduction to BLAST using human leptin
     :class: header1

:Authors: Justin R. DiAngelo, Alexis Nagengast, Wilson Leung
:Last Update: |today|
:Version: |version|


What is BLAST?
----------------------------------------

:term:`BLAST` stands for :emphasis:`B`\ asic :emphasis:`L`\ ocal
:emphasis:`A`\ lignment :emphasis:`S`\ earch :emphasis:`T`\ ool
and it is a program that reports regions of similarity (at
the nucleotide or protein level) between a query (your input) sequence
and sequences within a database. BLAST uses a robust statistical
framework that determines if the alignment between two sequences is
statistically significant (i.e. has a low probability of the reported
alignment being produced by chance alone). The ability to detect
sequence similarity allows scientists to determine if a :term:`gene` or a
protein is related to other known genes or proteins in the same species
or between species.

The theory of evolution is based on all organisms descending from common
ancestors by speciation. At the molecular level, an ancestral DNA
sequence diverges over time (through accumulation of point
:term:`mutations <Mutation>`, duplications, deletions, transpositions,
recombination events, etc.) to produce diverse sequences in the
genomes of living organisms. Such sequences are classified as
:term:`homologs <Homologs>` if they come from the same ancestral gene.

Mutations in genes with an important biological function have a higher
probability of being harmful to the organism and are less likely to
become fixed in a population. Such sequences are said to be under
*negative* selection, which causes them to be *conserved* against change
over time. Therefore, it is expected that two homologous copies of a
functional sequence will show a higher degree of sequence conservation
(observed as base-by-base similarity at the nucleotide level) than
either two unrelated sequences or two sequences that are not under
strong negative selection. This similarity is the "signal" detected by a
BLAST search.


Obtaining sequence using NCBI
----------------------------------------

.. admonition:: Learning Objectives
   :class: admonition-learning

   - Obtain a desired DNA or Protein sequence from the NCBI public
     database

The National Center for Biotechnology Information (:term:`NCBI`) is a
public database that houses molecular biology information including
sequences from thousands of different species from mammals to fungi. We
will explore some of the basic functionalities of the NCBI web site
using leptin (*LEP*) — a gene that has been found to contain mutations
associated with severe obesity and the development of type 2 diabetes.
First, open a web browser and navigate to the NCBI web site at
`https://www.ncbi.nlm.nih.gov <https://www.ncbi.nlm.nih.gov/>`__. To get
information on obesity, click on the `Genetics and
Medicine <https://www.ncbi.nlm.nih.gov/guide/genetics-medicine/>`__ link
on the left (:numref:`Figure %s <figure_1>`),
scroll about 1/3 of the way down the page and click on the
`Genes and Disease <https://www.ncbi.nlm.nih.gov/books/NBK22183/>`__
link. Scroll down the page and click on the "\ `Nutritional and
Metabolic Diseases <https://www.ncbi.nlm.nih.gov/books/NBK22259/>`__\ "
link, then click on the
"\ `Obesity <https://www.ncbi.nlm.nih.gov/books/NBK22251/>`__\ " link to
find a non-technical description of the hormone leptin and its role in
weight control. On the right panel, you will find links to other parts
of NCBI that contain more information about this disease. For example,
you can access the corresponding gene record in the OMIM database (a
catalog of human genes and disorders) through the "OMIM" link.


.. figure:: /_static/images/exercise/Figure1.png
   :alt: NCBI Home Page
   :name: figure_1

   Click on the "Genetics & Medicine" link to access the
   "Genes and Disease" database.


.. admonition:: Question 1
   :class: admonition-question

   Based on the information on this page, how does *leptin* control
   feeding?


To obtain the sequence for the human *LEP* gene, go back to the NCBI
homepage at https://www.ncbi.nlm.nih.gov/. At the top of the page, use
the pull-down menu next to "Search" and select ``Gene``. Enter ``LEP homo
sapiens`` into the text box and click Search
(:numref:`Figure %s <figure_2>`).

.. note::
   This search menu is also available at the top of the NCBI
   Bookshelf page. Hence you could scroll to the top of the page, click on
   the drop-down box to select ``Gene`` and then search for ``LEP homo
   sapiens`` directly. You can also access the Entrez gene record through
   the "Entrez Gene" link under the "Gene sequence" section on the right
   panel.

.. figure:: /_static/images/exercise/Figure2.png
   :alt: Search for LEP in Homo Sapiens
   :name: figure_2

   Search for "LEP homo sapiens" in the NCBI Gene database


Scroll down to the "Search results" section. This search produces 54
results, with the first entry being the gene record for human *leptin*.
The remaining results may mention *leptin* in their detailed summary.
Even before we click on the *leptin* entry, we can already obtain some
useful information about the *leptin* gene from the search results page.
For example, the chromosomal location and OMIM entry number for *leptin*
are shown.

.. admonition:: Question 2
   :class: admonition-question

   According to these search results, on which chromosome is the
   *leptin* gene located?


Click on the first match to the human *LEP* gene to learn more about
this gene and its sequence. The Entrez Gene record for *LEP* shows lots
of detailed information about gene structure and function.

The first section of the Entrez Gene record is the Summary section and
it contains:

- The official symbol and name approved by HUGO Gene Nomenclature
  Committee (HGNC)

- Other synonyms that have been used to describe this gene

- The organism the gene is from and the lineage of that organism

- Links to external databases (*e.g.*, HGNC and Ensembl) that contain
  similar information

- Link to the NCBI's Online Mendelian Inheritance in Man (MIM) that
  provides comprehensive information aimed for the medical and
  scientific research community

- :term:`RefSeq` status. The NCBI Reference Sequence :term:`Database`
  (RefSeq) is a comprehensive, curated database of non-redundant
  sequence records. The accession number for a RefSeq record begins
  with two characters, follow by an underscore. This prefix denotes
  the type of sequence record:

  - chromosomes: NC\_

  - genomic regions: NG\_

  - mRNA: NM\_

  - proteins: NP\_

The "Genomic regions, transcripts, and products" section has a map of
the gene with links to the sequence. The sequence record can be
retrieved in many different formats, including :term:`FASTA`
(a simple text format which contains only the sequence — easiest for
BLAST searching) or GenBank (more descriptive version with the
sequence listed at the end). The *leptin* gene is on the
:term:`positive strand <Plus (+) Strand>` of the chromosome (meaning
that if the chromosome was oriented from left to right, your gene
would be on the strand running from 5' to 3'). Specifically, the
*leptin* gene can be found at 128,241,201-128,257,629 of human
chromosome 7 in the current assembly (GRCh38.p13).


.. admonition:: Question 3
   :class: admonition-question

   How do you know that the *leptin* gene is on the
   + strand of the chromosome it's located on?


.. admonition:: Question 4
   :class: admonition-question

   The picture depicts two annotated transcripts (XM_005250340.5 and
   NM_000230.3). What is the difference between the RefSeq record that
   begins with the XM\_ prefix and the record that begins with the NM\_
   prefix? Both mRNA records have 4 green boxes, 2 light green in color (1 on each
   end) and 2 dark green. What do you think each of those boxes refers to?

.. tip::

   - Examine the "COMMENT" section of the GenBank record or the `RefSeq FAQ page
     <https://www.ncbi.nlm.nih.gov/books/NBK50679/#RefSeqFAQ.what_is_the_difference_between>`__
     to determine the difference between the "NM\_" and "XM\_" RefSeq records.

   - Click on the green boxes in the diagram to get more information on the *LEP* gene.


Sequences are most conserved between species at the amino acid level and
this is what we will use in our BLAST searches. The protein sequences
for the two annotated transcripts are available under the "NCBI
Reference Sequences (RefSeq)" section. Note that there are multiple
options to obtain the protein sequence for the leptin precursor,
including the RefSeq protein database (NP_000221.1), the Consensus CDS
(CCDS) database (CCDS5800.1), the UniProtKB/Swiss-Prot database (P41159)
and the UniProtKB/TrEMBL database (A4D0Y8).

To do your own BLAST search, scroll down to the "NCBI Reference
Sequences (RefSeq)" section and click on the ``NP_000221.1`` link.

.. admonition:: Question 5
   :class: admonition-question

   Why do you think you want the protein sequence as opposed to
   the nucleotide sequence?


This screen gives you information on the LEP protein sequence. Click on
the FASTA link directly beneath the gene name at the top of the screen.
This gives you the FASTA definition line (as indicated by >) followed by
the one letter amino acid sequence. The definition line varies but it
always begins with the accession number of the sequence record (*e.g.*,
NP_000221.1), followed by the gene name and the organism. Copy the
sequence including the definition line for use in a BLAST search.


Performing a BLAST search
----------------------------------------

.. admonition:: Learning Objectives
   :class: admonition-learning

   -  Perform a BLAST search to identify similar sequences

   -  Interpret results from BLAST output and draw conclusion of
      significance


Open a new tab in your web browser, go back to
https://www.ncbi.nlm.nih.gov/ and click on the ``BLAST`` link in the
"Popular Resources" section on the right. There are many different
options for a BLAST search and the option you choose depends on the
sequence you have and the degree of conservation you are trying to
detect. Start by clicking on the ``Protein BLAST`` image under the "Web
BLAST" section and paste your sequence into the ``Enter Query Sequence``
text box (:numref:`Figure %s <figure_3>`).
Alternatively, you could enter the accession number (i.e.,
NP_000221.1) into the text box. If you think you will be doing many
searches at once, you can enter a Job Title to distinguish one from the
other; if not, the FASTA definition line will be used.

.. figure:: /_static/images/exercise/Figure3.png
   :alt: Configure the NCBI blastp search
   :name: figure_3

   Configure NCBI blastp to search the human leptin protein against the
   nr protein database.

.. admonition:: Archived
   :class: admonition-archived

   For teaching purposes, the result of this blastp search against the
   nr database has been archived and is available at the
   `CyVerse Data Store
   <https://data.cyverse.org/dav-anon/iplant/commons/community_released/gea/intro-blast-human-leptin/ncbi-web-blast-archives/blastp_Human_Leptin_nr.html>`__.


Among the databases available, the non-redundant protein database is
the most comprehensive; it contains sequence records from most of the
other databases listed in the drop-down box:

.. table:: Databases available for NCBI blastp searches
      :widths: 30 70

      +-----------------------------------+-----------------------------------+
      | Database                          | Description                       |
      +===================================+===================================+
      | Non-redundant protein sequences   | non-redundant merge of protein    |
      | (nr)                              | sequences from the NCBI’s GenBank |
      |                                   | coding sequence translations,     |
      |                                   | RefSeq proteins, Swiss-Prot, PDB, |
      |                                   | PIR (Protein Information Resource |
      |                                   | at Georgetown University Medical  |
      |                                   | Center), and PRF (Protein         |
      |                                   | Research Foundation)              |
      +-----------------------------------+-----------------------------------+
      | Reference proteins                | manually curated database from    |
      | (refseq_protein)                  | NCBI                              |
      +-----------------------------------+-----------------------------------+
      | Model Organisms (landmark)        | protein sequences from 27 genomes |
      |                                   | that cover a wide taxonomic range |
      |                                   | (from bacteria to human)          |
      +-----------------------------------+-----------------------------------+
      | UniProtKB/Swiss-Prot (swissprot)  | manually annotated and reviewed   |
      |                                   | section of the UniProtKB database |
      +-----------------------------------+-----------------------------------+
      | Patented protein sequences (pat)  | proteins from the Patent division |
      |                                   | of GenBank                        |
      +-----------------------------------+-----------------------------------+
      | Protein Databank proteins (pdb)   | amino acid sequences derived from |
      |                                   | 3-dimensional structures from the |
      |                                   | Brookhaven Protein Data Bank      |
      +-----------------------------------+-----------------------------------+
      | Metagenomic proteins (env_nr)     | protein sequences from            |
      |                                   | environmental samples             |
      +-----------------------------------+-----------------------------------+
      | Transcriptome Shotgun Assembly    | transcripts assembled from ESTs   |
      | proteins (tsa_nr)                 | and second generation sequencing  |
      |                                   | reads (*e.g.*, Illumina)          |
      +-----------------------------------+-----------------------------------+


Under the "Choose Search Set" section, select the ``non-redundant protein
sequences (nr)`` option from the "Database" drop-down menu. Then click on
the blue BLAST button to start your search. You will be taken to a page
that will update every few seconds until the BLAST search is complete.
The BLAST results page will appear once the search is complete
(:numref:`Figure %s <figure_4>`).

.. figure:: /_static/images/exercise/Figure4.png
   :alt: The blastp search result against the nr database
   :name: figure_4

   The search results page for the blastp search of the human leptin protein
   (query) against the nr database.


The top left panel provides an overview of the BLAST search
parameters, and to download the BLAST results in multiple formats
(*e.g.*, Text, Hit Table). The top right panel can be used to filter
the matches by organism, percent identity, and E-value. The bottom
panel organizes the BLAST matches into four tabs:

Descriptions:
      This tab contains a summary of the sequences in the
      database that produced significant alignments with the query
      sequence. Each row provides an overview of the match, including the
      sequence description, score, query coverage, E-value, percent
      identity, and accession number. Use the checkbox at the left side of
      each row to select or unselect a BLAST match. The other tabs will
      only show the information for the BLAST matches that are selected in
      the "Descriptions" tab.

Graphic Summary:
      This tab shows a graphical overview of the alignments between the
      subject sequences in the database and the query sequence. Each row
      corresponds to a different subject sequence. The color of the
      boxes in each row correspond to the alignment score, where warmer
      colors (*e.g.*, red) denote more significant matches.

Alignments:
      This tab shows the local sequence alignments between the subject
      sequences in the database and the query sequence.

Taxonomy:
      This tab provides an overview of the taxonomic distribution of the
      BLAST matches. Click on the buttons next to the "Reports" label to
      view the taxonomic groupings in three different formats. The
      "Lineage" report groups the matches by their taxonomic lineage
      (*e.g.*, Primates). The "Organism" report groups the matches by
      species. The "Taxonomy" report is similar to the "Lineage" report,
      but you can use the "-" and "+" icons next to the non-leaf nodes
      in the "Taxonomy" column to collapse or expand the node.


Now that we are familiar with the different sections of the BLAST
results page, we will examine the blastp search results for the human
LEP protein sequence (query) against the nr protein database (subjects).
Click on the ``Graphic Summary`` tab to obtain a graphical overview of the
search result. Before performing the blastp search, NCBI BLAST will
search the query sequence against the Conserved Domains Database (CDD).
If the protein sequence contains a conserved domain, a graphical
representation of the alignment between the query sequence and the
conserved domain will appear at the top of the "Graphic Summary" tab.
Click on the Leptin conserved domain match in the graphic to learn more
about this conserved domain. The CDD search results page shows that the
Leptin conserved domain aligns with residues 26–167 of the query
sequence.

Go back to the ``Graphic Summary`` tab of the blastp results page. The
graphical output underneath the conserved domain image shows the
database entries (subjects) that align with your input LEP sequence
(query) on a color scale. Features with highly statistically significant
alignments will appear in red in the graphical output while features
with the lowest statistical significance will appear in black. Move your
mouse so that it is above the first red bar below the scale. A tooltip
will appear which shows the sequence title. Click on the feature to see
a more detailed description of the match (*e.g.*, score, E-value,
accession), as well as a link to navigate directly to the corresponding
alignment (:numref:`Figure %s <figure_5>`).

.. figure:: /_static/images/exercise/Figure5.png
   :alt: Graphic Summary of the blastp search result
   :name: figure_5

   Click on a feature in the graphical overview diagram to obtain more
   information on the blastp match.


Click on the ``Descriptions`` tab to view a table summary of the blastp
matches. The first entry in the "Descriptions" table shows the best
match to the query sequence (leptin precursor [*Homo sapiens*]). The
match has a score (S) of 340 and an expect (E) value of 3e-118.
Alignments with smaller E-values are more statistically significant and
are less likely to have arisen by chance. Specifically, the E-value
denotes the number of times we expect to see an alignment with scores
equal to or greater than S when we align random sequences against each
other. By default, NCBI BLAST will report matches with an E-value of
less than 10. The last column of the "Descriptions" table contains the
accession number for the match (NP_000221.1), and the link allows you to
access the corresponding GenPept record.

The second entry in the "Descriptions" table shows a match to the
leptin precursor in chimpanzee (*Pan troglodyte*). This match has a
total score of 339 and an E-value of 1e-117. Because the accession
number for this match (NP_001180601.1) begins with "NP\_", this
protein record is from the RefSeq database.

Click on the ``leptin precursor [Pan troglodytes]`` link under the
"Description" column to navigate to the alignment between your input
leptin sequence (query) and the leptin precursor in chimp (subject). The
header in the alignment output shows that the chimp leptin precursor
sequence has a total length of 167 amino acids. The alignment has an
E-value of 1e-117 and there is a single amino acid substitution between
the human and chimp leptin precursors (from V to M). This high degree of
sequence similarity is unlikely to be caused by chance alone.

The alignment is shown as three separate lines; the first (Query) line
is our human sequence, the third (Sbjct) line is the chimpanzee sequence
and the middle (Match) line represents a comparison of the two sequences
(if the amino acids are the same between the two sequences, that amino
acid will be included in the middle line). The value in the "Identities"
field corresponds to the number of identical amino acids between the
query and the subject sequences. In this case, this field (166/167)
indicates that, of the 167 amino acids in the alignment between the
human and chimp sequences, 166 of the amino acids (99%) are identical.
The one difference between these two sequences can be found at position
94 of the query (as a + in the Match line), where the human sequence is
a valine (V) and the chimp sequence is a methionine (M).

.. admonition:: Question 6
   :class: admonition-question

   What is the relationship between V and M that warranted a +
   designation?


Click on the ``Descriptions`` tab in the blue toolbar. To investigate
protein matches to water buffalo (with the scientific name *Bubalus
bubalis*), enter ``Bubalus bubalis (taxid:89462)`` into the "Organism"
text box under the "Filter Results" section and then click on the
``Filter`` button.

.. figure:: /_static/images/exercise/Figure6.png
   :alt: Use the filter functionality to only show matches to Bubalus bubalis
   :name: figure_6

   Identify the best match to the human leptin protein in water buffalo.


The best RefSeq match is to the leptin precursor in *Bubalus bubalis*
with the accession number NP_001277830.1. The E-value is still very
highly significant at 6e-96
(:numref:`Figure %s <figure_6>`). Click on the link under the "Description"
column to navigate to the alignment. The *Bubalus bubalis* leptin
precursor protein has the same length (167aa) as the human leptin
precursor protein, and the entire protein is included in the alignment.
However, there are many more changes and the percent identity are
substantially lower than the alignment to the chimp leptin precursor
protein.


.. admonition:: Question 7
   :class: admonition-question

   What is the percent identity and percent positives (aka percent
   similarity) of the *Bubalus bubalis* leptin sequence compared to the human
   sequence?


You can also perform searches against the database of specific
organisms. Go to the BLAST page at
https://blast.ncbi.nlm.nih.gov/Blast.cgi and enter the name of the
organism in the text box under the "BLAST Genomes" section to search its
databases. For example, enter ``chimpanzee (taxid:9598)`` into the text
box and then click on the ``Search`` button to search the chimp assembly.
To search for protein matches, click on the ``blastp`` tab, paste the
human leptin precursor sequence into the ``Enter Query Sequence`` text
box, and then verify that the ``RefSeq protein`` option is selected under
the "Database" field.

Because the database is much smaller (with 64,958 sequences) than the nr
database, this search will run much faster than our previous search.
Click on the BLAST button to begin your search.

.. admonition:: Archived
   :class: admonition-archived

   For teaching purposes, the result of this blastp search
   against the chimpanzee RefSeq protein database
   has been archived and is available at the
   `CyVerse Data Store
   <https://data.cyverse.org/dav-anon/iplant/commons/community_released/gea/intro-blast-human-leptin/ncbi-web-blast-archives/blastp_Human_Leptin_Chimp_RefSeq.html>`__.


.. admonition:: Question 8
   :class: admonition-question

   How many hits came out of this search? Are all of them significant
   matches?  Does the BLAST result support the hypothesis that chimps
   have a homolog of the human *leptin* gene?


.. tip::

   In general, we consider matches with E-values
   less than 1 x 10\ :sup:`-5` as statistically significant.


To further support your hypothesis that chimps have a homolog of the
human leptin precursor gene, you can use the putative chimp leptin
precursor sequence and do a blastp search against the human RefSeq
protein sequence database. To do this, select the ``Alignments`` tab and
then click on the GenPept link in the blue toolbar above the sequence
you believe to be the chimp leptin precursor to retrieve the GenPept
protein record. Click on the ``FASTA`` link under the gene name to obtain
the full chimp protein sequence in FASTA format. Copy the sequence and
go back to the NCBI BLAST page and click on the ``Human`` link under the
"BLAST Genomes" section. Select the blastp program, verify that the
``RefSeq protein`` option is selected under the "Database" field, and then
paste the chimp leptin precursor sequence into the ``Enter Query
Sequence`` text box.

Click on the BLAST button to perform your search.

.. admonition:: Archived
   :class: admonition-archived

   For teaching purposes, the result of this blastp search
   against the human RefSeq protein database
   has been archived and is available at the
   `CyVerse Data Store
   <https://data.cyverse.org/dav-anon/iplant/commons/community_released/gea/intro-blast-human-leptin/ncbi-web-blast-archives/blastp_Chimp_Leptin_Human_RefSeq.html>`__.


.. admonition:: Question 9
   :class: admonition-question

   Did you obtain matches from this search? Are they significant
   matches? How do you know? Does this data support your hypothesis about
   the presence of a leptin homolog in chimps?


Information accessed August 2, 2019.