Practical - Exploring information resources in molecular
biology - Galaxy
Provide answers to questions that are indicated
in italics and marked with 'Q'.
Galaxy - What promoters contain TAF1 binding sites?
The following exercise is based on a tutorial here.
We assume that you have obtained, from a ChIP-Seq experiment, a list
of sites in the human genome where the transcription factor TAF1
(Transcription initiation factor TFIID subunit 1) binds. The list
that you are to use in this exercise is here. Start by saving it to
your local computer. You now want to see which gene promoters
overlap with the TAF1-binding sites, and you want to view these in
the UCSC browser.
We make use of Galaxy.
First use Tools => Get Data => Upload file => Browse.
Choose the file that you saved above. Click Execute.
To the right of the Galaxy screen there is a section labeled
"History". When an operation in Galaxy is complete the entry in
History will show up in green. When this has happened for your
uploaded file, click on the link to it to view some of its
properties. Click the question mark "?" right after the expression
"tabular, database:" to edit the properties of the data:
For Database/Build, select Human Mar. 2006 (hg18).
Click Save.
Click the pencil icon in the green area of
"taf1_CHIP.txt" to edit the data type. Change from the default
value to "interval". Click Save.
Also change Chrom column to "2", Start column to "3" and End
column to "4". Change Name/Identifier column to "5".
Click Save.
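To make the column settings concrete, here is a minimal Python sketch of how one tab-separated line is read as an interval when Chrom, Start, End and Name are set to columns 2, 3, 4 and 5. The function name and the example line are hypothetical; the values are made up and are not taken from taf1_CHIP.txt.

```python
def parse_interval(line, chrom_col=2, start_col=3, end_col=4, name_col=5):
    """Read one tab-separated line as an interval (column numbers are 1-based)."""
    fields = line.rstrip("\n").split("\t")
    return {
        "chrom": fields[chrom_col - 1],
        "start": int(fields[start_col - 1]),
        "end": int(fields[end_col - 1]),
        "name": fields[name_col - 1],
    }

# Hypothetical example line; the values are invented for illustration.
example = "1\tchr7\t1000\t1500\tsite_1"
print(parse_interval(example))
```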
Next we will get gene annotation data from the UCSC browser.
Select "Get Data" from the main Tools menu. Follow the link to the UCSC Main table browser.
Select the Mar. 2006 version of the genome and the RefSeq Genes
track of the "Genes and Gene Prediction Tracks" group.
Select "genome" as the region. Set the output format to "BED
- browser extensible data". Click "get output". In the following
screen do not modify anything, just click "Send query to Galaxy".
Now gene coordinates will be uploaded to Galaxy. In the
resulting green window under "History" click the pencil icon and
rename the dataset to "Refseq". Click Save.
We are interested in promoter regions and we therefore want to
include a region upstream of the genes as listed in the Refseq
dataset. To do this, go to Tools => Operate on
Genomic Intervals => Get flanks. Specify "Length of the
flanking region(s)" as "1000". Click Execute. Rename
the new data set to "Promoters". Click Save.
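As an illustration of what an upstream flank means, the sketch below computes a 1000 bp region in front of a gene. It assumes BED-style coordinates (0-based start, end not included) and assumes that "upstream" depends on the strand of the gene; check the options of the Get flanks tool to see what Galaxy actually does. The function name and the coordinates are hypothetical.

```python
def upstream_flank(start, end, strand, length=1000):
    """Return (flank_start, flank_end) of the region upstream of a gene.

    Coordinates are BED-style: 0-based start, end not included.
    """
    if strand == "-":
        # On the minus strand, upstream lies beyond the end coordinate.
        return end, end + length
    # On the plus strand, upstream lies before the start coordinate.
    # The flank is clipped so that it never starts before position 0.
    return max(0, start - length), start

# Invented gene coordinates, one gene on each strand.
print(upstream_flank(5000, 9000, "+"))
print(upstream_flank(5000, 9000, "-"))
```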
In the next step we will clean up the "Promoters" dataset by
extracting a limited number of columns from it. To do this, go to
Tools => Text manipulation => Cut columns from a table. In
the "Cut columns" field enter "c1,c2,c3,c4,c6". Click Execute.
Rename the dataset to "Clean promoters".
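The effect of the column specification "c1,c2,c3,c4,c6" can be sketched in Python as follows. This is only an illustration of the idea, not the Galaxy tool itself; the function name and the example row (including the identifier NM_000001) are made up.

```python
def cut_columns(line, spec="c1,c2,c3,c4,c6"):
    """Keep only the columns named in spec (c1 is the first column)."""
    fields = line.rstrip("\n").split("\t")
    # "c4" -> index 3, and so on.
    indices = [int(name[1:]) - 1 for name in spec.split(",")]
    return "\t".join(fields[i] for i in indices)

# Invented BED-like row: chrom, start, end, name, score, strand.
row = "chr7\t4000\t5000\tNM_000001\t0\t+"
print(cut_columns(row))
```

In this example the fifth column (the score) is dropped, while the position, name and strand are kept.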
Then, it is time to join the promoter dataset with the TAF1
binding sites. To do that, go to Tools => Operate on
Genomic Intervals => Join. Specify as First and Second queries
taf1_CHIP.txt and "Clean promoters", respectively. Click Execute.
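Conceptually, a join of two interval datasets pairs up every interval in the first dataset with every interval in the second that overlaps it on the same chromosome. The sketch below shows this with a simple all-against-all comparison; it is an assumption about the logic, not Galaxy's implementation, and the names and coordinates are invented.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open intervals share at least one base."""
    return a_start < b_end and b_start < a_end

def join_intervals(first, second):
    """Pair every interval in first with each overlapping interval in second.

    Each interval is a tuple starting with (chrom, start, end, ...).
    """
    joined = []
    for a in first:
        for b in second:
            if a[0] == b[0] and overlaps(a[1], a[2], b[1], b[2]):
                joined.append(a + b)
    return joined

# Invented binding sites and promoters.
sites = [("chr7", 4500, 4600, "site_1"), ("chr7", 20000, 20100, "site_2")]
promoters = [("chr7", 4000, 5000, "geneA", "+")]
result = join_intervals(sites, promoters)
print(len(result))
```

Here only site_1 lies within a promoter, so the join yields a single region.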
Q1. How many regions are the result of the join procedure?
Finally, we will examine the results in a genomic context with
the help of the UCSC browser. To do this, click on the "UCSC main"
link in the green History entry produced by the join operation.
You are now in a position to graphically view all
promoter regions in the genome that overlap with experimental data
on TAF1 binding.
Galaxy - What SNPs are unique to Bushmen?
This exercise is based on the tutorial here.
Studies of human evolution show that the root of the human tree
occurs in Africa. Analysis of mitochondrial sequences shows that the
first lineage to branch off from the human phylogenetic tree is a
mitochondrial haplogroup named L0. This group is frequent among San
in Southern Africa and in the Sandawe of East Africa. The complete
genomes of Southern African individuals to be examined in this
exercise are of this old lineage. The individuals, named !Gubi,
G/aq'o, D#kgao and !Ai, are Namibian hunter-gatherers known as
Khoisan or Bushmen. They will be referred to below as KB1, NB1, TK1
and MD8, respectively. In addition to the Bushmen, the genome of
Archbishop Desmond Tutu (referred to below as ABT), a Bantu
individual, was sequenced in the same project.
The purpose of these exercises is to:
1) Identify sites in chromosome 7 where the Bushmen KB1 and NB1 are
different from the reference sequence and from ABT.
2) Find exons that contain these SNPs.
3) Examine more closely the SNPs of one particular gene on
chromosome 7.
We will be working with data that originated from a dataset
available from Galaxy (http://main.g2.bx.psu.edu/library). Data for
chromosome 7 was extracted and the resulting file is
bushmen_chr7_all_snps.txt. Save this file to your computer. It has
24 columns. Column 4 is
the human reference sequence, columns 6 and 8 are the KB1 and NB1
Bushmen and column 14 is ABT.
Connect to Galaxy. Go to "Get Data" => "Upload File from your
computer". As for the TAF1 data in the previous exercise, specify
the Database/Build to be Mar. 2006 (hg18).
The SNP data that you are about to analyze is quite a large dataset
with about 600,000 SNPs. First, we will filter the data. Select
"Filter and Sort" =>
"Filter data on any column using simple expressions"
Specify in the box "With following condition:" the following:
c4 != 'N' and c14 != 'N' and c6 != 'N' and c8 != 'N'
(The symbol == means "equal to" and the symbol != means "not equal
to".) 'N' is the symbol used in the SNP table to denote that the
base in that position is unknown.
We want to filter out such positions because we only want to compare
positions where we know the base for each of the individuals that we
are examining.
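The condition can be read as an ordinary Python-style expression over the columns of each row. The sketch below only illustrates its logic and is not Galaxy's own code; the helper make_row and the example bases are made up, and "." stands in for the columns we do not look at.

```python
def all_bases_known(fields):
    """True if reference (c4), KB1 (c6), NB1 (c8) and ABT (c14) are all known."""
    c4, c6, c8, c14 = fields[3], fields[5], fields[7], fields[13]
    return c4 != 'N' and c14 != 'N' and c6 != 'N' and c8 != 'N'

def make_row(ref, kb1, nb1, abt, ncols=24):
    """Build a hypothetical 24-column row with only the four bases filled in."""
    row = ["."] * ncols
    row[3], row[5], row[7], row[13] = ref, kb1, nb1, abt
    return row

print(all_bases_known(make_row("G", "C", "C", "G")))  # kept by the filter
print(all_bases_known(make_row("G", "N", "C", "G")))  # removed: KB1 unknown
```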
Click "Execute" to start the filtering procedure. The filtering could
take a few minutes.
Q2. How many SNPs remain after the filtering?
The next step is to identify the SNPs characteristic of the Bushmen.
We apply another filter in the same way as above, but now with the
condition:
c4 != c6 and c4 != c8 and c14 != c6 and c14 != c8
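To see what this condition selects, it can be written as a small Python function over the four bases (reference = c4, KB1 = c6, NB1 = c8, ABT = c14). The function name and the example bases are hypothetical and only serve to illustrate the logic.

```python
def bushmen_specific(ref, kb1, nb1, abt):
    """True if both Bushmen differ from the reference and from ABT."""
    return ref != kb1 and ref != nb1 and abt != kb1 and abt != nb1

# Both Bushmen carry C, reference and ABT carry G: the position is kept.
print(bushmen_specific("G", "C", "C", "G"))
# NB1 agrees with the reference: the position is removed.
print(bushmen_specific("G", "C", "G", "G"))
# ABT shares the Bushmen base: the position is removed.
print(bushmen_specific("G", "C", "C", "C"))
```

Note that the condition does not require KB1 and NB1 to carry the same base, only that each of them differs from both the reference and ABT.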
Q3. How many SNPs remain after this filtering?
Rename the results of the last operation "SNPs".
The final step is to identify the SNPs that are in exons.
We obtain the exon information from UCSC. Go to "Get Data" =>
"UCSC Main". In the UCSC interface, select the assembly Mar. 2006,
the group "Genes and Gene Prediction Tracks" and the track "UCSC
Genes". Make sure chr7 is selected. Set the output format to BED.
Click "get output". In the
resulting page select under "Create one BED record per:" "Exons
plus" instead of the default "Whole Gene". Click "Send query to
Galaxy".
Rename the resulting dataset to "Exons".
Go to "Operate on genomic intervals" => "Join". Join
the dataset "SNPs" with "Exons". Click "Execute".
Q4. How many regions were identified in this final step?
The results may now be visualized by clicking on the link to "UCSC
main" in the green box.
Visualize the region chr7:141,318,330-141,320,555. It contains the
gene for the receptor TAS2R38.
You will find that this gene has SNPs unique to the Bushmen.
Examples are:
chr7:141,319,813 rs713598 G>C
chr7:141,319,072 rs10246939 A>G
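As a quick check, the sketch below parses UCSC-style position strings and confirms that the two SNP positions listed above lie inside the region chr7:141,318,330-141,320,555. The function name is hypothetical, and the position strings are simply treated as inclusive coordinates.

```python
def parse_position(text):
    """Split a position such as 'chr7:141,318,330-141,320,555' into parts."""
    chrom, _, span = text.partition(":")
    span = span.replace(",", "")
    if "-" in span:
        start, end = span.split("-")
    else:
        # A single coordinate is treated as a one-base region.
        start = end = span
    return chrom, int(start), int(end)

region = parse_position("chr7:141,318,330-141,320,555")
for snp in ("chr7:141,319,813", "chr7:141,319,072"):
    chrom, pos, _ = parse_position(snp)
    inside = chrom == region[0] and region[1] <= pos <= region[2]
    print(snp, inside)
```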
The TAS2R38 receptor is known to facilitate the tasting of the
synthetic organic compound phenylthiocarbamide (PTC). Not all human
individuals are able to taste this compound. The difference in
sensitivity to PTC was originally discovered in a laboratory
incident. Arthur L. Fox was pouring PTC powder into a bottle when
some of it "flew around in the air". A colleague of Fox complained
about a strong bitter taste, but Fox could taste nothing at all. Fox
then tested a larger number of individuals and found that most of
them fell into two categories: either they were able to taste PTC
or they were not.
There was apparently a genetic component, as individuals were much
more likely to taste PTC if other members of their family did.
PTC is a synthetic organic compound, but what is the biological role
of the tasting? It was eventually learned that PTC is chemically
related to toxic alkaloids in poisonous plants. About 70 % of all
human individuals are able to taste PTC. How do humans benefit from
being able to taste PTC and related alkaloids in plants? It seems
likely that it helps them to avoid toxic plants, and one could expect
this function to be selected for in foraging societies. The two
SNPs referred to above, rs713598 and rs10246939, are both
associated with bitter tasting.
In the UCSC browser, display the "Genome variants" track.
Q5. What individuals have the SNPs rs713598 and rs10246939 and as a
result are expected to taste PTC?
If you are interested in more tutorials on Galaxy, see here.
Alternative Galaxy servers:
- http://galaxy.nbic.nl/
- http://preview.galaxy.dbcls.jp/
- http://galaxy.cbiit.cuhk.edu.hk/
- http://bio.biomedicine.gu.se:8080/
Show your results to a supervisor!!
TS March, 2012