Practical - Exploring information resources in molecular biology - Galaxy


Provide answers to questions that are indicated in italics and marked with 'Q'.

Galaxy - What promoters contain TAF1 binding sites?

The following exercise is based on a tutorial here. We assume that you have obtained, in a ChIP-seq experiment, a list of sites in the human genome where the transcription factor TAF1 binds (TAF1 = Transcription initiation factor TFIID subunit 1). The list that you are to use in this exercise is here. Start by saving it to your local computer. You now want to see which gene promoters overlap with the TAF1-binding sites, and you want to view these in the UCSC browser.

We make use of Galaxy.

First use Tools => Get Data => Upload file => Browse. Choose the file that you saved above. Click Execute.

To the right of the Galaxy screen there is a section labeled "History". When an operation in Galaxy is complete the entry in History will show up in green. When this has happened for your uploaded file, click on the link to it to view some of its properties. Click the question mark "?" right after the expression "tabular, database:" to edit the properties of the data:

For the Database/ Build - select Human Mar. 2006 (hg18).

Click Save.

Click the pencil icon in the green area of "taf1_CHIP.txt" to edit the data type. Change from the default value to "interval". Click Save.

Also change Chrom column to "2", Start column to "3" and End column to "4".  Change Name/Identifier column to "5".
Click Save.

Next we will get gene annotation data from the UCSC browser. Select "Get Data"  from the main Tools menu. Follow the link to the UCSC Main table browser.
Select the Mar. 2006 version of the genome and the RefSeq Genes track of the "Genes and Gene Prediction Tracks" group.
Select "genome" as the region.  Set the output format to "BED - browser extensible data". Click "get output". In the following screen do not modify anything, just click "Send query to Galaxy". Now gene coordinates will be uploaded to Galaxy.  In the resulting green window under "History" click the pencil icon and rename the dataset to "Refseq". Click Save. 

We are interested in promoter regions and we therefore want to include a region upstream of the genes as listed in the Refseq dataset.   To do this, go to Tools => Operate on Genomic Intervals => Get flanks.  Specify "Length of the flanking region(s)" as "1000".   Click Execute. Rename the new data set to "Promoters". Click Save. 
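
To make clear what a 1000-base upstream flank means, here is a minimal Python sketch. It is only an illustration of the idea and not Galaxy's own implementation. The function name upstream_flank and the example genes are hypothetical, and coordinates are assumed to follow the BED convention (0-based start, end not included).

```python
def upstream_flank(chrom, start, end, strand, length=1000):
    """Return the region of `length` bases upstream of a feature.

    Coordinates are assumed to be BED-style: 0-based start, exclusive end.
    """
    if strand == "-":
        # On the minus strand, "upstream" lies beyond the end coordinate.
        return (chrom, end, end + length, strand)
    # On the plus strand, "upstream" lies before the start; never go below 0.
    return (chrom, max(0, start - length), start, strand)


# Hypothetical genes, not taken from the RefSeq dataset.
genes = [
    ("chr1", 5000, 9000, "+"),
    ("chr1", 20000, 26000, "-"),
    ("chr2", 400, 3000, "+"),
]

promoters = [upstream_flank(*gene) for gene in genes]
for promoter in promoters:
    print(promoter)
```

Note that the flank of a gene on the minus strand is taken after the gene's end coordinate, because that gene is transcribed in the opposite direction.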

In the next step we will clean up the "Promoters" dataset by extracting a limited number of columns from it. Go to Tools => Text manipulation => Cut columns from a table. In the "Cut columns" field enter "c1,c2,c3,c4,c6". Click Execute. Rename the dataset to "Clean promoters".
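
The column cut can be pictured with the following Python sketch. It assumes a tab-separated BED line in which columns 1-4 are chromosome, start, end and name, and column 6 is the strand. The function name cut_columns and the example line are hypothetical.

```python
def cut_columns(line, columns=(1, 2, 3, 4, 6)):
    """Keep only the given 1-based columns of a tab-separated line."""
    fields = line.rstrip("\n").split("\t")
    return "\t".join(fields[c - 1] for c in columns)


# Hypothetical BED line: chrom, start, end, name, score, strand.
bed_line = "chr1\t4000\t5000\tNM_000001\t0\t+"
cleaned = cut_columns(bed_line)
print(cleaned)
```

Here the score column (column 5) is dropped, which corresponds to entering "c1,c2,c3,c4,c6" in the tool.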

Now it is time to join the promoter dataset with the TAF1 binding sites. Go to Tools => Operate on Genomic Intervals => Join. Specify "taf1_CHIP.txt" and "Clean promoters" as the First and Second queries, respectively. Click Execute.

                Q1. How many regions are the result of the join procedure?
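
The idea behind joining two interval datasets can be sketched in Python as follows. This is a simplified illustration and not the algorithm Galaxy uses: it reports every pair of intervals on the same chromosome that share at least min_overlap bases. The function name join_intervals, the parameter min_overlap and all example intervals are hypothetical.

```python
from collections import defaultdict


def join_intervals(first, second, min_overlap=1):
    """Pair each interval in `first` with every overlapping one in `second`.

    Intervals are tuples (chrom, start, end, name) with 0-based starts
    and exclusive ends.
    """
    by_chrom = defaultdict(list)
    for interval in second:
        by_chrom[interval[0]].append(interval)

    joined = []
    for a in first:
        for b in by_chrom.get(a[0], ()):
            # Number of bases shared by a and b (zero or negative if none).
            overlap = min(a[2], b[2]) - max(a[1], b[1])
            if overlap >= min_overlap:
                joined.append(a + b)
    return joined


# Hypothetical binding sites and promoters.
binding_sites = [
    ("chr1", 4200, 4300, "site1"),
    ("chr1", 10000, 10100, "site2"),
    ("chr2", 350, 450, "site3"),
]
promoters = [
    ("chr1", 4000, 5000, "geneA"),
    ("chr2", 0, 400, "geneB"),
]

pairs = join_intervals(binding_sites, promoters)
for pair in pairs:
    print(pair)
```

Each output row contains the columns of the binding site followed by the columns of the promoter it overlaps, so a site that overlaps several promoters appears on several rows.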

Finally, we will examine the results in a genomic context with the help of the UCSC browser. To do this, click the "UCSC main" link in the green History entry that holds the result of the join operation. You are now in a position to view graphically all promoter regions in the genome that overlap with experimental data on TAF1 binding.


Galaxy - What SNPs are unique to Bushmen?

This exercise is based on the tutorial here.

Studies of human evolution show that the root of the human tree occurs in Africa. Analysis of mitochondrial sequences shows that the first lineage to branch off from the human phylogenetic tree is a mitochondrial haplogroup named L0. This group is frequent among the San in Southern Africa and the Sandawe of East Africa. The complete genomes of Southern African individuals to be examined in this exercise are of this old lineage. The individuals, named !Gubi, G/aq'o, D#kgao and !Ai, are Namibian hunter-gatherers known as Khoisan or Bushmen. They will be referred to below as KB1, NB1, TK1 and MD8, respectively. In addition to the Bushmen, the genome of Archbishop Desmond Tutu (referred to below as ABT), a Bantu individual, was sequenced in the same project.

The purpose of these exercises is to
        1) Identify sites on chromosome 7 where the Bushmen KB1 and NB1 are different from the reference sequence and from ABT.
        2) Find exons that contain these SNPs.
        3) Examine more closely the SNPs of one particular gene on chromosome 7.

We will be working with data that originated from a dataset available from Galaxy (http://main.g2.bx.psu.edu/library). Data for chromosome 7 was extracted, and the resulting file is bushmen_chr7_all_snps.txt. Save this file to your computer. It has 24 columns. Column 4 is the human reference sequence, columns 6 and 8 are the KB1 and NB1 Bushmen, and column 14 is ABT.

Connect to Galaxy. Go to "Get data" => Upload File from your computer. As for the TAF1 data in the previous exercise, specify the Database/ Build to be that of Mar. 2006 (hg18).

The SNP data that you are about to analyze is quite a large dataset, with about 600,000 SNPs. First, we will filter the data. Select "Filter and Sort" => "Filter data on any column using simple expressions".

Specify in the box "With following condition:" the following:
 
        c4 != 'N' and c14 != 'N' and c6 != 'N' and c8 != 'N'

(The symbol == means "equal to" and the symbol != means "not equal to".) 'N' is the symbol used in the SNP table to denote that the base in that position is unknown. We want to filter out such positions because we only want to compare positions where we know the base for each of the individuals that we are examining.
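
The filter condition is also valid Python syntax, so its effect can be tried out on a few rows in a short sketch. The helper names make_row and known_in_all, the "." placeholder and the example rows are all hypothetical. Only the column positions (4 = reference, 6 = KB1, 8 = NB1, 14 = ABT) come from the description of the file above.

```python
def make_row(ref, kb1, nb1, abt, width=24):
    """Build a hypothetical 24-column row; unused columns hold '.'."""
    fields = ["."] * width
    # Columns are 1-based in Galaxy, 0-based in Python lists.
    fields[3], fields[5], fields[7], fields[13] = ref, kb1, nb1, abt
    return fields


def known_in_all(fields):
    """True if the base is known for the reference, KB1, NB1 and ABT."""
    c4, c6, c8, c14 = fields[3], fields[5], fields[7], fields[13]
    return c4 != 'N' and c14 != 'N' and c6 != 'N' and c8 != 'N'


rows = [
    make_row("A", "G", "G", "A"),  # all four known: kept
    make_row("A", "N", "G", "A"),  # KB1 unknown: removed
    make_row("C", "C", "T", "N"),  # ABT unknown: removed
]

kept = [row for row in rows if known_in_all(row)]
print(len(kept), "of", len(rows), "rows kept")
```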

Click "Execute" to start the filtering procedure. The filtering could take a few minutes.

                Q2. How many SNPs remain after the filtering?

The next step is to identify the SNPs characteristic of the Bushmen. We apply another filter to the result of the previous filtering, in the same way as above, but now with the condition:

        c4 != c6 and c4 != c8 and c14 != c6 and c14 != c8

                Q3. How many SNPs remain after this filtering?
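
To see which combinations of bases pass the second condition, it can be written as a small Python function and tried on a few cases. The function name is_bushmen_specific and the example bases are hypothetical, and the sketch assumes that each column holds a single base.

```python
def is_bushmen_specific(ref, kb1, nb1, abt):
    """Mirror of: c4 != c6 and c4 != c8 and c14 != c6 and c14 != c8."""
    return ref != kb1 and ref != nb1 and abt != kb1 and abt != nb1


# (reference, KB1, NB1, ABT) for four hypothetical positions.
examples = [
    ("A", "G", "G", "A"),  # both Bushmen differ from reference and ABT
    ("A", "G", "A", "A"),  # NB1 equals the reference
    ("A", "G", "G", "G"),  # ABT carries the same base as the Bushmen
    ("A", "G", "C", "A"),  # KB1 and NB1 differ from each other
]

results = [is_bushmen_specific(*bases) for bases in examples]
for bases, passed in zip(examples, results):
    print(bases, "kept" if passed else "removed")
```

The last example shows that the condition does not require KB1 and NB1 to carry the same base, only that each of them differs from both the reference and ABT.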


Rename the results of the last operation "SNPs".

The final step is to identify the SNPs that are in exons.

We obtain the exon information from UCSC. Go to "Get Data" => "UCSC Main". In the UCSC interface, select the assembly Mar. 2006, the group "Genes and Gene Prediction Tracks" and the track "UCSC Genes". Make sure chr7 is selected as the region. Set the output format to BED. Click "get output". On the resulting page, under "Create one BED record per:", select "Exons plus" instead of the default "Whole Gene". Click "Send query to Galaxy".

Rename the resulting dataset to "Exons".

Go to "Operate on genomic intervals" => "Join".   Join the dataset "SNPs" with "Exons". Click "Execute".   

                Q4. How many regions were identified in this final step?
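
Conceptually, this join asks for each SNP whether its position falls inside an exon. A minimal Python sketch of that test is shown below. The function name snps_in_exons, the coordinates and the exon names are all hypothetical, and positions are assumed to be 0-based with exclusive exon ends.

```python
def snps_in_exons(snps, exons):
    """Return one (chrom, position, exon name) row per SNP/exon match."""
    hits = []
    for chrom, pos in snps:
        for exon_chrom, start, end, name in exons:
            if chrom == exon_chrom and start <= pos < end:
                hits.append((chrom, pos, name))
    return hits


# Hypothetical SNP positions and exons.
snps = [("chr7", 1050), ("chr7", 3000), ("chr7", 5200)]
exons = [
    ("chr7", 1000, 1200, "transcriptA_exon1"),
    ("chr7", 1000, 1100, "transcriptB_exon1"),
    ("chr7", 5000, 5500, "transcriptA_exon2"),
]

hits = snps_in_exons(snps, exons)
for hit in hits:
    print(hit)
```

In this sketch the first SNP is reported twice because it lies in exons of two different transcripts, so the number of rows after a join can be larger than the number of distinct SNPs.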

The results may now be visualized by clicking on the link to "UCSC main" in the green box.

Visualize the region chr7:141,318,330-141,320,555. It contains the gene for the receptor TAS2R38. You will find that this gene has SNPs unique to the Bushmen. Examples are

            chr7:141,319,813    rs713598 G>C
            chr7:141,319,072    rs10246939 A>G

The TAS2R38 receptor is known to facilitate the tasting of the synthetic organic compound phenylthiocarbamide (PTC). Not all human individuals are able to taste this compound. The difference in sensitivity to PTC was originally discovered in a laboratory incident. Arthur L. Fox was pouring PTC powder into a bottle when some of it "flew around in the air". A colleague of Fox complained about a strong bitter taste, but Fox could taste nothing at all. Fox then tested a larger number of individuals and found that most of them fell into two categories: either they were able to taste PTC or they were not. There was apparently a genetic component, as individuals were much more likely to taste PTC if other members of their family did.

PTC is a synthetic organic compound, but what is the biological role of the tasting? It was eventually learned that PTC is chemically related to toxic alkaloids in poisonous plants. About 70 % of all human individuals are able to taste PTC. How do humans benefit from being able to taste PTC and related alkaloids in plants? It seems likely that it helps them to avoid toxic plants, and one could expect this function to be selected for in foraging societies. The two SNPs referred to above, rs713598 and rs10246939, are both associated with bitter tasting.

In the UCSC browser, display the "Genome variants" track. 

            Q5. What individuals have the SNPs rs713598 and rs10246939 and as a result are expected to taste PTC?




If you are interested in more tutorials on Galaxy, see here.


Alternative Galaxy servers:
Show your results to a supervisor!!

TS March, 2012