Practical - Exploring information resources in molecular biology - the UCSC genome browser.


  Provide answers to questions that are indicated in italics and marked with 'Q'.

The purpose of these exercises is to introduce a web-based system for accessing genome information and other molecular biology data, the UCSC Genome Browser. The exercises will also introduce basic concepts of the human genome such as the organisation of genes, conservation with respect to other animals, repeat elements and individual variation. We will see how a gene is found in the browser with different methods and examine the structure of that gene. We will also use some more advanced features of the UCSC browser, including the "Table Browser" that allows you to download data of interest.

For help and tutorials on the UCSC pages see also these links:
    http://genome.ucsc.edu/goldenPath/help/hgTracksHelp.html
    http://www.openhelix.com/downloads/ucsc/ucsc_home.shtml
        UCSC Genome Browser : An Introduction
        UCSC Genome Browser : Custom tracks and Table Browser

In case of network difficulties with the UCSC browser, try any of these mirrors.

Part I. Basic usage of the UCSC browser - Factor IX.    

 
We will start with an examination of the blood coagulation Factor IX, also known as the Christmas factor. It is of medical importance because a genetic deficiency of the protein causes hemophilia B. Go to the UCSC Genome Browser. At the top of the page you see several links (Genomes, Blat ...); choose "Genome" or "Genome Browser" to the left. Select Mammal under "clade" and Human under "genome"; we will use the Feb 2009 "assembly". In "position/search" you may enter a genomic position or the accession number (identification number) of the desired sequence, as well as a search term (a gene name, for instance). So type 'factor ix', 'factor 9', or 'F9' in the position/search window and click Submit. You will get a list of matching sequences; click on the F9 (uc004fas.1) sequence link under the heading "UCSC genes".

You will now see the actual genome browser window, showing the contents of the selected chromosomal region. There are sequences aligned to that part of the genome, as well as additional information. At the top of the page there are buttons for zooming in/out and moving left/right. On the "position/search" line you can also see the chromosomal positions currently being displayed. Below the graphic window are buttons, including a refresh button, and a list of tracks that can be turned on/off.

Let's focus on the graphic display window. There are many sections or "tracks". Almost at the top you find "UCSC Genes (RefSeq, ...)" in blue (where the Factor IX gene is), and below depending on what tracks are on, there might be displayed RefSeq genes, Human and Non-Human mRNAs (black), alignments of different mammals, simple nucleotide polymorphisms (SNPs), etc.

First, we will get to a summary of information about the Factor IX gene by clicking on one of the sequences in the window (or by clicking the F9 to the left of the graphic window). You will be taken to a new page with loads of information about the F9 gene. Among other sources of data, we see references to 1) GenBank mRNAs, 2) UniProt/SwissProt and 3) PDB. For instance, identify the heading "Descriptions from all associated GenBank mRNAs". Click on one of the mRNAs listed there, M11309; the resulting page has information on that single sequence (it is a GenBank mRNA entry). Go back to the page "Human Gene F9 (uc004fas.1) Description and Page Index" and try to find the answers to the following questions.

Q1. What is the function of Factor IX?

Q2. Is the protein found inside the cell or outside?

Q3. Are there any 3D structures for the protein? If so, what methods were used to derive these structures?

In a section about Biochemical and Signalling Pathways, there is a link to BioCarta: "h_intrinsicPathway". Click on it and you will get a figure of the pathway in which Factor IX takes part. You can see several other coagulation Factors (X, XI, XII ...) and a calcium ion at FIX.

Q4. What protein acts on Factor IX in order to change it to its active form (FIXa)?

Now we return to the main graphic browser window. Look at our Factor IX gene: introns are shown as "fish bones" (>>>), the arrows pointing in the gene direction. Exons are shown as lines/boxes and UTRs (untranslated regions) are thinner lines/boxes.

Q5. On what chromosome is Factor IX located?

Q6. How many exons are there in the Factor IX gene?

Q7. There are two variants of F9 (uc004fas.1 and uc004fat.1) shown in the browser. They are two different transcripts of the same gene. What is the difference between them?

Zoom in on the first exon. If you have a sufficiently small window the amino acid sequence encoded by the exon will be displayed.

Q8. How many amino acids does the coding part of the exon code for?

Q9. How many nucleotides are within the 5' UTR?

Now zoom out to see the environment of the F9 gene.

Q10. What human gene is located "to the right" of Factor IX, and what strand is it on? Are there different transcripts of this gene and, if so, how many?

Q11. What human gene is located "to the left" of Factor IX, and what strand is it on? What are the approximate distances between the Factor IX gene and the respective flanking genes?
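
The distance estimate asked for in Q11 can also be computed from the position strings shown on the "position/search" line. The following is a minimal Python sketch, not part of the UCSC tools; the function names and the coordinates are made up for illustration and are not the real gene positions.

```python
import re
from typing import NamedTuple


class Region(NamedTuple):
    chrom: str
    start: int  # 1-based, inclusive, as shown in the position box
    end: int    # 1-based, inclusive


def parse_position(text: str) -> Region:
    """Parse a browser-style position such as 'chrX:1,000-2,000'."""
    match = re.fullmatch(r"\s*(\w+):([\d,]+)-([\d,]+)\s*", text)
    if match is None:
        raise ValueError(f"not a position string: {text!r}")
    chrom, start, end = match.groups()
    return Region(chrom, int(start.replace(",", "")), int(end.replace(",", "")))


def gap_between(a: Region, b: Region) -> int:
    """Bases separating two regions on one chromosome (0 if they touch or overlap)."""
    if a.chrom != b.chrom:
        raise ValueError("regions are on different chromosomes")
    first, second = sorted([a, b], key=lambda r: r.start)
    return max(0, second.start - first.end - 1)


# Made-up coordinates, for illustration only
gene_a = parse_position("chrX:1,000-2,000")
gene_b = parse_position("chrX:2,501-3,000")
print(gap_between(gene_a, gene_b))  # bases 2,001 to 2,500 lie in between
```

To use it for the exercise, replace the two made-up strings with the positions you read from the browser for Factor IX and one of its neighbours.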



Part II. Pharmacological and medical implications of human genetic variation

In this part we will be looking at SNPs, a major form of human individual variation. We will examine some medically and pharmacologically important SNPs.

1. Clearance of methotrexate.

Methotrexate is a drug used to treat malignancies such as acute lymphoblastic leukemia (ALL). Methotrexate is an inhibitor of folate metabolism (it blocks the enzyme dihydrofolate reductase); it therefore reduces DNA synthesis and causes apoptosis in proliferating cells. However, human individuals respond differently to this drug and for this reason it is important to identify the genetic factors involved. In one study a gene named SLCO1B1 was identified as being associated with methotrexate clearance. SLCO1B1 encodes an organic anion transporter that is responsible for disposal of not only methotrexate but many other drugs. There are a number of relevant SNPs in SLCO1B1. Two of these are shown below (based on Ramsey et al., Genome Res. 2012).

hg19 chr12 location   SNP          Ref allele   Variant allele   AA sub   Change in methotrexate clearance
21331799              rs4149056    T            C                V174A    decrease
21330063              rs11045819   C            A                P155T    increase

Start by viewing the rs4149056 SNP in the browser window. You can find it by making a text based query like you did for F9 above.
Q13. Consider the exon with rs4149056. Select to display the SNP information in dbSNP 135 (all SNPs). How many synonymous and non-synonymous SNPs does the exon have?

Consider the different human individuals represented in the UCSC browser. To select that track, find "Genome Variants" under "Variation and Repeats" and change "hide" to "full".

Q14. According to the SNPs shown in the Table above, what individuals are likely to have a decreased and increased methotrexate clearance, respectively?
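
If you note down the alleles you see for each individual, the comparison against the SNP table can be written as a small lookup. This is only a sketch: the allele values come from the table above, while the function name, the individuals and their genotypes are hypothetical.

```python
# Reference and variant alleles copied from the SNP table (Ramsey et al. 2012)
SNP_EFFECTS = {
    "rs4149056": {"ref": "T", "variant": "C", "aa_sub": "V174A", "clearance": "decrease"},
    "rs11045819": {"ref": "C", "variant": "A", "aa_sub": "P155T", "clearance": "increase"},
}


def predicted_clearance(snp_id, alleles):
    """Return the table's effect if the variant allele is among the observed alleles."""
    info = SNP_EFFECTS[snp_id]
    return info["clearance"] if info["variant"] in alleles else "reference"


# Hypothetical individuals and genotypes, for illustration only
observed = {
    "person_1": {"rs4149056": "TC", "rs11045819": "CC"},
    "person_2": {"rs4149056": "TT", "rs11045819": "CA"},
}

for person, genotypes in observed.items():
    for snp_id, alleles in genotypes.items():
        print(person, snp_id, alleles, predicted_clearance(snp_id, alleles))
```

The sketch treats a single copy of the variant allele as sufficient, which is a simplifying assumption; it says nothing about the size of the effect.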

2. Warfarin dosage and VKORC1

Warfarin is a vitamin K antagonist and was initially marketed as a pesticide against rats. It is also widely used in medicine to inhibit blood coagulation. Therapy with warfarin is characterized by a wide inter-individual variation in dose requirements, and accurate dosing is critical for safely managing patients on this drug. Warfarin inhibits vitamin K epoxide reductase, or to be more precise one of the subunits of this enzyme, which is encoded by the VKORC1 gene. There is one SNP in this gene known to affect the efficacy of warfarin, rs2359612. An individual with the genotype A/A should be given a lower dose of warfarin than an individual with the genotype G/G.

Q15. What individuals, according to what you see in the UCSC browser, should be given a relatively high dose of warfarin in case they needed such therapy?


3. SNP rs6025

The SNP rs6025 encodes a change in a protein from an arginine to a glutamine. The rs6025(A) allele encodes a mutation known as the Leiden mutation, R506Q, found in perhaps 3 to 5% of the individuals in most populations. About 1 in 10 individuals harboring the R506Q will experience clinically significant venous thromboembolism in their lifetimes. Use the UCSC browser to answer this question:

  Q16. What is the name of the protein encoded by the gene that harbors the rs6025 SNP?

Part III. UCSC Table Browser

In this part we will illustrate the use of the UCSC Table Browser. It provides text-based access to the genome assemblies and annotation data stored in the Genome Browser database to retrieve specific data. This tool offers an enhanced level of query support that includes restrictions based on field values, free-form SQL queries, and combined queries on multiple tables. Output can be filtered to restrict the fields and lines returned, and may be organized into one of several formats, including a simple tab-delimited file that can be loaded into a spreadsheet or database as well as advanced formats that may be uploaded into the Genome Browser as custom annotation tracks. The Table Browser provides a convenient alternative to downloading and manipulating the entire genome and its massive data tracks.


Identify genes with CAG repeats

First, let's examine a group of genes that are characterized by three-nucleotide repeats. Such repeat regions are often associated with disease. One example is Huntington's disease, which is a result of the expansion of a three-nucleotide repeat consisting of the triplet CAG. Huntington's disease is not the only disease caused by a CAG triplet repeat expansion; a number of other hereditary neurodegenerative diseases involve such an expansion. One example is the disease DRPLA, involving the gene ATN1.

Here is a short list of genes with trinucleotide repeats from McMurray, C. Mechanisms of trinucleotide repeat instability during human development. Nature Reviews Genetics 11: 786-99:

Disease      Trinucleotide   Gene   Location
DRPLA        CAG             ATN1   CDS, exon 5
Huntington   CAG             HTT    CDS, exon 1
DM1          CTG             DMPK   3' UTR

We will here use the Table browser to identify human genes with CAG repeats.

Go to the UCSC Genome Browser homepage and select the Table Browser, by clicking either of the Table Browser links from the homepage (Table browser/Tables).

In the Table Browser window, select the human genome, the Feb. 2009 assembly and the group "Variation and repeats". Then select the track "Simple Repeats".  For a specific track there may be one or more tables to describe it. In this case there is only one table "simpleRepeat". Click on the button "describe table schema" if you want to get more information on what is contained in that table.

Then we define the genomic region(s) to search. In this case we will search the entire genome and make sure that under "region", the "genome" alternative is selected.

Now click the summary/statistics button.

                Q20. How many simple repeats are there in the human genome (check "item count")?

Now we will filter the data to identify only the repeats with the sequence CAG. Click on the button filter: Create. In the resulting form there are a number of fields but we will only make use of "sequence does match" and enter "CAG" in that field instead of the default asterisk (*). Click Submit. Note that the filter button has now changed to two buttons "edit" and "clear".

Again, click on the summary/statistics button.

                Q21. How many simple repeats with the sequence CAG are there in the human genome?

We now want to output the result. For this example we will leave the boxes Galaxy and GREAT unchecked. Several output formats are available for this data table; which one to choose depends on the data you want to obtain:

The first two ("all fields from selected table" and "selected fields from primary and related tables") return all fields of the primary table, or selected fields from the primary and related tables. The result downloads as a tab-delimited text file that can later be opened in a word processing or spreadsheet program. The next ("sequence") allows you to obtain the DNA or protein sequence of the items in the table in FASTA format. The next two (BED, Browser Extensible Data format, and GTF, Gene Transfer Format) are formats for use in other programs and databases. BED is the format used by the Genome Browser database. The custom track output creates an annotation track of your query in the Genome Browser and the Table Browser for further study. This newly created annotation track can be viewed and searched just as any other annotation track.

Finally, you can get a list of hyperlinks of the data positions in the Genome Browser.

For this example select "all fields from selected table". Leave the field "Output file" empty. Click the "get output" button. A new page will open with a list of CAG repeat regions.

                Q22. How many CAG repeat regions are there on the Y chromosome?
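
Counting rows by eye is error-prone for long lists. If you instead save the output to a file, a few lines of Python can count the repeats per chromosome. This is a sketch under assumptions: it expects a tab-delimited text with a header line that starts with '#' and contains columns named "chrom" and "sequence" (check "describe table schema" for the actual column names), and the sample rows below are made up.

```python
import csv
import io
from collections import Counter


def count_repeats_by_chrom(handle, motif="CAG"):
    """Count rows whose 'sequence' column equals motif, grouped by 'chrom'."""
    reader = csv.DictReader(handle, delimiter="\t")
    # The header line is assumed to start with '#'; strip it from the column names
    reader.fieldnames = [name.lstrip("#") for name in reader.fieldnames]
    counts = Counter()
    for row in reader:
        if row["sequence"] == motif:
            counts[row["chrom"]] += 1
    return counts


# Made-up rows with only a subset of columns, for illustration only
sample = (
    "#chrom\tchromStart\tchromEnd\tsequence\n"
    "chr1\t100\t160\tCAG\n"
    "chrY\t500\t545\tCAG\n"
    "chrY\t900\t960\tGT\n"
)
counts = count_repeats_by_chrom(io.StringIO(sample))
print(counts["chrY"])
```

With a real download you would pass an open file instead of the sample text, for example open("cag_repeats.txt") where the file name is whatever you entered in "Output file".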

4.2 CAG repeats continued - intersections

We saw in the previous exercise an example of filtering data. We will now examine intersections with the Table Browser. The intersection tool allows you to find if two datasets have any overlap. For example, we may want to know if there is any chromosomal location overlap between the “known genes” dataset and the “simple repeat” dataset, and we may in that case want to download the data of this overlap region.
Here we will attempt to identify all “CAG” repeats that are within known genes and download these sequences.

If you return to the Table Browser, your previous search and filter should have remained.

Clicking on the "intersection" create button will take you to an intersection page. 

Here, you choose the group, annotation track and table that you wish to intersect with the table that you selected on the main page. We intersect our simple repeats with UCSC genes to find which of our filtered repeats reside in known genes. Take note, the UCSC Known Genes table will only include coding exons in intersections.

You can choose “any overlap”, “no overlap”, or “at least” or “at most” percentage of overlap. You can also select either an intersection or union of the data sets. Here we choose any overlap. Once you have completed your choices, click submit.

You will find that the “intersection” choice changes to “edit” and “clear” and text appears that shows you the intersection. If we look at the summary/statistics as we did earlier, you will see that by intersecting our filtered repeats with UCSC genes in the entire genome, we’ve narrowed our search.
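
The "any overlap" option corresponds to a simple test on interval coordinates. The sketch below illustrates that test in Python; it is not how the Table Browser is implemented, the function names are made up, and the coordinates are invented. Intervals are written as (chrom, start, end) with the end position excluded.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if half-open intervals [a_start, a_end) and [b_start, b_end) share a base."""
    return a_start < b_end and b_start < a_end


def any_overlap(repeats, genes):
    """Keep each repeat that overlaps at least one gene on the same chromosome."""
    return [
        r for r in repeats
        if any(r[0] == g[0] and overlaps(r[1], r[2], g[1], g[2]) for g in genes)
    ]


# Invented coordinates, for illustration only
repeats = [("chr4", 100, 160), ("chr4", 500, 560), ("chr12", 100, 160)]
genes = [("chr4", 50, 300), ("chr12", 400, 900)]

print(any_overlap(repeats, genes))
```

Comparing every repeat with every gene is fine for a handful of items; for genome-wide data, dedicated interval tools are considerably faster.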

As output format select "hyperlinks to Genome Browser". Click 'get output'. This will lead to a page with hyperlinks to the browser window; each link with a specific CAG repeat region within a known gene.

We will examine four of the repeats. First, follow the link to "trf at chr4:3076604-3076667". You will discover that this is the gene for the Huntingtin protein, HTT. The CAG repeats give rise to a polyglutamine region at the protein level.

Q23. Zoom out so that you view the entire HTT gene in the browser window. Is the polyQ-region located in the N-terminal or C-terminal region of the protein?

Also check out the link to "trf at chr12:7045880-7045938". This will show the gene ATN1 mentioned above.

Then, examine "trf at chr17:17697094-17697134". This links to the gene RAI1. Examine information for this gene.

Q24. What disease may be associated with a polymorphism of the poly-Gln region in the RAI1 gene?

Finally, examine the link to "trf at chr19:46273463-46273524". This is a repeat region in the gene DMPK. We now enter into a bit of confusion as this gene is on the complementary minus strand but repeats are listed in the Genome Browser database by the reference strand orientation.

Q25. What is the trinucleotide repeat of the DMPK gene/mRNA? You may be helped by reversing the sequence in the browser window by clicking the "reverse" button below the browser image window.
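
The "reverse" button shows the other strand, which is the reverse complement of the reference sequence. The same conversion can be done on any short sequence with a few lines of Python; the function name is made up for this sketch and the input is an arbitrary example sequence.

```python
# Translation table mapping each base to its complement (upper and lower case)
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


print(reverse_complement("GATTACA"))  # TGTAATC
```

Apply it to the repeat unit listed for the reference strand to see how it reads on the minus strand.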

Part IV. Uploading your own data to the UCSC browser.

Task: Creating a custom track to show your own information or annotation.

If you have genome-wide information obtained from your own experimental work you may upload it to the UCSC browser. Here is a very simple toy example of how to add a custom track to the browser. Go to the UCSC Genome Browser Gateway, select the March 2006 Human genome and click on "add custom tracks" or "manage custom tracks". Paste the text below in the "Paste data" box and press Submit. In the resulting page click "go to genome browser" to see your data.

browser position chr22:20100000-20100900
track name=coords description="Chromosome coordinates list" visibility=2 color=255,0,0
chr22   20100000 20100100
chr22   20100011 20100200
You should now see a new track with lines corresponding to the regions listed above.
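
When you have many regions, it is easier to generate the custom track text than to type it. The following Python sketch builds text of the same shape as the toy example; the function name and its parameters are made up for illustration, and only the "browser position" and "track" settings used in the toy example are produced.

```python
def make_custom_track(name, description, regions, position=None,
                      visibility=2, color=(255, 0, 0)):
    """Build custom track text: optional browser line, track line, then one line per region."""
    lines = []
    if position:
        lines.append(f"browser position {position}")
    rgb = ",".join(str(c) for c in color)
    # Track settings are separated by spaces; the colour is a comma-separated RGB triple
    lines.append(
        f'track name={name} description="{description}" '
        f"visibility={visibility} color={rgb}"
    )
    for chrom, start, end in regions:
        lines.append(f"{chrom}\t{start}\t{end}")
    return "\n".join(lines) + "\n"


# The two regions from the toy example
text = make_custom_track(
    "coords",
    "Chromosome coordinates list",
    [("chr22", 20100000, 20100100), ("chr22", 20100011, 20100200)],
    position="chr22:20100000-20100900",
)
print(text, end="")
```

Keep in mind that in the BED format the start coordinate is zero-based and the end coordinate is not included, whereas the position box in the browser is one-based; a region copied from the position box therefore needs its start reduced by one.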

Q26. What is the information that you would submit to highlight in a similar manner the first exon of the human factor 9 gene?

Show your results to a supervisor!!

TS/MDL March, 2013