Microarray data analysis lab  

Using BioConductor to analyse microarray data

This lab will take you through an example where R and BioConductor is used to analyze microarray data. It is adapted from one of the examples in the Users Guide for the BioConductor package LIMMA.

R is open source software and can be downloaded from http://www.r-project.org/, but should already be installed on your computers for this lab. R is a general (statistical) data analysis tool, but it can be extended with "packages", and several such packages in the area of bioinformatics exist. The BioConductor project, http://www.bioconductor.org, has produced several such packages, and below, we will use the LIMMA package. Packages can in general be installed using the "Packages" menu of the R window, but the LIMMA package should already be installed on your computer for this lab. 

  1. Download the example microarray data, from  http://bioinf.wehi.edu.au/marray/genstat2002/swirl.zip, and unzip it in some suitable directory.  The resulting .gal file contains information about the spots printed on the arrays. The .spot files contain the output from the Spot image analysis program used on the scanned images of the four microarrays the example consists of. The .txt file gives a small overview of the experiment.

    The data was obtained from an experiment using zebrafish as a model organism to study the early development in vertebrates. Swirl is a point mutant in the BMP2 gene that affects the dorsal/ventral body axis. The main goal of the Swirl experiment is to identify genes with altered expression in the Swirl mutant compared to wild-type zebrafish.

    Look at the SwirlSample.txt file. It indicates that in two of the experiments, the Cy3 dye has been used for the wild type sample, while in the other two experiments, Cy5 has been used for the wild type. Can you guess why one has swapped the dyes in this way?

  2. Start R (should be available in your menu of programs). Change the directory of R (using File->Change dir...) to the directory where you put the data. Activate the "limma" package with the command

    library(limma)

     

  3. Read the filenames of of the files containing the intensity data into the object "files", with the command

    files <- dir(pattern="*.spot")   

    You can display the contents of (for example) the "files" object by simply writing

    files

    You can read about (for example) the dir function using the command

    help(dir)

     

  4. Read in the data to an object RG with

     RG <- read.maimages(files, source="spot")

    You can list the objects created so far with

    ls()

    The analysis below proceeds by applying functions to already created objects, where the output of the functions is assigned to new objects using the "<-" operator. You can choose to name the objects as you like, as long as you use corresponding names in the following analysis.

    By displaying RG, i.e., just writing RG, you see that it consists of several parts. Writing for example

    A <- RG$R

    you store the matrix consisting of the red signal values from the four microarrays in a new matrix A. You can find the mean of all these values by writing

    mean(A)

    Compare this mean with the mean of the green signal values (given in the G component of the RG object). Find the number of spots where the red signal is greater than the green, and vice versa. (Hint: the command sum(A<B) will count the number of times elements of matrix A are smaller than elements of matrix B). What might be the reason for the difference?

  5. Add information about the genes to the RG object with the command

    RG$genes <- readGAL()

    where the readGAL function reads its information from the .gal file in the current directory.

  6. Add information about the layout of the microarrays (i.e., which genes are printed where, and in what order) to the RG object using

    RG$printer <- getLayout(RG$genes)

    Plot an image of the background values for the red intensities from the first microarray:

    imageplot(log2(RG$Rb[,1]), RG$printer)

    Note: In R, the notation A[,1] picks out the first column of a matrix A, just as A[1,] picks out the first row, and A[1,1] picks out the first element. Note also that the log2() function takes the logarithm with base 2.

     Do you think the image indicates some sort of problem? Do all the 4 microarrays have the same problem?
     

  7. Normalize the data, using loess normalization:

    MA <- normalizeWithinArrays(RG)

    Use the command

    plotMA(MA)

    to visualize your data after this operation. (Note: The command will only visualize data from the first microarray; to visualize data from for example the second, use the command plotMA(MA[,2]). )

    In this plot, one has calculated, for each gene, a = (log2(r) + log2(b))/2 amd m = log2(r) - log2(g),  where r is the red signal and g the green signal of the gene. The gene has then been plotted as a dot, with a along the x axis and m along the y axis. In this plot, where would genes that are differentially expressed appear, in general?

  8. Normalize the scale between the arrays:

    MA <- normalizeBetweenArrays(MA)

    This command makes sure the variability the data from each microarray is the same. Can you see if this command affects the M component or the A component of the MA object?

  9. Estimate all fold changes (i.e., how many times the expression level of a gene has increased or decreased) by fitting a linear model. The "design" indicates that the colors have been swiched between the types of samples for every second array.

    fit <- lmFit(MA, design=c(-1,1,-1,1)) 

     

  10. "Robustify" the estimation of the variances:

    fit <- eBayes(fit)

    This command computes the variance estimates used in the t-tests in a more robust manner.

  11. Produce a list showing 30 genes most likely to be differentially expressed. The "adjust" parameter indicates that the p-values produced in the table control the "false discovery rate": If you for example pick out 100 genes which all have fdr-adjusted p-values of less than 0.05, this means that less than 5% of these genes are expected to be false positives, i.e., that their apparent differential expression is just a result of experimental noise. 

    options(digits=3); topTable(fit, n=30, adjust="fdr")

    Try to interpret at least the A, M and P.value  columns of the table. How can such a list be used in further biological resarch?

  12. The normalization in point 7 adjusts the m-values of genes so that the m-values are on average zero. Use the command

    MAraw <- normalizeWithinArrays(RG, method="none")

    to transform your data to the MA format without this adjustment. What is the difference in the plot produced by plotMA? What is the difference in the final list produced?

  13. Try to skip point 8 in the computations. How does this affect the result?

The last three exercises have a more loose form; they suggest things you might do to learn more about R and BioConductor:

  1. Use help to investigate some more options in the functions you have used above.
  2. One of the most powerful features of tools like R and LIMMA is that one can combine high-level powerful commands like those used above with low-level simple commands where you can study and change the data in simple direct ways. A simple way to access help on both R and its packages is the command

    help.start()

    which opens a web browser with an overview of different resources. Look at "An Introduction to R", and find some further functions of R that might be useful when studying the data, for example statistical functions. 

  3. Use the help.start() function, and then click on the links "Packages", and then "limma", "overview", and "LIMMA User's Guide". Here, you can study a more extensive version of the example gone through above.