Using BioConductor to analyse microarray data
This lab will take you through an example where R and BioConductor is
used to analyze microarray data. It is adapted from one of the examples in
the Users Guide for the BioConductor package LIMMA.
R is open source software and can be downloaded from
http://www.r-project.org/, but
should already be installed on your computers for this lab. R is a general
(statistical) data analysis tool, but it can be extended with "packages",
and several such packages in the area of bioinformatics exist. The
BioConductor project,
http://www.bioconductor.org, has produced several such packages, and
below, we will use the LIMMA package. Packages can in general be installed
using the "Packages" menu of the R window, but the LIMMA package should
already be installed on your computer for this lab.
- Download the example microarray data, from
http://bioinf.wehi.edu.au/marray/genstat2002/swirl.zip, and unzip it in some
suitable directory. The resulting .gal file contains information about the
spots printed on the arrays. The .spot files contain the output from the
Spot image analysis program used on the scanned images of the four
microarrays the example consists of. The .txt file gives a small
overview of the experiment.
The data was obtained from an experiment using
zebrafish as a
model organism to study the early development in vertebrates. Swirl is a
point mutant in the BMP2 gene that affects the dorsal/ventral body axis.
The main goal of the Swirl experiment is to identify genes with altered
expression in the Swirl mutant compared to wild-type zebrafish.
Look at the SwirlSample.txt file. It indicates that in two of the
experiments, the Cy3 dye has been used for the wild type sample, while
in the other two experiments, Cy5 has been used for the wild type. Can
you guess why one has swapped the dyes in this way?
- Start R (should be available in your menu of programs). Change the
directory of R (using File->Change dir...) to the directory where you
put the data. Activate the "limma" package with the command
library(limma)
- Read the filenames of of the files containing the intensity data
into the object "files", with the command
files <- dir(pattern="*.spot")
You can display the contents of (for example) the "files" object by
simply writing
files
You can read about (for example) the dir function using the command
help(dir)
- Read in the data to an object RG with
RG <- read.maimages(files, source="spot")
You can list the objects created so far with
ls()
The analysis below proceeds by applying functions to already created
objects, where the output of the functions is assigned to new objects
using the "<-" operator. You can choose to name the objects as you like,
as long as you use corresponding names in the following analysis.
By displaying RG, i.e., just writing RG, you see that it consists of several parts.
Writing for example
A <- RG$R
you store the matrix consisting of the red signal values from the
four microarrays in a new matrix A. You can find the mean of all these
values by writing
mean(A)
Compare this mean with the mean of the green signal values (given in
the G component of the RG object). Find the number of spots where the
red signal is greater than the green, and vice versa. (Hint: the command
sum(A<B) will count the number of times elements of matrix A are smaller
than elements of matrix B). What might be the reason for the difference?
- Add information about the genes to the RG object with the command
RG$genes <- readGAL()
where the readGAL function reads its information from the .gal file
in the current directory.
- Add information about the layout of the microarrays (i.e., which
genes are printed where, and in what order) to the RG object using
RG$printer <- getLayout(RG$genes)
Plot an image of the background values for the red intensities from
the first microarray:
imageplot(log2(RG$Rb[,1]), RG$printer)
Note: In R, the notation A[,1] picks out the first column of a matrix
A, just as A[1,] picks out the first row, and A[1,1] picks out the first
element. Note also that the log2() function takes the logarithm with
base 2.
Do you think the image indicates some sort of problem? Do all
the 4 microarrays have the same problem?
- Normalize the data, using loess normalization:
MA <- normalizeWithinArrays(RG)
Use the command
plotMA(MA)
to visualize your data after this operation. (Note: The command will
only visualize data from the first microarray; to visualize data from
for example the second, use the command plotMA(MA[,2]). )
In this plot, one has calculated, for each gene, a = (log2(r) +
log2(b))/2 amd m = log2(r) - log2(g), where r is the red signal
and g the green signal of the gene. The gene has then been plotted as a
dot, with a along the x axis and m along the y axis. In this plot, where
would genes that are differentially expressed appear, in general?
- Normalize the scale between the arrays:
MA <- normalizeBetweenArrays(MA)
This command makes sure the variability the data from each microarray
is the same. Can you see if this command affects the M component or the
A component of the MA object?
- Estimate all fold changes (i.e., how many times the expression level
of a gene has increased or decreased) by fitting a linear model. The
"design" indicates that the colors have been swiched between the types
of samples for every second array.
fit <- lmFit(MA, design=c(-1,1,-1,1))
- "Robustify" the estimation of the variances:
fit <- eBayes(fit)
This command computes the variance estimates used in the t-tests in a
more robust manner.
- Produce a list showing 30 genes most likely to be differentially
expressed. The "adjust" parameter indicates that the p-values produced
in the table control the "false discovery rate": If you for example pick
out 100 genes which all have fdr-adjusted p-values of less than 0.05,
this means that less than 5% of these genes are expected to be false
positives, i.e., that their apparent differential expression is just a
result of experimental noise.
options(digits=3); topTable(fit, n=30, adjust="fdr")
Try to interpret at least the A, M and P.value columns of the
table. How can such a list be used in further biological resarch?
- The normalization in point 7 adjusts the m-values of genes so that
the m-values are on average zero. Use the command
MAraw <- normalizeWithinArrays(RG, method="none")
to transform your data to the MA format without this adjustment. What
is the difference in the plot produced by plotMA? What is the difference
in the final list produced?
- Try to skip point 8 in the computations. How does this affect the
result?
The last three exercises have a more loose form; they suggest things
you might do to learn more about R and BioConductor:
- Use help to investigate some more options in the functions you have
used above.
- One of the most powerful features of tools like R and LIMMA is that
one can combine high-level powerful commands like those used above with
low-level simple commands where you can study and change the data in
simple direct ways. A simple way to access help on both R and its
packages is the command
help.start()
which opens a web browser with an overview of different resources.
Look at "An Introduction to R", and find some further functions of R
that might be useful when studying the data, for example statistical
functions.
- Use the help.start() function, and then click on the links "Packages",
and then "limma", "overview", and "LIMMA User's Guide". Here, you can
study a more extensive version of the example gone through above.
|