Brain SPORE Blog

July 2008 Archives

This is a follow-up to the previous post, with the full text of the article, which sometimes appears to be blocked by a subscription requirement. Here it is:
-----------------------------------------------------------

NCI to Roll Out Cancer Molecular Analysis Portal to Integrate Oncology Data from TCGA

[July 18, 2008; from bioinform.com]

By Vivien Marx

In an effort to broaden access to complex oncology data sets, the National Cancer Institute is preparing to unveil a new resource called the Cancer Molecular Analysis portal, which will integrate large, disparate genomics data sets from the Cancer Genome Atlas project and other cancer genomics studies.

The CMA portal is scheduled to launch next week with about a terabyte of brain cancer data from TCGA. TCGA's ovarian and lung cancer datasets are the next scheduled to arrive, with "a continuous flow of datasets into the CMA Portal over the next few months," said Subha Madhavan, associate director of life sciences informatics in the NCI's Center for Biomedical Informatics and Information Technology.

The second data set to be loaded into the CMA portal will be brain glioma data from approximately 500 patients from the so-called Rembrandt study headed by the NCI's Neuro-Oncology branch. "The reason for lining this up right after TCGA is to enable comparisons and correlations between two brain tumor datasets and neuro-oncologists will benefit from two large-scale comprehensive studies in one portal," said Madhavan, who is managing the CMA portal project.

Other NCI-supported projects will be imported into the portal as the data become available and as the data sharing policies are worked out for those studies. Two examples are the Target study, a childhood cancer initiative to catalog genomic changes in high-risk acute lymphoblastic leukemia and neuroblastoma; and CGEMS, or the Cancer Genetic Markers of Susceptibility study, an initiative to identify genetic alterations that make people susceptible to prostate and breast cancer.

Madhavan expects Rembrandt data to be available via the CMA portal by the end of this year. The timing for Target and CGEMS in the CMA portal has yet to be determined, as the data must first become available and the sharing policies worked out, she said.

The portal, part of the NCI's Cancer Biomedical Informatics Grid project, is expected to enable researchers to integrate, visualize, and explore clinical and genomic characterization data, said Madhavan.

The initial version of the CMA portal will include genomics data from more than 200 patients suffering from glioblastoma multiforme, along with diagnosis information, treatment history, pathology status, the site of the tumor, and background on the patients’ surgery, said Madhavan.

Genome characterizations available through the portal will encompass sequence data, gene expression studies, copy number and SNP analysis, methylation studies, and miRNA expression data.

Kenneth Aldape, associate professor in MD Anderson Cancer Center's department of pathology, told BioInform in an e-mail that he expects this data to be of great value, particularly because "for each tumor sample, multiple platforms have been used to profile the cancer genome."

As a result, he said, "for the first time, we can integrate data from changes in the cancer cell on the DNA, RNA, and epigenetic levels. Insights gained from this integration of data will most likely lead to new ways that we can understand the molecular pathogenesis of glioblastoma."

Navigating the portal, users can view and access mutation profiles from tumor samples in reference to the human genome, mine clinical characteristics such as survival data and tumor staging, and correlate that with mutation and genome characterization results using a number of analytical tools.

While it is true that scientists can currently accomplish these tasks using other resources, Madhavan noted that there is currently no single integrated source for this information. "Look at the number of databases one would need to access," she said, citing clinical information, the metastatic status of patient tumors, tissue annotation, and expression data as examples.

"A lot of these tools and databases are geared toward sophisticated statisticians and analysts who know how to handle these tools, but the goal for CMA is to put [them] in the hands of the decision-makers, the physician-scientists," said Madhavan.

The portal is designed to let these end users work with the data without expert assistance, she said, using caBIG software functionality to help scientists find the type of datasets they need, from TCGA and elsewhere. "They can put in a gene name and it will bring back a Kaplan-Meier survival chart."
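For readers unfamiliar with the Kaplan-Meier chart mentioned above, here is a minimal sketch of how such a survival curve can be estimated from follow-up times. It is an illustration only, not the portal's code: the function name `kaplan_meier` and the sample data are made up for this example.

```python
from collections import Counter


def kaplan_meier(times, events):
    """Return [(time, survival_probability)] at each time with a death.

    times  : follow-up time per patient (for example, in months)
    events : 1 if a death was observed at that time, 0 if censored
    """
    deaths = Counter(t for t, e in zip(times, events) if e)
    leaving = Counter(times)  # every patient leaves the risk set at their time
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(leaving):
        d = deaths.get(t, 0)
        if d:
            # Multiply by the fraction of at-risk patients surviving time t.
            survival *= 1 - d / at_risk
            curve.append((t, survival))
        at_risk -= leaving[t]
    return curve


# Hypothetical follow-up data, not taken from TCGA.
times = [2, 3, 3, 5, 8]
events = [1, 1, 0, 1, 0]
curve = kaplan_meier(times, events)
```

In a gene-based query, the patients would first be split into groups (for example, high versus low expression of the gene) and one curve would be drawn per group.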

The first data set in the CMA portal is from the TCGA Pilot project, which is run jointly by the NCI and the National Human Genome Research Institute and aims to assess the feasibility of characterizing all human cancers by starting with three cancer types: brain, lung, and ovarian cancer. To date, TCGA has been making data available to the research community through a data portal launched last year.

The TCGA data portal, set up by the project's Data Coordinating Center, supports the program’s immediate data release policy. "This is a simple FTP site wrapped into a web site," said Madhavan, explaining that this site provides access to raw archives that the TCGA centers submitted.

The Cancer Molecular Analysis portal, however, is slated to become a comprehensive site that will include TCGA data and other data, too. "It presents analysis, summaries and allows users to link clinical data and genomic data, which is not possible in the [TCGA portal's] FTP wrapper," said Madhavan.

Bulk download of TCGA data will be possible through the TCGA portal, whereas the CMA portal will provide analysis and data visualization capabilities under one site, said Madhavan.

"Some of the vision is to provide a unified view across multiple studies, so people can not only drill deeper into one study but they can cross-correlate and compare data across studies," she said.

Building in User Needs

The CMA portal offers researchers several data views: a "gene view" to analyze expression, copy number, SNP, and pathway data; a "genome view" to look at entire chromosomal regions; a "clinical view," which includes Kaplan-Meier survival plots and other data of clinical interest; and analytical tools such as GenePattern, a software platform developed by the Broad Institute that combines workflow with dozens of computational and visualization tools, or the Cancer Genome Workbench, developed by the NCI as a computational platform to integrate clinical tumor mutation profiles with the reference human genome.

Madhavan said that a key goal for the project was to maintain a user focus. "If your tools are not easy to use, you don't get adoption, and these clinician-scientists are so busy that you don't want these tools to have such a steep learning curve," she said.

"I think it will help my work and others in the field," said Herbert Newton, director of the division of neuro-oncology at Ohio State University Medical Center & James Cancer Hospital.

Newton told BioInform via e-mail that he believes the portal and this kind of data integration "will become more and more valuable as we make further progress with translational programs to develop molecular-based treatments."

In particular, the glioblastoma multiforme data set "will be very helpful to neuro-oncology researchers working on molecular aspects of high-grade gliomas," he said. Although there is much information available on the topic, "this will be a much broader effort for characterization of these genes, with a very large and ambitious set of genes to analyze."

Newton said he expects the TCGA glioblastoma multiforme data set to "eventually become the 'gold standard' for molecular characterization and analysis of GBM."

Madhavan said that an important source of input was a use case workshop for the TCGA data portal held in January, which brought together bench researchers, clinicians, statisticians, and computer scientists who jointly defined how the portal should be configured to house TCGA data. The participants were eager both to build up technology and to take down barriers between clinical and research disciplines, said Madhavan. "It's absolutely amazing to see what these groups can do when you put them in one room. They don't talk to each other every day."

The CMA data can be explored online with runtime analysis tools that are part of the portal, but it can also be downloaded for downstream analysis by biostatisticians. "Users can go in and select the data types and patients of interest for easy bulk download of data along with clinical and tissue annotations," said Madhavan.

To obtain that functionality, the NCI team partnered with a number of external researchers, including Peter Park, a bioinformaticist at Boston's Children's Hospital Informatics Program and at the Harvard-MIT Division of Health Sciences and Technology who is also on the faculty of Harvard Medical School, to understand how the community will want to access that data and to create ways to let them do so.

Madhavan noted that Park was "very passionate" about how researchers will want to "slice and dice" these datasets, such as according to clinical parameters like tissue quality, in order to prepare the data for further analysis with tools of their choice.

Working the Matrix

The NCI developers worked with colleagues from Lawrence Berkeley National Laboratory, Stanford University, MD Anderson, and the University of North Carolina to create a "data access matrix," which offers users access to different "levels" of data and is "a key functionality of the CMA portal," she said. This group also became the portal's beta testers.

As Madhavan explained, "level 1" data is anything that comes out of a machine, such as probe-level data in the case of an Affymetrix array. "Level 2" data in that example would be CHP files with information normalized within a given sample, while "level 3" would be segmented data and "level 4" would comprise genomic regions of interest.

For the matrix, the team sought to clearly indicate to portal users what level of data they are downloading, she said. Scientists seeking to do their own analysis will want mainly raw data, such as what is found in levels 1 and 2, while others may want only processed information.

"The data matrix simply allows one to select sections of the data more easily and reduces the time and effort necessary to obtain the data in a usable format," Park told BioInform via e-mail. In the case of copy number data, for example, "level 1 is the raw log-ratios, level 2 is normalized log-ratios, level 3 is segmented profiles, and level 4 is the regions called significant aberrations," he said.

"For instance, a bioinformatician interested in every step of the analysis may want to download the raw data, but clinicians might want data at the level of genes," said Park. One researcher might want to study expression levels and matched methylation levels for patients with poor survival rates, while another may want to study copy number and expression in another group of patients, he said.
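The idea of picking data by type, level, and patient can be sketched in a few lines. This is only a rough illustration of the concept, assuming a simple in-memory catalog; the names `DataLevel`, `CATALOG`, and `select_files`, as well as the patient IDs and file names, are invented here and do not come from the portal.

```python
from enum import IntEnum


class DataLevel(IntEnum):
    """The four levels as Park describes them for copy number data."""
    RAW = 1         # raw log-ratios
    NORMALIZED = 2  # normalized log-ratios
    SEGMENTED = 3   # segmented profiles
    REGIONS = 4     # regions called significant aberrations


# Hypothetical catalog entries; patient IDs and file names are invented.
CATALOG = [
    {"patient": "P01", "data_type": "copy_number",
     "level": DataLevel.RAW, "file": "P01_cn_level1.txt"},
    {"patient": "P01", "data_type": "copy_number",
     "level": DataLevel.SEGMENTED, "file": "P01_cn_level3.txt"},
    {"patient": "P02", "data_type": "copy_number",
     "level": DataLevel.REGIONS, "file": "P02_cn_level4.txt"},
    {"patient": "P02", "data_type": "expression",
     "level": DataLevel.NORMALIZED, "file": "P02_expr_level2.txt"},
]


def select_files(catalog, data_type, levels, patients=None):
    """Return file names matching a data type, a set of levels, and
    optionally a set of patient IDs (None means all patients)."""
    wanted = set(levels)
    return [
        entry["file"]
        for entry in catalog
        if entry["data_type"] == data_type
        and entry["level"] in wanted
        and (patients is None or entry["patient"] in patients)
    ]


# A bioinformatician who wants to redo the analysis asks for levels 1 and 2.
raw_files = select_files(
    CATALOG, "copy_number", [DataLevel.RAW, DataLevel.NORMALIZED])

# Someone who only needs processed results for one patient asks for 3 and 4.
processed_files = select_files(
    CATALOG, "copy_number",
    [DataLevel.SEGMENTED, DataLevel.REGIONS], patients={"P02"})
```

Making the level an explicit field is what lets a user see, before downloading, whether a file is raw or processed.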

An important goal for the portal, Park said, was to reduce the time that it currently takes to download public data sets and format them for analysis. "Most available data sets are poorly annotated and much effort is required by users to link different parts of the data," he said.

Another issue is reproducibility. "In general, it is nearly impossible to replicate a result described in a paper by downloading the data and following the description given by the authors, especially when the data are complex." The data matrix approach "is attempting to make this a bit more friendly," he said.

Another aspect that the CMA developers considered was patient privacy. The TCGA project defined its own patient-protection policies, and "our job on the CMA portal was to implement those patient privacy protection policies to help ensure that we are protecting the research participants in a manner that is consistent with HIPAA as well as their consent forms," said Madhavan.

However, as the portal expands to include data from other projects, it will likely encounter a range of different access models. "One has to think carefully about how this data will be shared," said R. Mark Adams in a presentation outlining the Cancer Molecular Analysis portal at last month’s caBIG annual meeting in Washington, DC.

Adams, a senior associate at caBIG contractor Booz Allen Hamilton, added that grappling with privacy issues "can be as challenging or more challenging than informatics or technical issues." The problem, he said, is "coming up with ways that we can safely provide widespread access to the data to the widest range of researchers in keeping with protecting the participants."

CMA handled this challenge by using a tiered approach. One tier is open-access data, such as gene expression profiles, which are publicly available to users without a log-in, Madhavan said, adding that this information "cannot be aggregated to generate data that is unique to an individual."

The portal also includes a controlled-access data tier, which contains clinical data and individually unique information and requires user certification for data access.
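The two tiers described above amount to a simple access rule, which might be sketched as follows. This is a toy illustration under assumptions of my own: the mapping `TIER_BY_DATA_TYPE`, the `certified` flag, and the choice to treat unknown data types as controlled are all hypothetical, not the portal's actual logic.

```python
OPEN = "open"
CONTROLLED = "controlled"

# Hypothetical tier assignments based on the two examples in the article.
TIER_BY_DATA_TYPE = {
    "gene_expression": OPEN,   # public, no log-in needed
    "clinical": CONTROLLED,    # requires user certification
}


def can_access(data_type, user=None):
    """Return True if the user may see this data type.

    Unknown data types are treated as controlled (an assumption made
    here so that the sketch fails closed rather than open).
    """
    tier = TIER_BY_DATA_TYPE.get(data_type, CONTROLLED)
    if tier == OPEN:
        return True
    return bool(user and user.get("certified"))
```

For example, `can_access("gene_expression")` is true for an anonymous visitor, while `can_access("clinical")` is true only when a certified user record is passed in.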

For small research labs and community-based cancer centers with only a small number of samples, researchers might use the CMA Portal to increase the statistical power of an analysis, said Madhavan, adding that everyone benefits from the portal's "instantaneous data release" policy.

"These projects are putting out these data sets in a publicly accessible way even before the publication has come out," she said. "This is why we are getting interest from outside the TCGA group," she said.

As Madhavan explained, the goal of the CMA portal is "to lower the barrier to entry to the portal by making open-tier datasets available to users in an easily usable fashion." As datasets are prepared for the portal, the access policy will need to be tailored to the dataset. For example, Target is a childhood cancer initiative to catalog genomic changes in certain types of pediatric cancers.

"Such an implementation [the open-access tier] may not readily work for Target, where children are involved and the patient privacy concerns are heightened. Hence, we may have to make some changes to the CMA portal software to implement the data release policies of the Target project," said Madhavan.

Powered by an Integrator

The CMA Portal is powered by caBIG's caIntegrator module, which had only been applied to smaller studies prior to the portal project. As a result, Madhavan said, the team's first task was to see if the 1 terabyte dataset could even be loaded into it.

Madhavan said that at the "heart" of caIntegrator is CGOM, or the clinical genomics object model, which is caBIG's standard representation for clinical and genomic findings and the annotations that go along with them.

"There is also a real-time analytic engine that provides this on-the-fly computational analysis," she said. Users can select patient cohorts with certain criteria and punt that over to any of dozens of analytic tools, such as GenePattern.

This semantic interoperability is expected to save researchers time, Madhavan said. For example, if a scientist wants to correlate overall survival in patients with a mutation rate in a particular gene, that would require "a lot of semantic connectivity between mutation data and clinical information, [so] that is what we spent most of the time on … figuring out the semantic touch-points between these different data types."

Adams said in his caBIG talk that an important goal of the portal is to make the data accessible to researchers in a user-friendly, integrated format. "Often the insights in this information are hidden in terms of finding how to correlate the multiple subsets of information," he said.

Quoting part of a wish list by Daniela Gerhard, the NCI's director of the office of cancer genomics, Adams said that the CMA portal is envisioned as a way to make this data accessible, and not by saying, "'Go to the FTP site and knock yourself out.'"

Visit the new CMA Portal here.

Cancer Molecular Analysis Portal


BioInform: NCI to Roll Out Cancer Molecular Analysis Portal to Integrate Oncology Data from TCGA

A little snippet appeared in one of the subscription bioinformatics newsletters, suggesting the imminent rollout of a new portal for TCGA data. If anyone has a subscription, please post more information, or send me some text from the full post (obogler@mdanderson.org).

The Cancer Genome Atlas (TCGA) is an effort by the NCI to characterize cancers in depth, at the genomic level. One of the cancers in the TCGA is glioblastoma.

Even more interesting, in the context of the SPORE, is that NCI sees a strong connection between the TCGA and the SPOREs. One aspect of this is that the SPOREs are acting as a source of tissue samples - our group, under the leadership of Dr. Ken Aldape in Pathology, was the first center to provide glioblastoma tissues to the TCGA.

TCGA is analyzing tumors for:
- Broad Institute of MIT and Harvard, Cambridge, Mass.
Using the Affymetrix platform, this center will identify changes in expression and copy number alterations that occur in cancer.
- Harvard Medical School and Brigham and Women's Hospital, Boston, Mass.
Using the Agilent platform, this center will characterize tumor samples for alterations in chromosome segment copy number. This center will also develop new technologies to analyze expression profiles.
- Lawrence Berkeley National Laboratory, Berkeley, Calif.
Using an Affymetrix Exon 1.0 array platform, this center will identify changes in the transcription profiles that occur in cancer.
- Memorial Sloan-Kettering Cancer Center, New York, N.Y.
Using Agilent arrays, this center will provide characterization of chromosome segment gains and losses. This center will also develop new approaches to detect novel genetic rearrangements.
- The Sidney Kimmel Comprehensive Cancer Center at Johns Hopkins University, Baltimore, Md.
This is a joint project with the University of Southern California/Norris Comprehensive Cancer Center, which will use the Illumina GoldenGate Genotyping platform to detect changes in methylation profiles associated with transcribed genes in cancer samples.
- Stanford University School of Medicine, Palo Alto, Calif.
Using the Illumina HumanHap550 Genotyping BeadChip, this project will identify chromosome segment copy number variation found in cancer.
- University of North Carolina Lineberger Comprehensive Cancer Center, Chapel Hill, N.C.
Using an Agilent array platform, this center will identify changes in the transcription profiles that occur in cancer.

Join Us
Now the SPOREs are thinking about how to digest the data emerging from TCGA and designing the right back end studies. As part of that effort we hold TCGA/SPORE meetings every second Friday at 3pm in FC7.3035 - all welcome.

TCGA Knowledge Base
In today's meeting we discussed the state of the data available and how to translate specific biological questions into queries that can be applied to the current data. The new TCGA Knowledge Base was introduced. The TCGA-KB is an interface to access the raw TCGA data (e.g. CEL files) and the goal is to eventually add normalized data. Eventually a query interface will be added to the site.

Genboree
This is a group that is managing the sequencing effort for NCI; see www.genboree.org. We will attempt to make contact to get access.

The University of Texas M.D. Anderson Cancer Center SPORE in Brain Cancer, together with the M.D. Anderson Cancer Center Brain Tumor Program (BTP), is currently announcing its solicitation for applications to be funded as developmental research projects.

Awards will be made for amounts up to $50,000 and funded for one year. Carry-over for an additional year of funding is possible, depending on the project progress. Electronic submissions are required. All funded investigators will be required to present a summary of their research data at the UT M.D. Anderson BTP Seminar Series, and summary reports will be required at the end of each year of funding, or when required by SPORE progress reports.

All submitted proposals that meet the stated requirements will be sent through scientific evaluation by internal and/or external peer review. Based on recommendations from the NIH, as well as our IAB and EAB, projects that expand or otherwise enhance the current full SPORE projects will receive high priority. Projects will be listed at the SPORE website: http://www.mdanderson.org/brainspore.

Recommendation for funding will be based on the following criteria: 1) Scientific merit 2) Technical soundness 3) Innovation 4) Potential for translation to or from clinical application 5) Multidisciplinary interaction 6) Appropriateness of the proposed budget and resources 7) Potential impact on reducing brain tumor morbidity and mortality; potential impact on improving quality of life for brain tumor patients. Use of human tissues is required, and translational goals are to be clearly stated.

All applicants must hold either an M.D. or Ph.D. degree or equivalent and hold a full-time position at M.D. Anderson Cancer Center, Baylor College of Medicine, UTMB Galveston, or UT Health Science Center at Houston-School of Medicine. Please contact Theresa Willis, SPORE Administrator, at twillis@mdanderson.org or (713) 794-1419 if you have any questions.

APPLICATION DEADLINE: July 18, 2008 – 5:00 p.m.

APPLICATION FORMAT:
Applicants are asked to submit a cover page (see below), a five-page research proposal, a budget and budget justification (does not have to be detailed - one paragraph will suffice), a three-page biographical sketch in NIH format, Other Support in NIH format, a description of resources and study environment, and involvement and consideration of human subjects and vertebrate animals. The five-page research proposal must contain the following information:
1. Title
2. Hypothesis to be tested
3. Specific aims
4. Gaps in knowledge to be addressed
5. Background and translational significance
6. Plan of research
7. Literature cited (outside the 5-page limit)
Junior investigators without significant independent research experience are asked to also provide a mentorship plan including a statement of commitment from the senior mentor.

Copies of an approved Fred Checklist and all protocols (animal and human use) must be provided within 30 days of notice of award, or the developmental funds will be awarded to the next in line.

FUNDING LEVEL: up to $50,000 for one year
For more information and forms please contact twillis@mdanderson.org