Just Enough, Just in Time

From raw text and databases to customized information delivery at Biogen Idec

Business intelligence has existed at least since the dark ages, when the local pub was the best place to collect information that could secure a business advantage. From the guilds and mercantile buyers and sellers through to today, gathering business information and organizing it into databases has gone hand in hand with better news collection and distribution (and now an information glut). We are currently in a golden age of good-to-excellent business information databases and news services that cater to most business information needs.

This golden age of business-oriented databases and news services has created the need for two new capabilities to enhance business intelligence. First, the overwhelming flood of information, even summarized and collated, has created the need for data mining of business information to extract the nuggets needed for a particular business user. Second, the databases we use are not customized to particular needs, and multiple databases of information are often required for a comprehensive view of a market need.

The focus of this article is efficient aggregation and analysis of information from multiple databases into customized deliverables. To avoid staff growth in line with the number of customized deliverables, we need to rely upon the latest in information aggregation, workflow technologies, and text/data mining to make customized information delivery an efficient, scalable, and manageable process.

As an analogy, we could look at the raw text of news and the scientific literature as raw ore to process and the various structured databases of Pipeline and Deal information as recyclable metals. We need a process to take the ore extraction and combine it with recycled metals into a single product serving the needs of our customers.

Our example of a business intelligence need is in competitive intelligence for bioPharma. We have several good pipeline databases (PharmaProjects, TrialTrove, IDDB) and news services (Factiva, NewsDesk, Google News) supplying information to the bioPharma space, but none of the pipeline databases is comprehensive, and the news services are hard to tie directly into the pipeline database information. We will look at approaches to aggregate, filter and deliver a customized view of information for specific bioPharma business needs using the latest Web-based workflow technologies. The approach used will be easy to customize for a variety of departments and new purposes. Of course, we will obscure the specifics of what we are doing due to proprietary information concerns, but the examples will be easy enough to review and apply to the reader’s needs.

Tools

In order to collect, integrate, analyze and deliver customized streams of information to our business development customers, we employ a variety of tools that need to be meshed together. These tools need to be maintainable and re-deployable for new custom requests. The goal of our approach was not to develop any more internal software than necessary. A bioPharma company needs to focus on producing drugs, not large software projects. We prefer our internal focus to be on integration with a minimum of development work to link best-of-breed tools available from the software services sector.

Collect:

The techniques for collecting information from a multitude of heterogeneous sources range from simple SQL queries to the use of involved Web extraction protocols. Search engines and databases are queried to collect data and unstructured text. For continuous information streams, we use RSS/Atom as our information stream protocol of choice. In cases where RSS does not exist for an information stream, we either use Web extraction technologies (screen scraping) or an application to parse email into individual items of news or alert information. These extracted items are then incorporated into a very simple news database with RSS capabilities (built from Drupal, http://www.drupal.org/). Information and database providers license their content in a variety of ways and under copyright law, so be sure to review your rights and licenses to the sources you will be using.
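As a rough illustration of the collection step, the sketch below parses an RSS 2.0 document into a simple table of rows. It is not the Drupal or InforSense implementation described here; the sample feed, the example.com URLs and the `parse_rss` helper are hypothetical and exist only to show the idea using Python's standard library.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed; real feeds would be fetched from the providers.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example pipeline news</title>
    <item>
      <title>Phase II trial results announced</title>
      <link>http://example.com/news/1</link>
      <description>Hypothetical item used only for illustration.</description>
      <pubDate>Mon, 02 Oct 2006 09:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Company charity event raises funds</title>
      <link>http://example.com/news/2</link>
      <description>Another hypothetical item.</description>
      <pubDate>Tue, 03 Oct 2006 09:00:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""


def parse_rss(xml_text):
    """Return one dict per <item> with Title, Description, Date and URL."""
    root = ET.fromstring(xml_text)
    rows = []
    for item in root.iter("item"):
        rows.append({
            "Title": item.findtext("title", default=""),
            "Description": item.findtext("description", default=""),
            "Date": item.findtext("pubDate", default=""),
            "URL": item.findtext("link", default=""),
        })
    return rows


rows = parse_rss(SAMPLE_FEED)
for row in rows:
    print(row["Date"], "|", row["Title"])
```

Items recovered by screen scraping or email parsing could be normalized into the same four fields so that later steps treat every source alike.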

Analyze:

The precision of the information delivered can be improved through sophisticated filtering or text categorization approaches. We can also utilize text mining tools such as Linguamatics to extract specific relationships between entities such as companies, diseases, adverse events or drugs.

Deliver:

Newsgator’s Enterprise Server (http://www.newsgator.com/) serves as our alert management interface for our customers. Aggregated and filtered RSS feeds are delivered by the Newsgator application for distribution to our customers. The product also lets us collaboratively monitor a number of news and information streams for competitive intelligence or research purposes by sharing the RSS streams amongst team members and publishing important items to a group RSS feed.

Results extracted from the information streams are delivered in tabular form for Web-enabled databases, an area that is overdue for improvement. While Excel is still the most-used database technology in any company, it is not our database of choice. Unfortunately, developing Web-enabled databases remains difficult and time-consuming. Technologies like DabbleDb (http://www.dabbledb.com/) show promise as lightweight online databases that are easy to use and reconfigure.

Integrate:

The InforSense Platform (http://www.inforsense.com/) application serves as the glue that ties these various capabilities together. If Yahoo Pipes (http://pipes.yahoo.com/pipes) brought global attention to drag-and-drop programming for data workflows, InforSense provides that functionality and a great deal more, including text analytics and data mining. We can extract and collect data directly from databases, Web pages, documents, or RSS feeds through InforSense. Once collected, the data is pre-processed into a clean, properly formatted dataset for analysis. The workflows (Figure 1) are easier to understand than programs with similar functionality written in VisualBasic or Perl. The workflow itself can serve as documentation of the business rules for the application.

Deliverables

An example of a customer need is a news feed that incorporates a wide variety of news and information sources, but filters the results using strict and easily alterable rules. We capture a range of sources from government news, industry-specific news, general news streams, patents, and academic literature—such a wide range that no single service provides overall coverage. All of these feeds need to be aggregated and filtered into one high-precision news feed that is accessible via an RSS news reader and email.
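The aggregation described above might be sketched as follows, assuming each source has already been reduced to a list of items with `Title`, `Date` and `URL` fields. The `merge_feeds` helper, the feed names and the example.com URLs are hypothetical. Duplicates are detected by URL only, which is a simplifying assumption.

```python
from email.utils import parsedate_to_datetime


def merge_feeds(*feeds):
    """Combine several item lists, dropping repeated URLs, newest first."""
    seen = set()
    merged = []
    for feed in feeds:
        for item in feed:
            if item["URL"] in seen:
                continue  # the same story arrived through another source
            seen.add(item["URL"])
            merged.append(item)
    # RSS 2.0 dates use the RFC 822 style, which email.utils can parse.
    merged.sort(key=lambda i: parsedate_to_datetime(i["Date"]), reverse=True)
    return merged


# Hypothetical items standing in for two of the source categories.
government_feed = [
    {"Title": "Agency notice", "Date": "Mon, 02 Oct 2006 09:00:00 GMT",
     "URL": "http://example.com/gov/1"},
]
industry_feed = [
    {"Title": "Industry report", "Date": "Wed, 04 Oct 2006 09:00:00 GMT",
     "URL": "http://example.com/industry/1"},
    {"Title": "Agency notice (syndicated)", "Date": "Mon, 02 Oct 2006 09:00:00 GMT",
     "URL": "http://example.com/gov/1"},
]

merged = merge_feeds(government_feed, industry_feed)
for item in merged:
    print(item["Date"], "|", item["Title"])
```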

Using the RSS feeds collected in our Newsgator server for a particular customer as the starting point, we built a workflow (Figure 1). This workflow takes RSS feeds as input and parses them into a table structure to manage the content. The original stories referenced from the RSS items are then collected (at the DownloadWebPages node). There are two filter steps. The first removes items that match ‘stopwords’ such as ‘charity event’ in the RSS item title, the RSS item description or the original Web page. The second filter step performs a positive selection to collect any Web page content that references the disease indications of interest. Additional filter steps are easy to add to this workflow.
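The two filter steps can be approximated in a few lines of Python. This is only a sketch of the logic, not the InforSense workflow itself. It assumes the page text has already been downloaded into a hypothetical `PageText` field, and the indication term (`asthma`) is a placeholder rather than an actual indication of interest. Matching is plain case-insensitive substring search.

```python
STOPWORDS = ["charity event"]   # negative terms; example taken from the text
INDICATIONS = ["asthma"]        # placeholder disease indication


def contains_any(text, phrases):
    """True if any phrase occurs in text, ignoring case."""
    lowered = text.lower()
    return any(phrase.lower() in lowered for phrase in phrases)


def filter_items(items, stopwords, indications):
    kept = []
    for item in items:
        fields = (item["Title"], item["Description"], item["PageText"])
        # Step 1: drop the item if a stopword appears in any of the fields.
        if any(contains_any(field, stopwords) for field in fields):
            continue
        # Step 2: keep it only if the page content mentions an indication.
        if contains_any(item["PageText"], indications):
            kept.append(item)
    return kept


# Hypothetical rows, as they might look after the page download step.
items = [
    {"Title": "Trial update", "Description": "New data",
     "PageText": "Results in asthma patients were reported."},
    {"Title": "Charity event for asthma research", "Description": "Fundraiser",
     "PageText": "The asthma foundation hosted a gala."},
    {"Title": "Quarterly earnings", "Description": "Financial results",
     "PageText": "Revenue rose compared with last year."},
]

kept = filter_items(items, STOPWORDS, INDICATIONS)
print([item["Title"] for item in kept])
```

Keeping the term lists as data rather than code is one way to make the rules easily alterable: adding a stopword or an indication is an edit to a list.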

The portion of the Figure 1 workflow from its beginning through the DownloadWebPages node was copied and redeployed as a new data processing node. A set of RSS feeds is entered into this data processing stream and fed into the RSSread node. The resulting table of Web page content and metadata from the RSS items (Title, Description, Date, URL) is delivered to the Linguamatics I2E build index and query index nodes for automated text extraction. An example extraction from this information stream could be something like “Drug Targets and Related Adverse Events.”

These workflows are easily deployed and can be run on a scheduled basis. The generated RSS feeds are fed back to our NewsGator server. The extracted information is fed to an Excel spreadsheet but will hopefully be loaded into an online database by the time this article is published.
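To illustrate the last step, the sketch below serializes filtered items back into an RSS 2.0 document that a feed server could subscribe to. The `build_rss` helper, the channel details and the sample item are hypothetical, and how the resulting file is published or scheduled is left out.

```python
import xml.etree.ElementTree as ET


def build_rss(channel_title, channel_link, items):
    """Serialize item dicts into a minimal RSS 2.0 document (as a string)."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = channel_title
    ET.SubElement(channel, "link").text = channel_link
    ET.SubElement(channel, "description").text = "Filtered items"
    for item in items:
        node = ET.SubElement(channel, "item")
        ET.SubElement(node, "title").text = item["Title"]
        ET.SubElement(node, "link").text = item["URL"]
        ET.SubElement(node, "description").text = item["Description"]
        ET.SubElement(node, "pubDate").text = item["Date"]
    return ET.tostring(rss, encoding="unicode")


# Hypothetical filtered item, using the metadata fields named in the text.
filtered = [
    {"Title": "Trial update", "Description": "New data",
     "Date": "Mon, 02 Oct 2006 09:00:00 GMT",
     "URL": "http://example.com/news/1"},
]

feed_xml = build_rss("Filtered feed", "http://example.com/feed", filtered)
print(feed_xml)
```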

Future

Next steps are to replicate our first deliverables for other customers by cloning our current workflows and making the appropriate changes to suit new customer needs. We still need better Web-enabled databases and more ‘connectors’ to additional databases such as the Clinical Trials and NIH Grants Database (CRISP). We will be investigating the variety of visualization technologies available for business intelligence analyses (e.g., “The New Data Visualization,” Business Intelligence Review, Oct 2006) to improve the communication of our results during the next steps of this ongoing project.

Conclusion

With the wealth of information available in bioPharma and the incredible databases supporting widespread business needs, it is a challenge just to cover the basics of intelligence gathering. Covering the basics does not make for a competitive advantage. With approximately one trillion dollars spent on biomedical research during the last fifteen years (Moses et al., JAMA 294(11), 2005) and the resulting literature and data from that expenditure, we need to be much more diligent in using the variety of information resources to extract full value from these information ‘ore deposits.’

To fully support the wide variety and ever increasing business development and research activities in bioPharma, we have to be better at aggregating and delivering information as it is needed and where it is needed. Data workflow technologies such as InforSense can make it a manageable process to develop customized information flows for individuals or project teams. RSS and Web-enabled databases (a topic of its own) can greatly increase the efficiency of information aggregation and delivery on an ongoing basis.

A variety of people have played important roles in this adventure, from our highly collaborative vendors and our Research Informatics department to my team members, though our customers deserve the credit for motivating us.

Figure 1: Information Workflow


William Hayes is Assoc. Dir., Library & Literature Informatics at Biogen Idec.
