When I first cut my teeth in IT security some years ago, I was a systems administrator for a division of the University Corporation for Atmospheric Research, the parent of the National Center for Atmospheric Research here in Boulder. UCAR/NCAR is what Gordon Bell calls a “data place” – an organization whose mission in part is to serve as a storehouse of large volumes of atmospheric monitoring and modeling data.

When I first joined UCAR, my group seemed a bit enigmatic to me, since its primary focus was on the use of GPS technology in fields such as seismic monitoring. What was a group like that doing at the National Center for Atmospheric Research?

But then I learned a bit more about this group’s mission. My division was responsible for collecting and warehousing GPS data from monitoring sites around the world – and one useful application of that data was in predicting rainfall and storm severity. At one time this was a difficult challenge for meteorologists, but what became evident from our data was that GPS signal delays are related to a number of factors – including atmospheric water vapor content. So it was that our GPS data, initially collected to measure geologic activity, was put to work on tasks such as monitoring the “dry line” in the south central US where our worst thunderstorms often occur – and that was just one area of atmospheric research to which our data was applied.

This is an example of what Jim Gray called the “fourth paradigm” of scientific investigation: the analysis of data becoming a research discipline in its own right – “eScience” – as investigators probe findings accumulated through the earlier paradigms of experimentation/observation, theory and computational simulation.

Ever since, I have wondered what other insights lie hidden in the vast amounts of data organizations everywhere are accumulating. What new discoveries are waiting for us to find them – not in theories or experimental techniques per se, but in the data they generate, as well as in the data we already have?

To me, it seems that IT security and risk management is ripe for such investigation.

For years, we have been groping our way toward better methods of measuring and understanding risk in this realm. But this is still a nascent discipline, and there is one thing it still lacks. It’s not so much the data itself – that we have, and lots of it. What we do not have is a broad and consistent body of evidence from data analysis that would yield measurable understandings – though that body of evidence is beginning to build.

The late Peter Bernstein once observed that the reason financial risk management had advanced so far ahead of other aspects of business risk management was that the world of finance had a considerable body of data available to provide the raw materials for the mathematics of understanding. Other fields simply did not have the same wealth of data. Today, we face much the same situation in IT security – except that we already have a great deal of raw material. We collect volumes of monitoring and intelligence information. We simply do not yet realize what much of it means. We have yet to make broad application of the disciplines of data science to synthesize this information and more fully explore this still largely undiscovered country.

We also desperately need more accurate and timely insight into modern threats that evade detection simply because our approaches are outmoded. We’ve long known that signature-based defenses are becoming overwhelmed by the sheer volume of threats, as well as by the innovations of highly skilled adversaries who can probe a wealth of opportunities in the complexities of IT. It’s not just that poring through the vast array of potential security issues in any environment requires far greater data analysis capabilities than most of our techniques enable today. We also need ways to better identify meaningful, actionable insight in all this information.

Today, the technologies of data mining and management are growing along with the data explosion. Deployable data appliances and data warehouse “racks” concentrate compute, networking and storage capabilities in packaged form factors purpose-built for data mining and analysis (examples: Greenplum [now EMC], Netezza [now IBM], Oracle Exadata, Teradata). Structured storage and NoSQL approaches allow databases to scale more horizontally, while techniques such as MapReduce and toolsets such as Hadoop enable distributed computing across large data sets. Together, these techniques are designed to make large bodies of data more “digestible” – more responsive to search, mining and analysis. Cloud computing could put these techniques within reach of a far wider range of organizations – provided the much-discussed risks of cloud computing can themselves be mitigated.
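To make the MapReduce pattern a bit more concrete, here is a minimal single-process sketch in Python that counts event types across a handful of log lines. The log format and event names are invented for illustration; in practice the map and reduce steps would be distributed across a Hadoop or similar cluster rather than run in one process.

```python
from collections import defaultdict

# Hypothetical log lines in the form: "timestamp host event_type details"
LOG_LINES = [
    "2011-03-01T10:00:01 web01 failed_login user=alice",
    "2011-03-01T10:00:02 web02 failed_login user=bob",
    "2011-03-01T10:00:05 web01 port_scan src=203.0.113.9",
]

def map_phase(line):
    """Emit (key, 1) pairs -- here the key is the event type field."""
    fields = line.split()
    if len(fields) >= 3:
        yield fields[2], 1

def reduce_phase(key, values):
    """Sum the counts emitted for a single key."""
    return key, sum(values)

def run(lines):
    # Shuffle step: group intermediate values by key.
    grouped = defaultdict(list)
    for line in lines:
        for key, value in map_phase(line):
            grouped[key].append(value)
    # Reduce step: one call per distinct key.
    return dict(reduce_phase(k, v) for k, v in grouped.items())

if __name__ == "__main__":
    print(run(LOG_LINES))  # e.g. {'failed_login': 2, 'port_scan': 1}
```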

Even with these capabilities, security still presents obstacles in the variety and nature of the data we must examine. As noted in the previous post in this series, security professionals must deal with three main data types: textual information, quantitative data, and what I consider “object” data, such as binaries. How do we synthesize such dissimilar data types?

Consider, for example, that monitoring tools, intelligence reports and other security-relevant sources usually produce large amounts of textual data. How can we make this text content more useful as quantitative information that would better inform strategists responsible for security management?
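As one hedged illustration of what that might look like, the sketch below tallies how often a few watch-list terms appear in free-text reports, month by month, turning prose into a time series a strategist could chart. The terms and report snippets are invented for the example.

```python
import re
from collections import Counter, defaultdict

# Hypothetical (report_month, report_text) pairs.
REPORTS = [
    ("2011-01", "Phishing campaign observed targeting finance staff."),
    ("2011-02", "New SQL injection attempts against the web portal; phishing continues."),
    ("2011-02", "Malware dropper delivered via phishing email attachments."),
]

# Illustrative watch list of terms to count.
TERMS = ["phishing", "sql injection", "malware"]

def term_counts_by_month(reports, terms):
    """Turn free text into a per-month count of watch-list terms."""
    counts = defaultdict(Counter)
    for month, text in reports:
        lowered = text.lower()
        for term in terms:
            counts[month][term] += len(re.findall(re.escape(term), lowered))
    return counts

if __name__ == "__main__":
    for month, counter in sorted(term_counts_by_month(REPORTS, TERMS).items()):
        print(month, dict(counter))
```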

How about identifying – and perhaps predicting – security trends buried in a mass of text data? One such example is this investigation by Stephan Neuhaus of the Università degli Studi di Trento in Italy and Thomas Zimmermann of Microsoft. In what they describe as the first independent study of the whole body of the Common Vulnerabilities and Exposures (CVE) database outside MITRE, Neuhaus and Zimmermann developed a technique for identifying emerging security trends by applying unsupervised learning and Bayesian inference to textual data. How many other opportunities like this have yet to be discovered in security-relevant data stores?
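To give a feel for what unsupervised learning over vulnerability text can look like, here is a minimal sketch using topic modeling (latent Dirichlet allocation) over a few invented vulnerability descriptions. It assumes a recent version of scikit-learn is installed and is meant only to illustrate the general approach, not to reproduce Neuhaus and Zimmermann’s method.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Invented stand-ins for CVE description text.
DESCRIPTIONS = [
    "Buffer overflow in the image parser allows remote code execution.",
    "Cross-site scripting in the search form allows script injection.",
    "Heap overflow in the font renderer allows arbitrary code execution.",
    "SQL injection in the login form allows database access.",
    "Stack overflow in the PDF viewer allows remote code execution.",
    "Cross-site scripting in the comment field allows cookie theft.",
]

# Bag-of-words representation of the corpus.
vectorizer = CountVectorizer(stop_words="english")
doc_term = vectorizer.fit_transform(DESCRIPTIONS)

# Fit a small LDA model; a real corpus would use far more documents and topics.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(doc_term)

# Print the top words for each inferred topic.
terms = vectorizer.get_feature_names_out()
for topic_idx, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[::-1][:5]]
    print(f"topic {topic_idx}: {', '.join(top)}")
```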

What about the reverse problem: making quantitative data available for synthesis with textual content, to share and learn from community efforts to better understand IT risk? Search has made a wealth of new information available to everyone – but while textual information lends itself well to search, quantitative data largely does not. If some of the most useful technologies of “big data” management have been developed in support of search, how can we apply them to quantitative information?
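One possible bridge – sketched below with invented metric names and thresholds – is to bin numeric measurements into coarse textual tokens so they can sit in the same inverted index as report text. This is only one assumption about how the gap might be closed.

```python
# Hypothetical per-host metrics we might want to make searchable.
METRICS = {
    "web01": {"failed_logins_per_day": 420, "patch_age_days": 95},
    "web02": {"failed_logins_per_day": 3, "patch_age_days": 12},
}

# Invented bin edges: (upper_bound, label) pairs, checked in order.
BINS = {
    "failed_logins_per_day": [(10, "low"), (100, "medium"), (float("inf"), "high")],
    "patch_age_days": [(30, "current"), (90, "aging"), (float("inf"), "stale")],
}

def to_tokens(metrics):
    """Convert numeric metrics into searchable tokens like 'failed_logins_per_day:high'."""
    tokens = []
    for name, value in metrics.items():
        for upper, label in BINS[name]:
            if value <= upper:
                tokens.append(f"{name}:{label}")
                break
    return tokens

if __name__ == "__main__":
    for host, metrics in METRICS.items():
        # e.g. web01 ['failed_logins_per_day:high', 'patch_age_days:stale']
        print(host, to_tokens(metrics))
```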

And what about the third category: object data, which poses challenges of its own in meaningful synthesis – but which is highly meaningful in security, not least in the capture and analysis of attack binaries or other evidentiary data?
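Even for object data, simple fingerprints give us something to pivot on. The sketch below computes a cryptographic hash and a byte-frequency histogram for an arbitrary file – two features often used as starting points for matching and clustering binaries. The file path is a placeholder.

```python
import hashlib
from collections import Counter

def fingerprint(path, chunk_size=65536):
    """Return (sha256 hex digest, byte-frequency histogram) for a file."""
    sha256 = hashlib.sha256()
    histogram = Counter()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            sha256.update(chunk)
            histogram.update(chunk)  # counts individual byte values
    return sha256.hexdigest(), histogram

if __name__ == "__main__":
    digest, histogram = fingerprint("sample.bin")  # placeholder path
    print("sha256:", digest)
    print("most common bytes:", histogram.most_common(5))
```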

To answer these questions, I fully expect security professionals, data scientists, and the technologies of data management and Business Intelligence to converge in some very innovative approaches to the opportunity.

I’ll offer a few examples in the next post in this series.
