Data science, big data, data storage & mining, curation, analytics, in-memory techniques, graph databases

Both Climate Modeling and Earth System Modeling entail petabytes (10^15 bytes), if not exabytes (10^18 bytes), of observational and sensor network data, as well as the vast amounts of data output by the simulation process itself.

Generally speaking, observational data is best stored locally, near where it was collected - simply because moving data has a high cost, and the 'pride of ownership' factor helps preserve the quality and integrity of such data over the long term. Simulation output data, on the other hand, is best stored near the computing centres at which the simulations are run, although analysts and decision-makers need easy access to such data wherever they may be located.
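To give a rough sense of why moving data is costly, the sketch below estimates how long a bulk transfer would take. The 10 Gbit/s link speed and the `efficiency` parameter are illustrative assumptions, not figures from any of the sources listed on this page, and real transfers are usually slower than the ideal case.

```python
def transfer_days(num_bytes, link_bits_per_second, efficiency=1.0):
    """Days needed to move num_bytes over a link of the given speed.

    efficiency is the assumed fraction of the nominal link speed that is
    actually achieved (1.0 means an ideal, fully used link).
    """
    seconds = (num_bytes * 8) / (link_bits_per_second * efficiency)
    return seconds / 86_400  # seconds per day


PETABYTE = 10**15        # bytes
EXABYTE = 10**18         # bytes
TEN_GBPS = 10 * 10**9    # assumed link speed: 10 gigabits per second

print(f"1 PB at 10 Gbit/s: {transfer_days(PETABYTE, TEN_GBPS):.1f} days")
print(f"1 EB at 10 Gbit/s: {transfer_days(EXABYTE, TEN_GBPS) / 365:.1f} years")
```

Under these assumptions a single petabyte takes roughly nine days to move, and an exabyte about 25 years, which is one reason data tends to stay where it was produced.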

When useful data is stored in several separate facilities, it needs to be 'federated' and 'harmonized' so as to become accessible and useful. In such cases the data must be accessed remotely over global networks, and its security and integrity must be preserved through encryption techniques.
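As one illustration of checking integrity after a remote transfer, the sketch below compares a file against a SHA-256 digest published by the data holder. This is a checksum comparison, which complements encryption rather than replacing it. The function names and the sample file contents are hypothetical.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_is_intact(path, published_digest):
    """True if the local copy matches the digest published at the source."""
    return sha256_of_file(path) == published_digest


# A small temporary file stands in for a remotely stored dataset.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"station_id,temperature_c\nA1,14.2\n")
    sample_path = tmp.name

published = sha256_of_file(sample_path)        # digest recorded at the source
print(copy_is_intact(sample_path, published))  # unchanged copy: True

with open(sample_path, "ab") as handle:        # simulate corruption in transit
    handle.write(b"corrupted")
print(copy_is_intact(sample_path, published))  # altered copy: False

os.remove(sample_path)
```

A plain digest only detects accidental corruption. Guarding against deliberate tampering would also require the digest itself to be authenticated, for example by a signature.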

Data is only useful if it is of known quality and the circumstances of its collection are well understood. Information about quality and provenance, known as 'meta-data', is itself another level of data that needs to be carefully curated and kept safe, just like the original underlying source data.

This webpage covers most aspects of the entire data life cycle: 

Data science: 

Big data, big dreams

The world after big data

Big data's dirty little secret

The future of data science

Don't be a big data snooper

Building the big data highway

Data-Intensive system evolution

The evolution of the data scientist

Finding patterns in corrupted data 

Is your smartphone spying on you?

A bottom-up approach to data quality

The future of big data is ... JavaScript?

Software as a service for data scientists

New techniques turbo-charge data mining

Big data requires big vision for big change 

Reinventing society in the wake of big data

Charting a course out of the big data doldrums

The evolving art (and business) of data curation 

Why developers need to think like data scientists

Which programming language is best for big data?

Expert panel: what’s around the bend for big data?

What exactly is big data - if it's neither big nor data?

The new era of computing: an interview with 'Dr. Data'

The third age of data and the unfolding scale-out world

Big data challenges and advanced computing solutions

The origins of 'Big Data': an etymological detective story 

Only a fraction of the 160 zettabyte 'datasphere' to be stored

Big data and the creative destruction of today's business models

From microprocessors to nanostores: rethinking data-centric systems

What CIOs and CTOs need to know about big data and data intensive computing 

Re-platforming the enterprise, or putting data back at the center of the data center 

Moving big data:

Globus moves 1 exabyte

Five reasons for leaving your data where it is

How to move 80 petabytes of data without down time

Sizing up big data:

Global datasphere to hit 175 zettabytes by 2025

Is big data dead? MotherDuck raises $47M to prove it

Data analytics - realtime and in-memory:

Pulling insights from unstructured data

Getting ready for real-time decisioning

Rating the advanced analytics vendors

Software engineering for data analytics

Five steps to de-mystify big data analytics

Analyzing video, the biggest data of them all

What's driving the rise of real-time analytics?

Arrow aims to defrag big in-memory analytics

Algorithms trump big data, apps and analytics

Peering into the crystal ball of advanced analytics

Beyond big: the analytically powered organization 

Inflexible data, analytics fueling failures, survey finds

Text analytics and machine learning: a virtuous combination 

Operationalizing data-driven decisions: a 5-step methodology

Mission analytics: data-driven decision making in government

5 ways big geospatial data is driving analytics in the real world

Combining HPC and big data analytics on the same infrastructure

Understanding data intensive analysis on large-scale HPC compute systems

Reducing big data using ideas from quantum theory makes it easier to interpret

Transitioning from big data to discovery: data management as a keystone analytics strategy

Big data analytical advances from academia, business are enhancing exploration of our universe

In-memory big data:

In-memory database goes 'translytical' 

In-memory computing is the key to real-time analytics

Using in-memory data grids for global data integration

How IMDGs can analyze fast-changing data in real-time

In-memory boosts Oracle OLTP by 2X, analytics by 1000X

Using an in-memory data grid for near real-time data analysis 

Cloud-based big data:

What is the CAP theorem?

Big data in the public cloud

Tracking the rapid rise in cloud data

Microsoft scales Azure Data Lake into exascale territory

AI and big data:

Why knowledge graphs are foundational to AI

Five reasons machine learning is moving to the cloud

From data to knowledge: machine-learning with real-time and streaming applications

Big data in science:


Seeing stars through the cloud

How big data advances physics

NOAA launches big data project

Why science really needs big data

Big data revolution in astrophysics

EU project looks to scale Earth data

A geodata fabric for the 21st century

AI called in to tackle LHC data deluge 

Next generation team science platform

DOE focuses on scientific data integration

Supercomputer sails through world history

DOE exascale roadmap highlights big data

Astronomers leverage 'unprecedented' data set

JPL, Caltech team up to tackle big data projects

SKA prepares for the ultimate big data challenge

Big data in space: martian computational archeology 

Really, really big data - NASA at the forefront of analytics 

Tool enables scientists to uncover patterns in vast data sets

Los Alamos releases file index product to software community

15 million Euro boost to manage European astronomy big data

As supercomputers approach exascale, experts wrestle with big data

Networking, data experts design a better portal for scientific discovery

To know, but not understand: David Weinberger on science and big data

Spatial data platform from SpaceCurve for real-time operational intelligence

Codesign challenges for exascale systems: performance, power and reliability

Core scientific dataset model: a lightweight and portable model and file format for multi-dimensional data 

Storage Systems, Data Lakes & Data Warehousing:

Storage at exascale

SDSC cloud storage services

Big Data file formats demystified

Data storage using individual molecules

High performance scalable unified storage

Optimize storage placement in sensor networks

ArangoDB reaping the fruits of its multi-modal labor

Is GOLAP the next wave for big data warehousing?

Availability in globally distributed storage systems

Multiparadigm data storage for enterprise applications

Stepping up to the life science storage system challenge

Software-defined storage takes off as big data gets bigger

Data lakes and overcoming the waste of 'data janitor' duties

New data storage is to Dye for - avoids DNA storage pitfalls

Big data, big demand: navigating the cloud storage landscape

Data warehouse modernization in the age of big data analytics

A four-phased approach to building an optimal data warehouse

To centralize or not to centralize your data - that is the question

The top 5 reasons to use multi-tier storage for managing scientific data

Storage system for 'big data' dramatically speeds access to information

Phase change memory-based 'Moneta' system points to the future of computer storage

Vendor-specific storage & tools:


HP: Exascale Data Center 


The complexity of VMware storage management


Fujitsu lets big data cloud flag fly     

Fujitsu develops world's first cloud platform to leverage big data


IBM big data VP surveys landscape

IBM design wins the storage challenge at SC10

IBM announces HPC storage solution for streaming data

IBM demos record-breaking parallel file system performance

IBM storage breakthrough paves way for 330TB tape cartridges


Parallel file system OrangeFS starts to build a following


MINE: Detecting novel associations in large data sets

MINE: Maximal Information-based Nonparametric Exploration


Presto poised for a breakout year as data explosion continues


Forrester reshuffles the deck on BI and analytic tools


The State of the Lustre Community

Why Lustre Is Set to Excel in Exascale 

Xyratex announces acquisition of Oracle's Lustre assets


Can Hadoop be simple again? 

Hate Hadoop? Then you are doing it wrong

Hadoop: Big Data, Big Analytics, Big Insights

Large-scale seismic signal processing with Hadoop

Why Hadoop isn't the Big Data solution that you think it is

Spark just passed Hadoop in popularity on the web - here's why

Database choices:

Different databases for different strokes

Oracle aims to break big data silos with SQL

RDBMS remains popular as data sources grow

Array databases: the next big thing in data analytics?

Self-driving databases are coming: what next for DBAs?

The polyglot problem: solving the paradox of the 'right' database

SQL vs non-SQL:

The new math driving NoSQL analytics

How SQL++ makes JSON more queryable

Crowded NoSQL wave shows abundant options

Graph Databases:

Graph databases worth $5.1B by 2026

AWS unveils 'Neptune' graph database

A look at the graph database landscape

Azure joins the Graph500 with Top20 showing

5 factors driving the graph database explosion

Graph databases gaining enterprise ready features

Why young developers don't get knowledge graphs

DIVE: a graph-based visual-analytics framework for big data

How mathematicians use homology to make sense of topology

KAIST introduces T-GPS, a tool for processing a trillion-edge graph on one computer

Neo4j delivers graph database hardened container in collaboration with DoD Platform One

Graph visualization: 

Giga graph cities: their buckets, buildings, waves and fragments

Evaluating representation learning and graph layout methods for visualization

Graph maths:

Mathematicians answer old question about graphs

Database virtualization:

Is now the time for database virtualization?