OpenGamma for storing exchange historical data?

Hi,

I have a few questions about the back-end used in OpenGamma:

  1. Does OG store tick data for all companies on the exchange in the local DB?
  2. If so, will the DB be able to accommodate, say, the last 15 years' worth of data?
  3. Can the database currently used scale to this level of usage?

Thanks
Vineeth

There is no automatic process to populate our databases with historical data for everything on a given exchange; you need to explicitly load the time series you’re interested in. Internally we maintain time series for a particular universe of securities, typically members of equity indices, but also including non-exchange-traded rates (Libor, swaps, FX, etc.).

Fundamentally, though, you need a list of identifiers for the securities traded on an exchange to be able to load them, and you need some kind of market data adapter to pull the data into OpenGamma. We then run a nightly script to update the series each day. Having said that, it’s not particularly difficult to write your own time series loader.
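For what it’s worth, a custom loader doesn’t need to be much more than a loop over your identifier universe. Here is a minimal sketch in Java; MarketDataClient and TimeSeriesStore are hypothetical interfaces standing in for your vendor adapter and whatever you persist into, not the actual OpenGamma API.

```java
import java.time.LocalDate;
import java.util.List;
import java.util.Map;

/**
 * Minimal sketch of a custom time series loader. MarketDataClient and
 * TimeSeriesStore are invented for the example; they stand in for your
 * market data adapter and the time series master you write into.
 */
public class NightlyTimeSeriesLoader {

  /** Hypothetical adapter to whatever vendor/exchange feed you have access to. */
  public interface MarketDataClient {
    Map<LocalDate, Double> fetchDailyCloses(String identifier, LocalDate from, LocalDate to);
  }

  /** Hypothetical facade over the store that persists the time series. */
  public interface TimeSeriesStore {
    void appendPoints(String identifier, Map<LocalDate, Double> points);
  }

  private final MarketDataClient marketData;
  private final TimeSeriesStore store;

  public NightlyTimeSeriesLoader(MarketDataClient marketData, TimeSeriesStore store) {
    this.marketData = marketData;
    this.store = store;
  }

  /** Appends the given business date's closes for every identifier in the universe. */
  public void runNightlyUpdate(List<String> identifiers, LocalDate businessDate) {
    for (String id : identifiers) {
      Map<LocalDate, Double> points = marketData.fetchDailyCloses(id, businessDate, businessDate);
      if (!points.isEmpty()) {
        // Append only the new points so the nightly run stays incremental.
        store.appendPoints(id, points);
      }
    }
  }
}
```

Scheduling that from cron (or whatever you already use for overnight jobs) gives you essentially the same nightly update process we run internally.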

The database should be able to handle large numbers of long time series; the scalability will depend on your hardware and software setup. We haven’t tested Postgres with very high numbers of time series so far, but to give you an idea, we’ve probably had around 10,000 series with 2-20 years of history each with no problems on fairly modest hardware.

Well, our usage would be as below.
For a single company there would be 600 time series with daily frequency (at least to start with).
There would be 7,000 such companies, and we want to store data for the past 15 years.
Will Postgres or HSQL scale to this amount?

The short answer is that we haven’t tested it at that kind of scale. You’re talking about 4,200,000 time series with 15 years of data points each, so around 16 billion data points. That theoretically shouldn’t be a problem for any reasonable RDBMS. The problems you’re likely to run into are far more around keeping your database up to date: inserts in particular can become a problem if you’re doing millions of rows a day, particularly if you’re also replicating the data. You’ll also need some strategy for managing updates, as it’s unrealistic to expect to refresh the entire database each day to deal with the bad data points that inevitably exist in most historical data sets.
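To make the insert point concrete, here is a rough sketch of how a daily bulk load could look. The ts_point table and its columns are made up for the example (this is not OpenGamma’s schema), and ON CONFLICT needs Postgres 9.5 or later; the parts that matter are batching the statements and committing once per load rather than once per row, and making the load re-runnable so vendor corrections just overwrite in place.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.time.LocalDate;
import java.util.List;

/**
 * Sketch of a daily bulk load into a hypothetical ts_point(ts_id, point_date, value)
 * table. JDBC batching keeps millions of rows a day from becoming millions of
 * round trips; the upsert makes re-running the load after a correction safe.
 */
public class DailyPointWriter {

  private static final String UPSERT =
      "INSERT INTO ts_point (ts_id, point_date, value) VALUES (?, ?, ?) " +
      "ON CONFLICT (ts_id, point_date) DO UPDATE SET value = EXCLUDED.value";

  public record Point(long tsId, LocalDate date, double value) {}

  public static void write(String jdbcUrl, List<Point> points) throws Exception {
    try (Connection conn = DriverManager.getConnection(jdbcUrl);
         PreparedStatement ps = conn.prepareStatement(UPSERT)) {
      conn.setAutoCommit(false);           // one commit per load, not per row
      int inBatch = 0;
      for (Point p : points) {
        ps.setLong(1, p.tsId());
        ps.setObject(2, p.date());
        ps.setDouble(3, p.value());
        ps.addBatch();
        if (++inBatch % 10_000 == 0) {     // flush in chunks to bound memory
          ps.executeBatch();
        }
      }
      ps.executeBatch();
      conn.commit();
    }
  }
}
```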

If you ever did hit performance issues with Postgres, you’ve got a couple of options:

  • Shard the data over multiple time series masters and use a combining master to aggregate/split them. This is pretty straightforward (there’s a rough sketch of the routing idea after this list).
  • Move to a column store - we'll be supporting Vertica for time series, but obviously this is a commercial option.
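For the first option, the combining layer can be very thin. The sketch below is purely illustrative (the interfaces are invented for the example; the real masters have a much richer API), but the core trick is the same: pick a shard deterministically from the series identifier so every read and write for a given series always lands on the same underlying master.

```java
import java.util.List;

/**
 * Sketch of the routing idea behind a combining/sharded time series master.
 * TimeSeriesStore is a hypothetical interface for the example.
 */
public class ShardedTimeSeriesStore {

  public interface TimeSeriesStore {
    double[] load(String identifier);
  }

  private final List<TimeSeriesStore> shards;

  public ShardedTimeSeriesStore(List<TimeSeriesStore> shards) {
    this.shards = List.copyOf(shards);
  }

  /** Deterministic mapping: the same identifier always resolves to the same shard. */
  private TimeSeriesStore shardFor(String identifier) {
    int index = Math.floorMod(identifier.hashCode(), shards.size());
    return shards.get(index);
  }

  public double[] load(String identifier) {
    return shardFor(identifier).load(identifier);
  }
}
```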