Performance testing for OG timeseries data store

Hi,

I have been using Postgres as the time-series storage for OG.
Now that I have around 90 GB of data in Postgres, performance on the UI side is very poor.
It takes an insane amount of time for a time-series to load in the UI.
One time-series took 54 seconds to load.

What are my options here? Will moving to some other DB help? Or is there a NoSQL option, at least for time-series storage?

Thanks
Vineeth

The slowness on the UI side may be caused by a generic issue with paging through the result documents, which performs very poorly with large numbers of documents, especially if you’re paging towards the end of the dataset. There are now some caching implementations that should prefetch ahead and cache the documents. I shall try and get the engineer involved to point you at the appropriate configurations. We can then review and see whether that improves things enough for you.

There is currently no NoSQL time-series implementation.

Hello @jim,

Is it possible to accommodate other SQL solutions like MonetDB with OG? Also, can you suggest some optimizations that can be done on the Postgres side to improve performance?

Thanks
Vineeth

Hi,

OG is specifically designed to facilitate adding support for other databases for the masters, including the time-series master, but it would still involve some work. A new OG database dialect needs to be created (refer to the existing hsqldb, postgresql and sqlserver2008 dialects) and the queries (which are cleanly isolated in ElSql files under OG-MasterDB) may need to be customized for the new dialect.

Alternatively, a general solution to such performance issues is currently under development in the guise of caching masters. Any or all of the masters (under the develop branch) may be wrapped with a cache, which will attempt to keep documents in-core for fast retrieval.

Having said that, work on time-series caching specifics is at an early stage and consequently the current time-series cache in the develop branch might be of limited use to you at the moment. In its current state it only caches time-series documents (the ‘metadata’), but not the actual time-series data points. This might still help if your observed UI slow-down is due to large numbers of individual time-series, but won’t be of much use if your problem is with individual time-series that contain huge numbers of data points. The current estimate for availability of the full data point caching logic in the develop branch is between one and two weeks from now.

Nonetheless, you may still enable caching right now for specific masters by editing fullstack-example.ini (assuming you are using OG-Example, as always in the develop branch) and adding, in the case of the time-series master:

[ehcachingHtsMaster]
factory = com.opengamma.component.factory.master.EHCachingHistoricalTimeSeriesMasterComponentFactory
classifier = caching
underlying = ::central
cacheManager = ::hts

This will wrap a caching layer around the database-backed time-series master that should already be in your fullstack-example.ini:

[dbHtsMaster]
factory = com.opengamma.component.factory.master.DbHistoricalTimeSeriesMasterComponentFactory
classifier = central
dbConnector = ::hts

Additionally, you will need to specify which of the various OG subsystems will use the caching time-series master on a case-by-case basis, again by editing fullstack-example.ini. For instance, to instruct the web UI to make use of the time-series cache:

[webBasics]

historicalTimeSeriesMaster = ::caching

You might also choose to update other sections, such as [historicalTimeSeriesSource], to use the caching master.

The same procedure applies when enabling caching for the other masters. Please keep in mind that this is development code in flux, and as such may exhibit quirks. Of course, you are encouraged to report any issues you might encounter!

Regards,

Kevin

Hi,
I don't think caching can help my issue in any way. I have a 500 GB+ Postgres database, and the time-series UI is at the point of waiting forever.
I tried moving to the 1.2.x branch to see how it would work.
I took the branch, ran ant build at the root, ran ant init-database in the OG-Example folder, changed the fullstack properties to point to my DB, and then ran ant jetty-debug.

Now when I try to view a time-series in the UI, I am seeing a GC error - https://gist.github.com/Vineeth-Mohan/5326318/raw/f0df70b390459d98110a75638d1534de5c629673/gistfile1.txt

RAM - 8 GB
For reading a single time-series, I don't see a reason for a GC error. This means something else spooky is happening in parallel.
Also, my teammate was saying that he did not have many issues reading time-series via the time-series master from inside the code (I need to check on that though).
Kindly let me know what can be done about this.

Thanks
Vineeth

Hi,

There are a couple of additional details that could help us diagnose the issues you are encountering.

Could you indicate the approximate size of the time-series (number of data points) that is giving you problems in the UI, please? Are you also experiencing the same issue with smaller time-series or does the problem only appear with larger individual time-series, even in the context of your large time-series database?
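If it is easier to check straight from Postgres while the UI is struggling, something along the following lines should give a rough idea. This is only a sketch: I am assuming the data points sit in a table such as hts_point keyed by doc_oid, which may not match your OG-MasterDB schema exactly, so adjust the table and column names as needed.

-- Sketch only: per-series data point counts.
-- NOTE: hts_point and doc_oid are assumptions about the schema; adjust to
-- whatever table/column actually hold the data points in your installation.
SELECT doc_oid, COUNT(*) AS n_points
FROM hts_point
GROUP BY doc_oid
ORDER BY n_points DESC
LIMIT 20;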

Regards,

Kevin

There are 10 years' worth of data in the DB, so it would be 10 * 365 data points at most. I can guarantee this doesn't have anything to do with the size of the time-series: I ran some code that takes all the time-series and dumps them as CSV in the same process, and it worked just fine. Again, 8 GB of main memory for a single time-series is not something we can accept, so I am pretty sure the root cause is something else.

@kevin - Let me know if you need any other piece of information.

That time-series size is pretty average, so I would agree with your conclusion.

The stack trace you provided indicates that the problem occurs when searching for related time-series, that is, all time-series whose sets of external IDs intersect the current time-series's set of external IDs. This search is performed in order to populate the drop-down that allows you to select time-series for different data fields (e.g. YLD_YTM_MID, PX_LAST, etc.).

There are a couple of questions I would like to ask you in relation to this:

  1. Does your time-series master contain a particularly large number of time-series that would be considered related, i.e. all of which share a common external ID?
  2. Does your time-series master contain time-series with an empty external ID set? I have identified a bug that might result in similar behaviour under such circumstances. (Rough queries for checking both points follow below.)
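If checking directly against Postgres is easier while the UI is unusable, sketches along these lines should answer both questions. They only touch the hts_document, hts_doc2idkey and hts_idkey tables and ignore the version/correction instant columns, so treat them as approximations rather than the exact queries the master runs:

-- 1. How many documents share each external ID key?
SELECT i.key_scheme, i.key_value, COUNT(DISTINCT d.doc_id) AS doc_count
FROM hts_doc2idkey d
JOIN hts_idkey i ON d.idkey_id = i.id
GROUP BY i.key_scheme, i.key_value
ORDER BY doc_count DESC
LIMIT 20;

-- 2. Are there any time-series documents with no external ID mappings at all?
SELECT COUNT(*) AS docs_without_ids
FROM hts_document doc
WHERE NOT EXISTS (SELECT 1 FROM hts_doc2idkey d WHERE d.doc_id = doc.id);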

Regards,

Kevin

Hello @kevin,

Yes, there are at least 100 time-series per security in my database. That means there should be at least 100 related time-series.
Empty external IDs - I need to check, but there is no way I can check it right now as the UI is stuck.
What can be done here?

Thanks
Vineeth

Vineeth,

After further testing it turns out that an empty external ID set would not cause the problem I’d assumed it would, hence I wouldn’t follow that line of investigation any further.

In the meantime, I have pushed a minor modification to the time-series UI in the develop branch that should address the issue you have been encountering. Could you please try out the latest code in develop and let me know if this improves matters at all?

Regards,

Kevin

Thanks @kevin, that issue is fixed now.
Can you push the change to 1.2.x too?

Thanks
Vineeth

Hello @kevin,
I went through your fix of limiting the results using pagination, but is that the right permanent solution? Again, if the result is just 100 or 200 strings (the field names), how did it eat up all those gigabytes of memory?
I feel the fix only addresses the symptom, and we need to track down the root cause.

Thanks
Vineeth

Vineeth,

I tried reproducing the problem locally before applying the pagination fix. However, the worst I could do was to obtain a 25-second delay, by having over 10,000 (ten thousand!) related time-series returned in the original unpaginated search, practically the entire content of my time-series master. Incidentally, the result from the back end is not just a list of strings, but the actual related time-series documents (excluding the data points, of course). In my case, the JVM was configured with just -Xmx2g and did not suffer from memory issues at any point.

The pagination fix is really just a general, high-level sanity check to contain time-series load times in the UI. But at this point I think the main question is, what exactly is leading to the deterioration and memory issues in your particular instance, but not in other installations?

What is your current JVM heap limit set to? What happens when you try executing an identical search API call (i.e. for related time-series) outside the UI? Could you try restarting the engine, repeating the same UI operations afresh, and observing if the result varies? Last but not least, how many distinct time-series does your master currently contain?
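For the last point, a rough count straight from Postgres would do. The sketch below runs against the hts_document table and mirrors the version/correction predicates used by the master's own queries; I am assuming oid identifies the logical time-series across document versions:

-- Approximate number of distinct, currently active time-series in the master.
-- (oid is assumed to identify the logical series across document versions.)
SELECT COUNT(DISTINCT oid) AS n_series
FROM hts_document
WHERE ver_from_instant <= now() AND ver_to_instant > now()
  AND corr_from_instant <= now() AND corr_to_instant > now();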

Regards,

Kevin

Hello @kevin,

My JVM heap is 8 GB. By the way, this issue was seen in 1.2.x; in develop, a different GC error came up.
Anyway, I will try all your suggestions and try to get a better idea of the problem.

Thanks
Vineeth

By the way, each historical time-series document has the following information:

  1. 6 identifiers
  2. 6 pieces of company information (company names and so on)
  3. 9 attribute values
  4. And a very long name (the entire string dump of the identifiers and company name)

HTH

Thanks
Vineeth

Vineeth,

Thanks for the information. When you get a chance I would appreciate knowing the total number of time-series currently in your master. I’m also curious to know what happens when you attempt to perform an identical search operation for related time-series in your own code against your OG installation: whether it succeeds, how long it takes, and the number of related time-series actually returned (to eliminate the possibility that an abnormally large number of time-series are somehow being flagged as related).

In the meantime the UI fix that limits the number of related time-series returned has been pushed to dev/v1.2.x as well, as requested.

Regards,

Kevin

Hello @kevin,

I did some rigorous debugging and here are my findings.

The query executed on the DB -
2013-04-13 13:46:23 EDT LOG: execute :
WITH cte_docs AS (
  SELECT id FROM hts_document
  WHERE ver_from_instant <= $1 AND ver_to_instant > $2
    AND corr_from_instant <= $3 AND corr_to_instant > $4
    AND id IN (
      SELECT DISTINCT doc_id FROM hts_doc2idkey, hts_document
      WHERE hts_doc2idkey.doc_id = hts_document.id
        AND hts_document.ver_from_instant <= $5 AND hts_document.ver_to_instant > $6
        AND hts_document.corr_from_instant <= $7 AND hts_document.corr_to_instant > $8
        AND idkey_id IN (
          SELECT id FROM hts_idkey
          WHERE (key_scheme = $9 AND key_value = $10)
             OR (key_scheme = $11 AND key_value = $12)
             OR (key_scheme = $13 AND key_value = $14)
             OR (key_scheme = $15 AND key_value = $16)
             OR (key_scheme = $17 AND key_value = $18)
             OR (key_scheme = $19 AND key_value = $20)
        )
    )
  ORDER BY id
)
SELECT main.id AS doc_id, main.oid AS doc_oid,
       main.ver_from_instant AS ver_from_instant, main.ver_to_instant AS ver_to_instant,
       main.corr_from_instant AS corr_from_instant, main.corr_to_instant AS corr_to_instant,
       nm.name AS name, df.name AS data_field, ds.name AS data_source,
       dp.name AS data_provider, ot.name AS observation_time,
       i.key_scheme AS key_scheme, i.key_value AS key_value,
       di.valid_from AS key_valid_from, di.valid_to AS key_valid_to
FROM hts_document main
INNER JOIN cte_docs ON main.id = cte_docs.id
INNER JOIN hts_name nm ON (main.name_id = nm.id)
INNER JOIN hts_data_field df ON (main.data_field_id = df.id)
INNER JOIN hts_data_source ds ON (main.data_source_id = ds.id)
INNER JOIN hts_data_provider dp ON (main.data_provider_id = dp.id)
INNER JOIN hts_observation_time ot ON (main.observation_time_id = ot.id)
LEFT JOIN hts_doc2idkey di ON (di.doc_id = main.id)
LEFT JOIN hts_idkey i ON (di.idkey_id = i.id)
ORDER BY main.id
2013-04-13 13:46:23 EDT DETAIL: parameters: $1 = '2013-04-13 17:46:23.13104', $2 = '2013-04-13 17:46:23.13104', $3 = '2013-04-13 17:46:23.13104', $4 = '2013-04-13 17:46:23.13104', $5 = '2013-04-13 17:46:23.13104', $6 = '2013-04-13 17:46:23.13104', $7 = '2013-04-13 17:46:23.13104', $8 = '2013-04-13 17:46:23.13104', $9 = 'CUSIP', $10 = '360271100', $11 = 'FREQUENCY', $12 = 'MONTHLY', $13 = 'ISIN', $14 = 'US3602711000', $15 = 'permno0', $16 = '11658', $17 = 'permno1', $18 = 'ABC11658', $19 = 'ticker', $20 = 'FULT'

Its output as a text file is around 400 MB.

But the real issue is not here.
The JSON generated for the call is around 91 MB.
And the browser is somehow not able to parse this JSON.

Looking into the JSON, there are 400,000+ entries in the related field.
Looking further into it:

grep object_id bad.json | wc -l
416043
grep object_id bad.json | sort | uniq | wc -l
138681

This means each entry appears roughly three times.

grep -A5 'DbHts~174301' bad.json
,{  "object_id": "DbHts~174301",
    "data_field":"ProfitMargin",
    "data_source":"BLOOMBERG",
    "data_provider":"BLOOMBERG",
    "observation_time":"CLOSE"
}
,{  "object_id": "DbHts~174301",
    "data_field":"ProfitMargin",
    "data_source":"BLOOMBERG",
    "data_provider":"BLOOMBERG",
    "observation_time":"CLOSE"
}
,{  "object_id": "DbHts~174301",
    "data_field":"ProfitMargin",
    "data_source":"BLOOMBERG",
    "data_provider":"BLOOMBERG",
    "observation_time":"CLOSE"
}

Now coming to the obvious question: there are 121 related fields for this time-series, so why on earth is it trying to fetch all these object IDs?
Also, by the looks of it, I guess it is dumping all of the related time-series information into the JSON.
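As a cross-check directly against the DB: the final SELECT in the logged query LEFT JOINs hts_doc2idkey and hts_idkey, so it returns one row per document per external ID key. A rough query along these lines (ignoring the version/correction filters, so only an approximation) should show whether the duplication in the JSON simply mirrors the duplication in the result rows:

-- Rough check: total join rows vs distinct documents, mirroring the
-- LEFT JOINs in the logged query (version/correction filters omitted).
SELECT COUNT(*) AS join_rows, COUNT(DISTINCT main.id) AS distinct_docs
FROM hts_document main
LEFT JOIN hts_doc2idkey di ON di.doc_id = main.id
LEFT JOIN hts_idkey i ON di.idkey_id = i.id;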

Hope this information is useful. I would be glad to give you more info on the same.

Thanks
Vineeth

Also here is the bad JSON - http://www.2shared.com/file/QM9WVZC7/bad.html

Vineeth,

Thank you for identifying this problem. Clearly this should not be happening, as the duplicate entries are superfluous. I have filed PLAT-3333 on Jira, which will be resolved as soon as developer resources are available.

Regards,

Kevin