Posted about 13 years ago
Having a mix of public and private source code repositories poses an interesting challenge to any organisation. How do you combine the openness that Git assumes at an architectural level with the confidentiality needed for clients and commercial components?
In this post, I wanted to shed some light on how we’ve managed to address that challenge and come up with something that works pretty well for us at OpenGamma.
Why we chose Git in the first place
When we first started OpenGamma, the version control system that I was most comfortable with was Perforce. While Perforce is a great system, it didn’t really seem appropriate for an Open Source project. Although the Perforce client software is free to download, we felt that the Open Source community in general frowned upon using proprietary software for version control. We did consider Subversion, but given that the world seemed to be moving to distributed version control, we decided to take the plunge with Git. We immediately decided to use GitHub as a host for our repositories as part of our general approach of using hosted services where possible.
While OpenGamma is primarily an Open Source platform, we do have some proprietary components only available to commercial customers. In addition, our support methodology allows us to share code directly with customers to provide hands-on, proactive support. Unfortunately Git is really not designed to support a mixed public/private model. What we really want is to be able to wall off access to certain parts of a repository, but when you think about it more deeply, it’s actually pretty understandable why Git doesn’t let you do this. For example, if you commit across public and private code, is it acceptable to even expose the file names of the private files? Probably not. What about history? What if something is moved from private to public or vice versa? Do you want all these cross-linking chains of commits? Probably not.
Because OpenGamma is a heavily componentized system, we’ve always operated on a quite fine-grained project basis, and have formed a moderately complex tree of dependencies between modules (primarily managed by Eclipse and Ivy). When we first started, the easiest way to manage the public/private split seemed to be to set up a separate git repository for each project. We also had a top level repository, OG-Platform, that contained the common top-level build files and configuration. The idea was that you cloned from OG-Platform and then cloned each sub-project into the projects directory underneath it.
That worked relatively well until we got to about 10 projects, at which point we found that it took so long to commit changes across all projects that someone else had committed in the meantime and you had to merge again. We also found it was very easy to forget to sync up some of the projects and you’d get strange errors when a rarely used project was slightly out of date.
To address these problems we did two things:
We merged all the Open Source projects (excluding Fudge, the self-describing binary message encoding system we are the primary sponsors of) into a single repository, OG-Platform. OG-Platform would now contain all the Open Source projects in the projects directory straight away after a clone. This hugely reduced the number of repositories we needed to manage.
We added some tools to perform operations across repositories. The ant task ‘clone-or-pull’ does either a clone, if the repository is not present, or a pull if it is. We later added ‘ant pull’ to do the equivalent of ‘git pull origin master’ on every repository, and ‘ant status’ to do cross-project status reporting. Lastly we added a bash script, ‘gitstats.sh’, which takes the root directory as a parameter (e.g. ‘.’) and reports the status in a more compact tabular form.
Putting it all together for a public release
We still had a problem though: how to provide different sets of repositories to different users? For example, a user who was paying for our Excel integration module might be able to access the code for that, but it isn’t part of our Open Source release. One approach would have been to just attempt to clone [or pull] each of the Git repositories we had, and rely on the permission check failing if a repository wasn’t available, but this felt ugly and could potentially lead to information leakage about our private client projects.
To get around this we came up with a system that downloaded a small set of ant build and properties files from a web server, using HTTP authentication provided during the build. Open Source users would use one user name and password (the defaults offered), but commercial customers could use other user names and passwords to get customized build setups. This is where the ‘ant init’ task comes from. It allows the user name and password either to be set by the user as environment variables or to be entered via an interactive prompt.
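For illustration only, here is a hypothetical sketch of the kind of fetch such a task performs (the URL, class name and property file are made up; this is not our actual build code):

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Base64;
import java.util.Properties;

// Hypothetical sketch: fetch a per-customer build properties file over HTTP
// basic authentication, along the lines of what an 'ant init'-style task might do.
public class InitSketch {
  public static Properties fetchBuildProperties(String user, String password) throws Exception {
    URL url = new URL("https://example.opengamma.com/build/projects.properties"); // illustrative URL
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    String token = Base64.getEncoder().encodeToString((user + ":" + password).getBytes("UTF-8"));
    conn.setRequestProperty("Authorization", "Basic " + token);
    Properties props = new Properties();
    try (InputStream in = conn.getInputStream()) {
      props.load(in); // e.g. the list of repositories this user is entitled to clone
    }
    return props;
  }
}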
What this means is that we can now provide a completely customized set of projects to each customer based purely on a user name and password combination. And as long as we’ve set up their permissions in GitHub, they can clone the code to wherever they want.
A slight downside, however, was that it added a step to an already fairly complicated build process. To address this, we’re changing the behaviour for 1.0 to default to the Open Source user name and password unless one is explicitly set up. We also made the clone-or-pull part of a single-step build process, meaning there’s truly one command after the initial clone to build a binary installation. Lastly, we’re changing Fudge, which is now pretty stable, to be an external dependency. This means that out of the box, the Open Source release only requires a single clone rather than three.
What we’ve come up with seems to satisfy all our immediate requirements. It’s not perfect, and we considered many other approaches (e.g. sub-modules) before settling on this. With the right attention to detail in smoothing the path for Open Source users while allowing commercial customers access to the extra code they’re paying for, it seems to be a successful approach for us.
Of course, there’s always more than one way to skin a cat. Have you faced similar issues in your project? How did you solve them? Share your experiences in the comments below.
Posted about 13 years ago
As an Open Source business, we often get asked what our business model is (or, more directly: how do we make money?). I like that question because it invites us to talk about the relationship with our clients, which is probably the key element of what our commercial clients get and value.
Open Source is not only a better way of developing and delivering software, it also allows for a much deeper relationship between the vendor and the client. The openness in code and architecture encourages openness in other areas too, and creates a more collaborative relationship.
All the firms we work with - banks, hedge funds, inter-dealer brokers, etc. - are fed up with poorly supported solutions built on outdated software that no longer reflects the needs of the people using it. Building in-house is an option, but it's becoming increasingly difficult with tight budgets and hard-to-hire talent. Most good investors realise that they aren't, and shouldn't be, great software houses. So an Open Source solution is a desirable alternative, providing the transparency and openness you get from an in-house system with the development skills of a dedicated software firm.
No need to go unsupported
But Open Source doesn't mean you are on your own. With an OpenGamma subscription you get high-quality support and maintenance from relevant specialists who know your installation intimately.
We have dedicated staff with deep technical knowledge supporting our clients. With specific insight into your installation, from initial set-up to on-going maintenance, upgrades and expansion, they are at hand to ensure you have a stable and current solution that evolves with the rest of your infrastructure and business activities.
Access to expertise
And we really don't like just giving you a phone number to call and praying someone is there if things go wrong. As part of our own learning experience, we would like you to call us not just when something fails, but also when you simply wish to exchange ideas or question the way we've implemented a particular feature.
When you call, you get access to the engineers who actually built the system, which is great for both parties. You can talk to someone who knows what they are doing and can explain why we have done what we've done (which of course is transparent), and we learn what our clients are working on and what matters to them.
We believe this dialogue is much more than effective support. It forms an important part of the community - or ecosystem - of users and developers and benefits us all. We learn what's happening in the market and can share that with the community at large, as well as feeding it into the Platform to create a better solution.
As we focus on the technical issues that are common to many, rather than on what is proprietary, we think this sharing of information helps all our users understand what works and how, saving money and improving quality for all.
The OpenGamma Platform is also designed for integration, with APIs to your other systems. This means we have a lot of experience working with different solutions and usually have a pretty good understanding of the relevant strengths and weaknesses of those we integrate with. We're not tied to other software or hardware vendors, but we are happy to share our views and recommendations with our clients.
Let's talk about quants
Another aspect of being a client is that you have access to our quants: a highly regarded team of specialists with diverse but highly relevant backgrounds. They have all come to OpenGamma to work on something that is state-of-the-art and has industry-wide impact. We love to offer them up to our clients, who engage with them directly - sometimes clients come by for a "deep dive" chat, and sometimes to sound us out on how to generate the analytics for a particular trade, for example.
As with our developers, our quants are not here to replace you, but rather to provide the tools for you to do more and better.
Commercial transparency
The OpenGamma Platform can be installed in parts or enterprise-wide. What is built to support large investment banks or huge hedge funds is also available to smaller firms, on a smaller scale. We typically don't charge large upfront fees, preferring instead to spread the cost out according to a clear subscription agreement that we negotiate in advance, depending on the complexity of your installation and the level of involvement from us. You don't pay per-user licences or per-incident support fees, and we aim to make the cost both predictable and transparent. We are convinced the OpenGamma ROI is very compelling and would be happy to discuss it with you in more detail.
Finally, having access to the source code frees you from the tyranny of vendor dependencies. If you don't find being an OpenGamma client of value, there is nothing tying you to us; you are free to keep your installation and continue using the Platform! Kind of keeps us on our toes too.
Posted about 13 years ago
What a day for Open Source and the financial services industry at large: Bloomberg announced yesterday it is opening up its market data API under the open source MIT license - in stark contrast to its competitors.
The implications for OpenGamma users are huge, and positive. We know many of you evaluating the software already have Bloomberg terminals available, yet due to Bloomberg's previous policy have been forced to rely on mock data. We've provided some of you with the integration module for evaluation purposes under a proprietary license, but we also understand that many of you prefer to self-evaluate the product without having to contact us. Bloomberg's announcement will allow us to change that.
Where's 1.0?
We know you've all been anxiously waiting for the 1.0 release of our Platform. It's actually been broadly stable and feature complete for a while now; we already run it internally 24/7 and have firms using it in anger. The few remaining pieces are mostly related to documentation and examples.
After yesterday's announcement, our goal is to go back and amend our 1.0 plans to include the Bloomberg Integration Module - we'd like to ensure there are evaluation tools ready with the upcoming release. As soon as the module is ready to be released open source - assuming this will happen before the 1.0 release - we'll aim to publish it on our GitHub account, as well as make an announcement on the forums. The module will then allow anyone with a valid Bloomberg terminal or Server API instance to directly access Bloomberg data from within the OpenGamma Platform.
We've been working on and with the OpenGamma Bloomberg Integration Module for several years now, and the difference in evaluation will be pretty extreme. Our goal is for you to be able to download the Platform, connect it to a Bloomberg Terminal running on your desktop, and have real-world analytics and risk in a matter of minutes. We think that's worth waiting a few more weeks for! (Want to make it happen faster? We are hiring.)
We are now working on the final preparations, and will be updating you shortly - stay tuned.
Posted about 13 years ago
What Was Said Last
In my last post I briefly outlined some performance gains we were achieving in our tuned dense BLAS library. Evidently this was received with mixed opinion! Tuning Java does seem like an insane idea, and indeed some people pointed this out, but these are our constraints:
Code has to be as fast as possible.
Assume the hardware is fixed and tune to it.
Code has to be in Java: no JNI, no hitting Fortran/BLAS/<insert thing we'd normally use here>.
So we don't really have a lot of choice! Once every piece of algorithmic tuning has been taken into account, the main source of lost performance is going to be the JVM itself and the fact that it doesn't provide the primitives those of us from the scientific computing world are used to; neither of these constraints is negotiable. However, coaxing reasonable performance out of this combination (as the people at LMAX have exhaustively pointed out) takes effort: you have to acknowledge the hardware and take a trip back to basic computer science.
What's Next
Having been relatively successful in increasing the performance of dense BLAS kernels, we turned our attention to the invariably hard sparse BLAS operations: our notes and observations on this topic form the bulk of this article. For those unfamiliar with the notion, a sparse matrix is a matrix with only a small number of non-zero elements (yes, defining "small" is part of the problem). Sparse matrices arise a lot in numerical methods, and indeed as a result of certain mathematical operations that tend to create some form of diagonal dominance, though this is not strictly so. Additionally, there is no real condition on the pattern of the non-zero elements or on what constitutes a sparse structure.
Why?
The reason we treat sparse matrices separately is the gain in both performance (we don't need to perform operations involving zero entries) and memory footprint (we don't need to store zeros). For example, multiplying a sparse matrix by a vector should take time linear in the number of non-zero elements of the matrix, as opposed to the dimension of the matrix. These performance gains can be carried over into other linear algebra operations, but as we'll see later this is easier said than done, particularly given the utter lack of direct memory manipulation in the Java language.
Storage Formats
First, we shall address the manner in which to store the sparse matrix data. Most Java programmers, when asked this question, reach for a Map implementation, and rightly so: a Map<Pair<row number, column number>, value> would certainly suffice and be very easy to use. The problem, as always with built-in overkill, is that it is probably slow. Notably, the more we play with getting performance from Java, the more we have come to the conclusion that if the Java code doesn't look like a variant of Fortran, it's going to be slow. So although a Map has some advantages, as we'll see later, the bloat is a speed killer.
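To make that concrete, here is a rough sketch of the map-backed approach (illustrative only, not our implementation; a packed long key stands in for the Pair of indices):

import java.util.HashMap;
import java.util.Map;

// Illustrative map-backed sparse matrix; a packed long key stands in for
// Map<Pair<row, column>, value>. Easy to use, but every get/put boxes the
// value and the iteration order has nothing to do with memory layout.
class MapSparseMatrix {
  private final Map<Long, Double> entries = new HashMap<Long, Double>();

  void set(int row, int col, double value) {
    entries.put(key(row, col), value);
  }

  double get(int row, int col) {
    Double v = entries.get(key(row, col));
    return v == null ? 0.0 : v;
  }

  private static long key(int row, int col) {
    return ((long) row << 32) | (col & 0xFFFFFFFFL);
  }
}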
From the idea of a map, we deduce that to describe the matrix all that is needed is a tuple of <row number, column number, value>. The Coordinate Format matrix (referred to as COO) is perhaps the most basic sparse matrix storage format, and it directly reflects this tuple layout by storing the data as three arrays: two int arrays (one of row indices and another of column indices) and a double/float array of the corresponding values. Using the following matrix as an example:
1 3 0 0 12
0 4 6 8 0
2 0 0 9 13
0 5 7 10 14
0 0 0 11 0
A COO matrix would store the data as:
int[] columnIndex = {0, 1, 4, 1, 2, 3, 0, 3, 4, 1, 2, 3, 4, 3};
int[] rowIndex = {0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4};
double[] data = {1, 3, 12, 4, 6, 8, 2, 9, 13, 5, 7, 10, 14, 11};
This example highlights two important points. First, the memory footprint of the stored data is larger than that of the original data (3*14 vs 5*5). Second, there is exploitable information redundancy in the rowIndex variable.
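As an aside, a matrix-vector multiply over a COO matrix is just a single walk over the non-zero entries, so the work really is linear in the number of non-zeros; here is an illustrative sketch (not our production code) using the three arrays above:

// Sketch: result := A*x for a COO matrix built from the rowIndex/columnIndex/data
// arrays above. One pass over the non-zero entries.
static double[] multiplyCOO(int[] rowIndex, int[] columnIndex, double[] data, double[] x, int numRows) {
  double[] result = new double[numRows];
  for (int k = 0; k < data.length; k++) {
    result[rowIndex[k]] += data[k] * x[columnIndex[k]];
  }
  return result;
}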
Our next matrix format is the Compressed Sparse Row storage format (CSR), which takes advantage of this redundancy by storing only a row pointer from which it is possible to compute an offset into the data. This row pointer is formed by keeping a cumulative count of the number of non-zero data elements encountered up to the start of a given row in the matrix. This is probably best described with an example, so continuing with the above matrix:
int[] columnIndex = {0, 1, 4, 1, 2, 3, 0, 3, 4, 1, 2, 3, 4, 3}; // This is the same as for COO
int[] rowPtr = {0, 3, 6, 9, 13, 14}; // This effectively stores a cumulative data count
double[] data = {1, 3, 12, 4, 6, 8, 2, 9, 13, 5, 7, 10, 14, 11}; // This is the same as for COO
Note that the rowPtr variable contains (number of rows + 1) entries. The extra entry accounts for the number of elements in the final row and allows the number of elements in any row to be computed easily, which is handy for induction variables that index in the column space. For example, to walk through the data in row 2 (assuming 0-based row indexing):
int index = 2; // the row we want to walk (0-based)
for (int i = rowPtr[index]; i < rowPtr[index + 1]; i++) {
  double value = data[i]; // an element of row 2; its column is columnIndex[i]
}
From this it is clear that, thanks to the additional entry in rowPtr, the same code can be used for all the rows, which considerably simplifies algorithms.
In addition to CSR there is an equivalent storage format in which the rows and columns swap roles, so that offsets are computed by column and row indices give the locations within each offset. This storage format is known as Compressed Sparse Column (CSC). By inspection it is obvious that the CSR and CSC formats will have different total information content for non-symmetric matrices; however, both will have a lower information content than COO (in most real-world cases).
So we've visited the basic matrix storage formats for sparse matrices. Unsurprisingly they are also present in Python, GNU Octave and hardened system maths library collections like suitesparse. Now for the interesting bit...
Why Sparsity Is A Pain
Sparse matrices are invariably a pain to deal with. As demonstrated previously, it's possible to actually increase memory usage if the matrix is insufficiently sparse for the storage format chosen. However, this isn't the main gripe with these matrices.
The main problem arises from the fact that the code will execute on hardware and this hardware behaves in a certain way. Without going into the ins and outs of processor behaviour and memory hierarchy performance, the full deviousness of tasks involving sparse matrices is hard to explain. However, for now we need to bear in mind two things:
1) Processors can quickly access data so long as the next element to be used is stored in memory directly after the element presently being accessed.
2) Operations are best pipelined by making use of a mix of instructions, and for maximum throughput on the floating point units the data should be coalesced into SIMD operations (as seen in my previous post, SSE isn't that great in Java).
Now, if we take as an example the CSR format matrix "A", containing the information previously presented, and want to do a sparse matrix vector multiply operation (SpMV, as in, result:=A*x), we can use the following code:
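A minimal sketch of such a kernel is given below; the variable names (values, colIdx, rowPtr, result, x) are the ones used in the analysis that follows:

// Sketch of the CSR SpMV kernel result := A*x; values, colIdx and rowPtr are the
// CSR arrays described above, x is the dense input vector, result the dense output.
static void multiplyCSR(double[] values, int[] colIdx, int[] rowPtr, double[] x, double[] result) {
  for (int i = 0; i < rowPtr.length - 1; i++) { // one iteration per row
    double acc = 0.0;
    for (int ptr = rowPtr[i]; ptr < rowPtr[i + 1]; ptr++) {
      acc += values[ptr] * x[colIdx[ptr]]; // x is dereferenced via colIdx: not stride 1
    }
    result[i] = acc;
  }
}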
Performing a bit of basic analysis on this algorithm with regard to points 1) and 2) above, we can see that the variable "x" is accessed at the index dereferenced through the column index. This does not bode well for point 1): although values[], result[], rowPtr[] and colIdx[] are indexed in a stride-1 fashion, "x" is not, and random data access is rather slow (pipeline stalls, lack of hardware prefetching, etc.). As for point 2), if the data cannot be fed to the floating point units fast enough, then their SIMD nature cannot be exploited (ignoring Java's lack of an easy way to express this - the problem is invariant to language).
To attempt to improve the speed of this operation, we threw the standard set of techniques at the code (loop unwinding, strip mining, etc) and even weirder attempts like trying to force prefetching via loop skewing/peeling and reducing indexing to single bytes. However, none of this could overcome the inherent slowness caused by poor locality of reference.
A Quick Look at Basic BLAS Results
For our own interest we compared the OG implementations of COO and CSR SpMV with equivalent operations from:
Colt - has sparse support and uses threaded SpMV().
Commons Math - has sparse support
Universal Java Matrix Package (UJMP) - has sparse support
The following were not used (reasons given):
Efficient Java Matrix Library (EJML) - no support
Jama - no support
jblas - wrapper to native code, cheating!
JScience - broken
Matrix Toolkit Java (MTJ) - no support
OjAlgo - no support
Parallel Colt - wrapper to native code, cheating!
For the tests we used a set of increasingly large matrices with a fixed sparsity and ran essentially the same setup as before (the harness code thrashes the cache, warms up the JVM and takes multiple timings; the machine is a reasonable desktop machine). The results we obtained are below, with each pair of graphs corresponding to the percentage of non-zero entries given in their titles. The horizontal axis of each graph indicates the matrix size being tested (matrices were square, n x n). The graphs in the left column present the raw timings obtained, plotted on a log10() scale in the vertical axis. The graphs on the right show the relative speeds of the sparse matrices from the tested packages against the two OpenGamma sparse matrix types: the dashed lines compare against OpenGamma COO and the solid lines against OpenGamma CSR, again with the vertical axis at log10() scale. The plots for 25% and 50% non-zeros are missing results from the JScience implementation because the run times were so long (running into days) that we stopped them!
As a small post-analysis: maps have an advantage in that adding elements is relatively easy, whereas adding elements to CSR or COO is less so and can involve some quite expensive memcpy()ing; evidently we haven't tested the performance of this yet! However, if the data structure is immutable, as it often is in numerical applications (as a result of some properties of a numerical model, i.e. element density and ordering), then operations involving the fixed sparse matrix are of more importance. Furthermore, a large number of numerical methods rely on the repeated computation of matrix-vector products (a huge number of Krylov subspace methods, for example), so it is important that this kernel is fast. Evidently it is this sort of operation that was tested above, and the OpenGamma implementations perform relatively well. It is also interesting to note that throwing threads at the problem, as done in the Colt library, does not actually increase the performance of the operation unless sufficient arithmetic density is available within the data set. This can be seen first in the n=900 and n=1000 matrices with 10% non-zeros; it is at this point that sufficient work is present for the threads to be kept busy (perhaps as a result of less false sharing or better out-of-core cache characteristics). Finally, for reference, an additional screenshot of jConsole (below) shows that the CPU usage is incredibly poor; this is likely due to the poor locality of reference which inherently hinders the floating point operations (the sudden jump in the number of threads used is due to the threaded Colt library).
Things That Are Going To Make Life Hard
This leaves us in an interesting place, and it turns out that solving the previously discussed locality-of-reference problem also helps with other associated problems occurring in functions such as matrix decompositions. Given that most whole-matrix operations can be rewritten in terms of a row or column permutation of the matrix (assuming infinite precision arithmetic), we can try to apply techniques to create a form more amenable to a given performance model. For example, in the case of matrix-vector multiply it would be ideal if we could find a permutation that moved the data in such a way that contiguous blocks formed near the diagonal, so that at least some stride-1 access into the "x" vector became possible. It turns out that a similar behaviour is also needed for algorithms such as sparse LU decomposition and sparse Cholesky decomposition, in that case to reduce the number of fill-ins caused (a fill-in is a write to a part of the matrix structure that was originally zero); basically the same permutation idea, but with a different optimisation criterion. Amusingly, it transpires that this problem is NP-Hard. However, thanks in large part to the GPU community, considerable research has continued in this area, as any poor locality-of-reference problem on a CPU is considerably worse on a GPU. Furthermore, a number of further highly specialised storage formats have been derived, making use of combinations of compressed sparse rows, blocks and multiply-packed indexes. These formats will also increase performance, but generally rely on permutations found by attempting to solve the NP-Hard problem, and on a sufficiently high level of arithmetic intensity (due to a lot of data!) such that reordering and branching code costs can be amortised.
To demonstrate a possible bandwidth reduction on the previously used matrix, the permutation A(P,P), as given below, tightens the matrix around the diagonal, thus increasing the possibility of a performance gain.
P = { 5 1 3 4 2 }
A(P,P) = 12 0 9 0 0
11 1 7 0 0
0 0 0 10 5
0 3 8 0 6
0 2 0 0 4
Conclusion
Essentially we reach an impasse. We are currently working on bandwidth-reduction and fill-in reduction algorithms to try to increase the performance possibilities within the sparse classes. Sparse decompositions are also a rather challenging area programmatically, partly due to the algorithmic complexity required (building permutations, computing update trees, pruning trees, updating partial matrices, etc.), and partly because the lack of pointers in Java makes life generally hard work. Still, we shall persevere, and next time I write we should hopefully have some working prototypes.
Posted over 13 years ago
Since we went public with the OpenGamma Platform, we've been overwhelmed by the number of developers, analysts, and risk managers who want to combine our data management and calculation capabilities with the statistical power of the R environment.
Much like we decided to actively embrace Excel as a front-end for driving an OpenGamma Platform installation, we thought that if people wanted to use R, it was up to us to make sure that we had the best support for it possible.
As the first public sample of just what we're getting at, consider the ChartSeries3D component. The example they provide plots a year's worth of yield curves, which makes it a pretty ideal way to showcase that, rather than pulling this data from a file, you can pull it directly from the Historical Time Series service in an OpenGamma Platform installation.
I wanted to start with perhaps the simplest thing possible: plot the market data points that form your yield curve definition over the course of a year.
All that from this relatively small sample of R code:
In the next installment, I'll show you how we've modified this example to not only plot the raw market data, but to actually fit the yield curve using the OpenGamma Analytics library from data loaded into the R environment, and then bring back the fully fitted yield curves for plotting.
And that's before we show you the real power of what we're doing: driving full shocks, stresses, and historical regressions of whole portfolios from R and using R to analyze the results. We think R users in finance are going to find this exciting.
Posted over 13 years ago
I'm a new face here at OpenGamma, having joined this summer, and I wanted to introduce myself and what I'm doing here. My job here can be stated simply: make the OpenGamma maths library as fast as possible!
Having come to OpenGamma with a background in applications of high performance and accelerated computing in numerical modelling, I have a fair idea about how to make code performing mathematical operations run fast. However, doing this in Java as opposed to my beloved Fortran or C is a completely different game.
First, some comments about why on earth we are doing maths in Java. Well, "write once, run everywhere", coupled with the rather magic IDE behaviour of Eclipse (although I still prefer vi ;-)) means that the code written can be guaranteed to run without having experts in cross-platform compiling around to deal with any issues. The Eclipse editor also makes unit testing, debugging and general code writing considerably less error prone. Java also has the ability to call out to native code via JNI or JNA, both of which can be used in the future, should the maths library interfaces be written in a sane manner conducive to their integration.
What currently exists for maths in Java?
There are a number of well-established maths libraries for use in Java. Two of the most commonly used are Colt and Apache Commons Mathematics, and it is these libraries that we currently use in-house. Having briefly looked through the source code, there are some features/methods that are lacking (for example, more advanced gradient-based linear system solvers), and experience suggests that the performance (in terms of speed) of these libraries is not particularly good in comparison to native code (expected) or more heavily tuned Java code (as will be seen later). On the positive side, both libraries are tried and tested and have different, but relatively easy to use, APIs.
Why do we want a new maths library?
At OpenGamma speedy maths is essential. Financial analytics uses a wide range of numerical techniques from very simple models and operations (basic integration and curve fitting) through to the considerably heavier traditional numerical modelling techniques. Clearly these models have to run in near real time or at least fast enough that their execution completes between ticks so that new analytics can be displayed soon after data arrives down the wire. Obviously, to achieve this, speed is key.
How are we writing the new maths library?
A large number of mathematical operations are essentially matrix related, and in acknowledgement of this the industry-standard BLAS APIs are quite wonderful as building blocks. Further evidence of the importance of this API comes from the chip-vendor-specific implementations of BLAS (along with extensions to sparse matrices and FFTs). Sadly, there is no complete and heavily optimised BLAS library for Java. This is where OpenGamma comes in.
So what exactly are we doing?
In the last few months I have been working on identifying code patterns and techniques that the JVM can spot for optimisation. In general, these techniques are relatively simple and already well known, and are extensively used in the high performance scientific computing world. I've been concentrating on BLAS operations, specifically DGEMV() with alpha = 1 and beta = 0 such that the operation is essentially:
y:=A*x
where y is an [m x 1] vector of doubles, A is an [m x n] matrix of doubles and x is an [n x 1] vector of doubles.
Following this, a naive implementation of the operation can be written, and it is this version that appears in many Java linear algebra libraries:
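Something along the following lines (an illustrative sketch of the usual naive double[][]-backed version, not the exact listing from the paper):

// Naive y := A*x for a row-major double[][] matrix A of size [m x n] and a
// vector x of length n; this is broadly the shape found in many Java libraries.
static double[] dgemv(double[][] a, double[] x) {
  final int m = a.length;
  final int n = x.length;
  double[] y = new double[m];
  for (int i = 0; i < m; i++) {
    double acc = 0.0;
    for (int j = 0; j < n; j++) {
      acc += a[i][j] * x[j];
    }
    y[i] = acc;
  }
  return y;
}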
So, from here we applied a whole load of optimisation magic and we've outlined this in a paper that you can download from here. Have a read; most optimisations aren't too hard to do and we found out some interesting things about the JVM.
Results:
In general, it is nice to know that it is possible to improve the speed of code, and indeed knowing what happens on the metal is worthwhile. All the results relate to the computation y:=A*x for a square matrix A of varying size. Implementation details can be found in the previously mentioned paper.
Comparison to Colt and Apache Commons Mathematics:
The purpose of this investigation was to find out whether it was possible to improve on naive implementations of a simple BLAS operation, and to provide a comparison with the Colt and Apache Commons Mathematics libraries. Results from racing OpenGamma code on fairly typical matrix sizes against a) a naive double[][]-backed matrix implementation, b) the Colt library and c) the Apache Commons Mathematics library are displayed below.
From this graph we can see that OpenGamma code runs on average at least 1.6x as fast as Apache Commons Mathematics. Furthermore, the OpenGamma code consistently runs at over 2x the speed of the Colt library code and peaks at around 2.4x. These comparisons demonstrate what is possible if you push Java hard and they give a hint of what is to come... watch this space, maths is about to get fast!
Posted over 13 years ago
Today we've released version 0.9.0 of the OpenGamma Platform, our comprehensive development platform for front-office and risk application development for quantitative finance.
While I know many of you were waiting for 0.8.1, we decided to do one better and wrap the fake data capabilities we wanted in 0.8.1 into 0.9.0. In the announcement on developers.opengamma.com, we explain why.
This release expands the calculation capabilities of our engine in handling a diversity of market data sources, drastically increases the asset coverage of our data and analytics components, and continues to refine our end-user experience, and it comes only two months after 0.8.0. We fully anticipate maintaining this pace of development to make sure that we're able to satisfy the needs of developers throughout financial services, both now and in the future!
Posted over 13 years ago
Since we got started almost exactly two years ago, we've had some significant milestones:
We raised two rounds of funding, from Accel Partners and FirstMark Capital.
We released 0.7.0, 0.7.1, and 0.8.0 of the OpenGamma Platform (and our first public release was almost certainly the largest new Open Source code drop in the previous year).
We moved into our first real offices.
We expanded the team dramatically, and are now up to 20 full time employees (and of course, we're hiring).
The last point is the one that presses on us most, as we've now outgrown our warehouse space on Southwark Street in London. While we were hoping to stay here for two years, we've just grown too quickly for that.
Tomorrow the whole team will be packing up and moving about 200 metres away to our new Headquarters, at 185 Park Street, right next to the Tate Modern and still here in the Bankside area of London. The new offices are double the size of our current space, and should handle our growth as we continue to develop the Platform and work with customers to make sure they're making effective use of it.
However, there are a few public services that we host that will be down while we move servers from Southwark Street to Park Street:
The FudgeMsg server infrastructure will be completely down while we move (that means www.fudgemsg.org and jira.fudgemsg.org).
The OpenGamma Jira instance will be down.
OpenGamma documentation will be down.
Our Crowd instance will also be down while we move.
Because our Crowd instance handles developers.opengamma.com logins, those will not be functioning during the transition. However, downloads will still be available, and forums will stay up as well.
We hope to keep the downtime to a minimum, but it may last all weekend if things go wrong. We'll announce on the forums and via Twitter when everything is back up.
Of course we'll have a housewarming open house, so expect an announcement of that in the next week!
Posted almost 14 years ago
One of the most repeated comments that people have made regarding evaluating the OpenGamma Platform is that out of the box it doesn't do a whole heck of a lot. The problem and solution to that? Data, of course.
tl;dr: We're going to be shipping sample data for evaluation in a maintenance release on the 0.8.x codeline.
Any front-office or risk system is extraordinarily data-heavy, and heavy across a number of dimensions.
Security reference data (like equity sectors and OTC contract terms);
Portfolio/Position/Trade data (what trades you have);
Configuration data (how you want to build your curves and surfaces, what calculations you want to see);
Market conventions (how particular things are quoted or traded in certain markets);
Regions and Holidays and Exchanges and ... (the list is never-ending)
And all that? All that is before you even think about the lifeblood of modern quantitative finance: the market data itself. Getting market data, either live or historical, is pretty much a requirement for anything you'd want to use the OpenGamma Engine for.
And we're not giving any of it to you at the moment.
Sounds pretty daft, right?
Why Didn't You Do X?
We've had a number of suggestions on how we could have solved this problem; let me try to address those.
First, internally we get the majority of our security reference data from Bloomberg (we're part of their developer program), and live and historical market data from our participation in the Bloomberg, Thomson Reuters, and ACTIV Financial developer programs. So we have databases here with all the data you'd need to run through the same QA process we do daily. The first suggestion we got was that we just export parts of our internal databases as "sample" data.
The problem is that we don't actually own that data. Consider the simplest case: reference data on American listed equities, where Bloomberg doesn't own the data themselves. They still own the way they've organized that data into a database, and thus they have rights over the data in our development, test, and QA database instances, because it was sourced through that connection. We can only ship sample data if we can source it (and prove we did) from a provider that permits this type of redistribution (see below; we're trying to do just that), and if it never touches any of our other processes simulating a bank or hedge fund's data infrastructure.
We've also had people suggest we should just put some random data in for market data. And that's a great suggestion, but the simple fact is that it won't work; if the data is far outside "real" market parameters, the modern quantitative models used simply won't fit at all.
You actually can produce random data that will fit the models, but you have to back out from the models and make sure that your randomization process will move in consistent ways so that models still fit properly. Basically, you have to produce "random" data that still satisfies certain statistical parameters.
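As a toy illustration of that idea (hypothetical code, not our actual data generator), one could evolve each curve point as a small mean-reverting random walk, so that successive "market" snapshots stay smooth, positive and close enough to realistic levels for the models to fit:

import java.util.Random;

// Toy illustration only: generate a daily series of rates per curve point as a
// mean-reverting random walk, so the synthetic data stays within plausible bounds.
public class FakeRateSeries {
  public static double[][] generate(double[] initialRates, int days, long seed) {
    Random rng = new Random(seed);
    double meanReversion = 0.05; // pull back towards the starting level
    double dailyVol = 0.0005;    // roughly 5bp of daily noise
    double[][] series = new double[days][initialRates.length];
    double[] current = initialRates.clone();
    for (int d = 0; d < days; d++) {
      for (int i = 0; i < current.length; i++) {
        double shock = dailyVol * rng.nextGaussian();
        current[i] += meanReversion * (initialRates[i] - current[i]) + shock;
        current[i] = Math.max(current[i], 0.0001); // keep rates positive
        series[d][i] = current[i];
      }
    }
    return series;
  }
}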
But none of this makes sense until you start to consider the roles of data sources and data providers in this market, and realize why people pay thousands of dollars a month for a Bloomberg or Reuters terminal.
Data Is Money
The best way to understand what's going on here is to rethink the role of exchanges and how they make money.
Today, for most exchanges, actually processing trades has such razor-thin margins that in many cases it is a loss-making activity. Simply put, the days when an exchange could make a significant profit from every trade done through it are long gone.
So what do they make money on? The data. When you get stock market data from Bloomberg or Reuters or Yahoo Finance or anywhere, that data is actually owned by the exchange. Your data source (Bloomberg, Reuters, Yahoo) acts as a distributor for the data provider (the exchange). The data provider may allow the data source to give that away free of charge, it may charge the data source for that distribution (if it makes the data source money through advertising or something similar), or it may require that you have your own sell-through relationship directly with the data provider.
But that data is worth money, and the more up-to-date it is, the more valuable it is. The finer the resolution of the data (down to tick-by-tick in the Level 2 feeds), and the faster it's delivered, the more they can charge.
If we started giving you even a replay of an out-of-date tickstream, you can be sure they'd have their lawyers on us faster than you could say "Intellectual Property Violation."
Data Is A Money Making Machine
If you're an exchange these days, data is your business, and therefore worth money. But exchange-traded instruments are only a small section of the quoted objects out there.
What do you need if you want to build yield curves beyond the next couple of years? Typically, IR Swap or OIS rates. Those aren't exchange traded (although they're soon to be cleared and/or settled); it's an OTC market. FX option trading? You need a good FX volatility surface. Still OTC. Fixed income optionality? While interest rate future options and bond future options are exchange traded, swaption volatilities are OTC yet again. Want to build credit curves? Welcome to OTC Credit Default Swap country.
These markets almost all trade in a very similar way: you have large "flow monsters" sitting in the middle of the market (think: Goldman Sachs, Deutsche Bank, Morgan Stanley, Bank of America Merrill Lynch), and trading with almost all the other counterparties. They're acting as very traditional market makers in quoted markets, and making money on small spreads on each trade. It's a volume business, where more trades in general means more profit.
Now let's say you're a hedge fund. You trade a lot of swaps. But you don't trade anywhere near as many as one of the large counterparties; you're one of hundreds of hedge funds they're trading swaps against. By virtue of being in the middle of so many trades, they have a view of the market that you simply can't get, and they can sharpen every single trade they make by clever use of that data.
And so they don't give it to you. They'll give you enough to enable you to trade, but not so much that you can compete with them. For example, you can get Bloomberg-provided composite par swap rates contributed by the brokers, but not the full rates available to the brokers themselves. They're so protective of this data that Markit Partners runs a special service just for sell-side institutions to get a monthly consensus check that their internal marks aren't too far off market.
For the large sell-side institutions in these markets, the data isn't just money, it's the machine that drives their entire profit base.
Even if we had this data, do you think they'd like it if we gave it away?
What We're Doing
So now that you know how we (as both a vendor and an industry) have gotten into this mess, what is OpenGamma doing to help you go from download-to-live-risk as fast as possible? How are we helping you evaluate the platform on your own without having to contact us?
First, we're going to start shipping with some sample portfolios. These will largely be confined to OTC derivatives and fixed income products (because we can manufacture the data).
We'll also be shipping with an artificial set of fake ticking and historical market data sufficient to price and run basic risk measures on for those sample portfolios. This data will not be real in any way, but it will at least allow for the models to fit for curve generation and pricing in a few currencies. Of course, the data generation utilities will be Open Source as well.
Where we can do so (because we own the data), we'll be including our own conventions and curve and surface definitions to show you how to get started.
We'll start working closely with the originators of reference data in the exchange-traded space (going straight to the exchanges) to either include that in a download pack, or provide an OpenGamma-integrated feed ourselves for evaluation purposes.
We'll also start to accumulate every "free" source of historical data we can and start making them super easy to integrate. Because each source will have different rules, this will be an ongoing project, but I figure if you can get it from Yahoo or Google Finance, there's no reason we can't offer it ourselves to our developer community by going back to the original source.
It's too late to include this in 0.8.0 (we're code complete and in final release validation), but we're going to get this out as a maintenance release on the 0.8 codebase so you don't have to wait for 0.9.
So why in the world didn't we do that in the first place? We hadn't factored in the number of you who have told us that you're not allowed to play around with new technologies at work, or aren't allowed to hook up to any "official" data services without prior approval. So we know there are a lot of you who are doing your OpenGamma evaluations at home, or in stealth mode when nobody's looking over your shoulder. The moment you have to ask your friendly RMDS administrator for an account for a new experimental technology, the gig's up.
We hear you, and we've all been there ourselves trying to evaluate new technologies without a manager wondering why you're looking at something new that's unsanctioned.
Expect the double-dot on 0.8.x with the sample data in a few weeks!
Posted almost 14 years ago
Since the 0.7.0 release announcement, and the 0.7.1 follow-up maintenance release, we've been pleased at just how many of you have decided to download the OpenGamma Platform. Some of you are downloading from home, but many of you are downloading from your firm. There have been downloads from hedge funds (large and small), investment banks, proprietary trading firms, commercial banks, asset managers, and systems integrators.
Although this is primarily a developer preview release, as we said in the original announcement, the system is actually a lot more functional than you can see in the 0.7.1 examples. In particular, for legal reasons we can't release the major components you need to quickly get up and running with real-world sources of data (our market/reference/historical data integration with Bloomberg, Reuters, and ACTIV data sources). In addition, our Excel Integration Module, which exposes the entire Platform in Excel (and can be used for any arbitrary Java code, including your own), also can't be released as Open Source.
Here's the thing though: we actually can let you have them as part of a formal evaluation or internal PoC. We just have to get some disclaimers signed (so that Bloomberg doesn't sue us if you violate your agreement with them).
But choosing to bring us onsite to talk with you and walk you through the status of the platform has additional advantages for you:
We're willing to sign NDAs, which allow you to talk to us about things your manager or corporate information security officers might not allow you to post on OpenGamma's Developer Forums.
We can fast-track your understanding of the system and how you could integrate your existing code and systems into the Platform.
We can find out what you're thinking of using the OpenGamma Platform for, and suggest which features (current or forthcoming) might be ideal for you.
So what's in it for us? Obviously, we think you'll be so impressed with our commercial capabilities that eventually you might write us a cheque or two. We're not going to lie about that. But this offer isn't about that.
Primarily, we want to talk to you so that we can make the Platform better. Every bit of feedback we get from you helps us prioritize features and focus on the areas you want us to focus on. And we know that you may be far more comfortable telling us in person (rather than posting it on the internet), given the secretive nature of our industry.
So please feel free to contact us via our website, or just email me or Henning. We'd love to meet with you, and we think you'll find it just as useful if you're already looking at the code!