This blog post is a continuation in the little series about CosmosDB, that is started with the post CosmosDB: NoSQL, NoDBA & NoProblem?. In this series I am looking at the different angles at the available information on the number of different APIs in this database.
In the past 9-10 months, since the official presentation of the CosmosDB, there has been constant announcements of the new API support, with SQL | MongoDB | Table | Graph | Cassandra being currently available as of the moment of this blog post creation and without any speculation about any further ones, I would like to focus on those on them only.
When you declare that you do support something, do you specify HOW MUCH OF IT you actually support ? Does it really matter ?
Do you think how your potential/actual customers might find it less amusing once they face/discover the real state of the affair?
The initial offering of the DocumentDB (CosmosDB’s old name) was including this SQL-like syntax that could be used for data retrieval in the CosmosDB.
Let’s run a couple of queries against the primitive JSON file that I have uploaded into the new collection:
The simplest of them all – the default one, which will select all the content and deliver the first 100 rows:
SELECT * FROM c
Works fine as expected.
Now let’s sort the data by the timestamp:
SELECT * FROM c WHERE c.date >= "01/01/2000" ORDER BY c._ts DESC
All is fine.
What about sorting data on the multiple columns:
SELECT * FROM c WHERE c.date >= "01/01/2000" ORDER BY c._ts DESC, c.granulartiy
Errors: Multiple order-by items are not supported. Please specify a single order-by item.
That’s a bummer and rather a very much expected feature, right ?
Let’s try GROUP BY (since I keep on seeing the CosmosDB being positioned as an analytical solution as well)
SELECT c.granularity, max(c.date) FROM c GROUP BY c.granularity
Surprisingly this query will fail:
Syntax error, incorrect syntax near 'GROUP'
meaning that the GROUP BY (and actually DISTINCT as well) are not supported by the Cosmos DB.
INSERT, DELETE & UPDATE
Do not exist in the engine at all. You won’t be able to find any documentation on them.
I guess that this shall break a number of Database Developer dreams of seamlessly migrating their existing applications to CosmosDB.
Whenever you will need to delete the documents, you better be ready to delete them all by deleting the collection or go through the thousands of them and delete each one by hand.
Right now to my understanding, you can extract up to 200 values with a single query, meaning that a query like this one:
SELECT TOP 250 * FROM c
would deliver 200 results at maximum. This is a very serious limit if this will get confirmed.
These are the most regular predicates that anyone constructing regular queries is using on the daily basis, but this functionality is not supported in the SQL API in the Cosmos DB currently.
Such a query as
SELECT * FROM c WHERE c.date >= "01/01/2000" AND EXISTS (SELECT TOP 1 * FROM c2 WHERE c2.Date = c.date)
Syntax error, incorrect syntax near 'EXISTS'.
That is not what I would expect from a modern database that is supporting SQL-like syntax. I understand that EXISTS is not the most trivial iterator to implement on the globally distributed platform, but hey – that is what the clients are already using and if you want to get them on your platform …
How on earth a web application supposed to function effectively without them?
Partial Document Updates
Very much needed. Sometimes, you just need to update a part of your JSON. Especially if your JSON is pretty big. Stay tuned, I will show in the next blog posts how fast it can be.
Nope. I believe these are the functionalities that are very far from being trivial to deliver in any platform, but that what is modern analytical platforms are providing.
Sorting on the Joined table columns.
Let’s try a following query, where I am trying to sort on the joined “table” c.pets:
SELECT f.id AS familyName, c.givenName AS childGivenName, c.firstName AS childFirstName, p.givenName AS petName FROM Families f JOIN c IN f.children JOIN p IN c.pets WHERE p.givenName = "Shadow" ORDER BY p.givenName
The error message is clear, and it would break the heart of a lot of database developers:
Order-by over correlated collections is not supported.
– Cross-Document Joins – Nope.
– Powershell? Forget it! You can create a new database, do a couple more things but there is no serious support for this language. See the documentation.
– CORS (Cross Origin Support Sharding) Support ? Not really.
Think about some other important functionalities, such as Identity, which are extremely difficult to implement in the distributed environments, and think about PaaS such as Azure SQL Datawarehouse, where the implementation shown that even though it might be brilliant, the real customers might not be OK with it. And they (customers) need it. Badly.
SQL API Thoughts
How much of the ANSI SQL does the original SQL API supports ?
It would be ignorant to believe that any vendor would support ANSI SQL entirely, but still a feeling, a percentage would be very nice, so that the clients would not start projects that will hit the wall with their head, because a particular functionality is not supported.
That would be an honest option, and the client would really appreciate it.
I am not qualified to go judge the MongoDB support at the moment, but given a truly great documentation page at Microsoft Docs –
MongoDB API support for MongoDB features and syntax, I believe that the gap between what is supported and what is not is getting extremely small. I do this judgement not out of this professional or tested opinion, but I base my opinion on the views of the specialist that I trust.
Not supported features that are really standing out are:
– $pull with condition is not supported. $pull with conditions is a very effective way of removing instances an existing array all instances of a value or values that match a specified condition, and overall it comes to me as a great optimisation of the same removal commands that are available.
– the support for the users and roles (this would prevent the heavy usage in the big enterprises and it is a strange situation, but the majority of the small applications should be fine with it)
– manual replication (i am truly fine with this one being missing, because we are running a PaaS, the service itself should decide when to replicate the data, based on the selected transaction consistency level – and I do love those 5 options of the consistency that CosmosDB provides. )
No problem here with the lack of the manual sharing, because again this is PaaS, but maybe some of the real users and tuners of the MongoDB will have a different opinion on this matter.
Still, I would love to have a concrete percentage of the supported functionalities and a good roadmap for the future.
Graph API is essentially a Apache Tinkerpop language Gremlin (graph traversal language) implementation.
The known to me functionalities that are not supported are:
– variables – this is really very upsetting for me. The Gremlin recipes are using the variables very extensively and the lack of the variables will prevent people of taking advantages of the Gremlin. Also I do not want to imagine how to write complex nodes and edges queries without being able to update every single part by hand. Sometimes, just sometimes, you might need to update big scripts, and you won’t enjoy the result if you have updated something wrong.
– local compute vs on Azure Compute calculations. Well, sometimes the calculations of the graphs and queries will be executed on your local machine…
This problem has a connection with used driver. This sounds like a significant problem to me – because most of the users won’t be able to figure out why the queries are taking hours instead of minutes to run.
– The data import is still in the preview, even though the GA (General Availability already took place 1 month ago). I am wondering how people using Neo4J, GraphDB, OrientDB or others are supposed to migrate to Cosmos DB.
And I would like to have the percentage of the actual API supported. Really.
This looks like a very superior replacement for the Azure Table Storage. Azure Table Storage might still keep some of the more particular usage patterns, but I expect the Table API to take over the future deployments.
What is the actual percentage of the features supported ?
I am truly wondering…
This API is still in the preview right now, so it is totally fair that the totality of the information is not available.
Particular Partial Percentages Thoughts
In my views, Cosmos DB APIs are being presented in the wrong ways. A technical specialist understands that different platform has different implementations and supporting 100% of the “foreign API” is mission impossible – it is too complex, too incompatible and too darn expensive.
A technical specialist understands that any re-compilation of the API implies delays and some performance delays because different visions and different implementations have different limitations.
Imagine the following situation as on the picture below:
Consider the following statement: on the left side you will find 3 glasses with water. This statement is technically correct. All glasses contain water. One of them is full, the other one is half-full, the third one barely has any water. You can still say that all of them are glasses with water, but it is not precise and you know it. You know what people will believe, but you advance with the information, because it is cool or simply because you can. A person who cares will still be able to point you about the differences, and if you want to be precise, if you want to make a true impact and be considered precise – you will specify the percentage of the water in the glass – wether it is 100%, 50% or 0.5%.
Let’s get truly technical and let us be precise about what is being supported and what is not.
I want to have exact percentages of the support for each of the APIs and not just a message that it is supported.
to be continued …