Understanding Workday Data
How to Analyze and Choose the Correct Schema for Multiple Workday Business Objects
This is the first of several articles about Workday and Workday Architecture that capture knowledge gained from my university’s recent first-time on-boarding to Workday Finance and HR. This is an article that I wish I had been able to find and read at the beginning of our multi-year project. Had the simple ideas presented here been foregrounded in some way, I might not have put in so much overtime. This article is primarily aimed at other Integration Architects, like me, but if you are an integration developer or a report writer on the Workday platform then this article may interest you as well. Your feedback is always welcome.
About Workday
Workday is a modern Software as a Service (SaaS) platform that has an origin story involving two executives fleeing the Oracle acquisition of PeopleSoft and starting a new company. What we have as a result of their effort is a decent Enterprise Resource Planning (ERP) system that has a lot of PeopleSoft DNA driving the choices the company makes for it’s platform and tools. Many of these choices make Workday, and the practice of operating Workday, more like a traditional monolithic application than a modern cloud-first platform.
Let me clarify this by massively oversimplifying what Workday is to me: A distributed architecture (think containers and queues and object storage and Java) with a closely held technical ecosystem for development and integration that puts integration squarely in the Service Oriented Architecture (SOA) realm, but lacking in the kind of tools and support that one might expect when thinking about such technologies. It’s SaaS and we aren’t supposed to be looking under the hood, so we get to use our imaginations to figure out what is going on when we’re trying to solve problems. Often we must reign-in those imaginations to meet Workday on it’s own terms or risk the success of our creations.
I will expound on Workday integration in future articles, but this article is specifically about Workday data from the perspective of someone trying to use that data in a report or an integration. That person will almost certainly find themselves trying to explain (oops! I meant “imagine”) just what the heck might be causing that crazy, expectation-defying thing happening right in front of them.
Workday Data
Let’s begin by recognizing some two really basic facts about Workday data:
First, Workday is an in-memory object data store. Workday uses SQL for persistence and NoSQL for document persistence, but you and I can’t access those directly.
Second, machinable data can be liberated from Workday via one of the three following methods:
- An Advanced Report or Search Report that has been enabled as a web service
- An existing SOAP or ReST API from the public Workday API catalog
- A manual export in the form of an Excel Spreadsheet, initiated in the UI by an end user
Workday’s various integration systems use one or more of these methods for extraction. To my knowledge, unless you are an implementer (Workday’s term for a partner consulting company) Workday does not support bulk data transfer.
The Sample Problem
We need to work through our analysis using a simple example. So, I wish to ask Workday the following question, which from here on I will call “The Sample Problem”:
Who are all of the Workers who have Academic Appointments and what Academic Unit is each appointment for?
I work for a university. I know from my understanding of my institution’s policies that a Worker can have more than one Academic Appointment, a single Academic Appointment made for a specific Worker is to one and only one Academic Unit, and an Academic Unit will collect many Academic Appointments. For those who don’t work at a university, this three-object relation is what identifies many of our workers as faculty.
Let’s take a traditional data modeling approach to the above question. An abbreviated entity relationship diagram for The Sample Problem might look something like the following:
Here we can see expressed the cardinality of the relationships described above. If I were to put this into a SQL database, each box would become a table and the primary and foreign key relationships would be applied by our data definition language. This is elementary, and at this stage of of our analysis we can roughly translate each entity (that is, each box) into a business object of the same name in Workday. In preparation for that exercise, the number below the name of each entity is the number of instances (rows, in SQL terms) that exist in real data at my university. The numbers in parenthesis in the field sections for each entity are the number of fields that exist on the equivalent Workday business object of the same name.
This is where we start to diverge from Workday. The ERD diagram above represents one single schema. The relationships are hard-coded as metadata, specifically the Employee_ID and Academic_Unit_ID fields that we would have to add to the Academic Appointment table, and that fact that we tell the DBMS they are actually relations by marking them as foreign keys, is inherent to our need to express the relationship of this one single schema. In workday, true cardinality can vary depending on how we assemble the objects.
We also start to run into trouble with the numbers involved. In most relational database systems we would never be able to come anywhere close to the 4500-plus fields that are on Worker if we try to describe them as columns on our hypothetical Worker table. Clearly something different is going on in Workday, but we need to change our notation now to capture what is actually happening. Let’s start with this object-relation view of these Workday business objects taken from a development tenant in use by my university:
I am about to make an argument for the existence of nine separate schemas based on this representation, but first we need to unpack a few things.
Workday Business Objects are represented here as circles, and their relationships to the other objects are represented as directional lines. The number on each line describes how many relationships exist between an object and the object that it points to, including for each of these objects at least one relationship with itself. In most cases, the relationship represented by an instance of an directional line takes the form of a field available on the business object that pairs with a field on the Related Business Object, which is the object that the arrow is pointing to. Worker has four relationships to Academic Unit, and Academic Unit has two relationships to Worker. In more general graph theory terms, Business Objects = vertexes and Fields = directed edges.
Thinking about it this way means we can start to analyze the complexity of this particular set of objects. I would like to make one observation before we start, however. Even though our three objects have multiple paths to each other, our usage of this data in Workday when we consume the API’s or write reports is typically using a subset of the available relations, so any analyses can be scoped to concepts that apply to Directed Acyclic Graphs, or DAGs. The heuristic news gets even better, because Workday reduces our options significantly when we go after this data. So while the above representation of our three objects creates 121 possible DAGs (ignoring self references) that involve all three objects, We don’t typically get to choose any field we want for the edge relation. Context does that for us. We do however get to choose which Business Object is primary, and that leaves us with two possible schema patterns when we want to use all three objects: Accessing each business object directly from the primary object, or accessing the fields of a business object through the relationship path going end to end. For simplicity think of this as a “Y”-shaped path vs an “L”- shaped path. This means three variations, one “Y” and two “L” patterns, for each object when said object is the primary object, for a total of nine possible schemas.
For those of us who don’t have the super-human ability to parse that last paragraph and visualize it correctly, here is a simple picture that might serve to illustrate this a little better. In the following nine patterns, the object on top will represent our primary business object in each schema:
If you are still with me, this is where things start to get interesting. The choice of schema determines the structure of the data that Workday returns. If you want a report or an integration to be correct, you must solve for and use the schema that actually relates the data correctly while answering your basic question, which in our case is The Sample Problem (remember that? Way up above…). To test each of these schema and evaluate them against the The Sample Problem we can build a a simple Workday Custom Report for each of the nine patterns and make a few observations about their suitability. I did exactly this, and the rest of this document will be spent analyzing the results and choosing which schema best satisfies The Sample Problem.
Schema A: Worker as Primary Business Object
This is a “Y”-shaped schema with Worker as the primary object. The actual schema looks like this in the Workday tenant:
In these examples, I use color to represent scope. For the above drawing the Academic Appointments and Academic Units fields reside on the Worker object and they each act as a Multiselect Field — a Workday term for one to many cardinality. Because these fields reside on Worker, I colored them and entire edge blue to match the color for Worker.
A report generated using this schema will return data that looks like the following in the Workday UI:
Most people would look at this report and easily interpret it. There are names in the Worker column, those people who have Academic Appointments show their Academic Appointment and Academic Unit assignments. Just what we wanted, right? Well, no. There is one very big problem. Using the “Y” pattern means that the Academic Appointment shown is not actually related by Workday to the Academic Unit shown in this output. In cases where a worker has more than one appointment, the academic appointments and academic units listed may not be in the correct corresponding order to each other.
This is mostly invisible to the human viewer of a report, but it’s really easy to see when we look at the XML output, presented here and cleaned of all noisy attributes to make the structure readily visible. I’ve repeated the color scheme from above to make it clear where each element group is coming from :
There are two Worker records represented in this output. The first Worker had no Academic Appointment, and therefore has no data beyond the contents of their Worker record. The second Worker had two Academic Appointments. Workday independently pulled the Worker relationships for Academic Appointment and Academic Unit, and does not relate them to each other in any way. If all you had were the XML, say, as the recipient of the output of a Workday Integration, you wouldn’t be able to solve The Sample Problem without making a risky inference; like assuming that the Academic Appointments come out in corresponding order to the Academic Units, something that you can’t guarantee. If this were the structure returned for use in a Workday Business Process, it would be tricky to figure out why that process was only correct “most of the time”. All that need happen to break your solution would be for Workday to get one of the items out of order, and your assumption and possibly your business process or integration would fail, or even worse, successfully deliver incorrect information.
This brings me to an interesting side note, which is that I think this scenario happens a lot. Someone gets an assignment like The Sample Problem, sits down to create a custom report in Workday, and starts with Worker, praying that they won’t have the hassle of trying to build a tricky calculated field chain for something they can’t reach. They are elated when they find both Academic Appointment and Academic Unit available as business objects on Worker, and they see report results that look correct. Relieved, they ship the report and move on. Then sometime later somebody notices that not all the data are correct, are mystified as to why, and a workaround happens. What is that workaround? look at this one line from the report above:
Do you see those blurred-out areas on the Academic Appointment and Academic Unit columns underlined in red? In both cases in our tenant, those are the Academic_Unit_ID. I suspect that at some time during our deployment, somebody looked at a report where multiple Academic Appointments weren’t lined up with their Academic Units, and the naming policy was adjusted to always include the Academic_Unit_ID in the visible title of the Academic Appointment and Academic Unit instances. Fixing schema is hard. Taking advantage of human pattern recognition is not, especially under the challenging resource constraints of an initial Workday on-boarding.
This kind of workaround is prevalent in our tenant, and may have even been a best practice handed down from our implementation partner. I never really asked anyone about the origin. I have personally written integrations that take advantage of this naming practice to get object ID’s that I couldn’t easily reach within the scope of an Advanced Report, opting to parse the ID out of the object name rather than build a calculated field chain. I’m not proud of that, but I am also not alone in this practice. I have also had the reverse headache of having to parse a name to remove sensitive information like this before delivering the data elsewhere.
At this point you might be asking what this schema can actually do, since we didn’t solve The Sample Problem. The one relationship we can count on here is that we know which Worker “owns” the individual collections of Academic Appointments and Academic Units. I could see this being important for testing purposes. You could ask whether there are any Workers who have a non 1:1 ratio of Academic Appointments and Academic Units. It might be valid for a Worker to hold more than one Academic Appointment to a single Academic Unit, or it may reveal Workers with Academic Appointments and no Academic Unit, etc… In other words, you can ask questions where you must deliberately break the expected relationship to reveal the wellness of instance data in Workday, and this looks like one viable strategy.
Now that we know what kind of analysis we are after, let’s look at the results of the rest of our schemas, with more abbreviated details.
Schema B: Worker as Primary Object, Academic Units pulled through Academic Appointments
This an “L”-shaped schema with Worker as the primary object, in which we access Academic Units via the fields available on Academic Appointments instead of using the fields available on Worker. The actual Schema in the tenant looks like this:
Note my use of language. I talk about pulling Academic Units through Academic Appointment. It simply means that in a Workday Advanced Report where I’ve chosen Worker as a primary object, I’m reusing the Academic Appointments field in the Object Column of the report definition, and then referring to the Academic Unit Object via the Academic Unit field that is on Academic Appointment instead of the Academic Unit field on Worker. This creates a path through the objects that must obey the field relationships.
Think for a moment about what the instance path of this pattern would look like in memory for a Worker with two Academic appointments. There would be one instance of the Worker, and Two instances each of the Academic Appointments, each having a single relationship to one Academic Unit. Two paths from the primary object Worker that structurally produce the correct data. This is exactly what we are going for. It’s so important that I am going to channel my inner James Carville and paraphrase this: “It’s the path, stupid!” Solving for any graph schema is always about the path.
Here is what the ouput of this report looks like in the Workday UI:
The actual row instances are different in this snippet from the one above because I have not applied any kind of filter or sort, so they come out the way they come out. Notice however that this is essentially the same output as Schema A. The big difference here is that we know the Academic Appointment and Academic Unit columns are related. This is clear in the XML:
So, this is it, right? Were we really lucky enough to find our schema on the second try! Technically, yes, this is in fact the schema we want because it satisfies The Sample Problem, but it has another problem that we haven’t explored yet. That problem is performance. Retrieving this data takes over two minutes! If you look back at the original object diagram above, you will notice that I included the number of Datasources available on each object. In Workday, most business objects have at least one Datasource. This is actually how we get to the data in that object. Each Datasource has a subset of the total available fields on the object, may or may not be indexed, and it defines the Workday security groups that allow for access. In these first two schemas, I chose the Datasource for Worker called “All Workers”, which is not indexed and which includes every worker whether or not they have an Academic Appointment.
Yes, Schema B is what we want, but we’ll need a modified version. If you are impatient you can skip down to the end of this article to read about what we can do to enhance the performance and limit the return scope. In my opinion, performance is a secondary concern to getting the correct shape for the data, so I am going to continue our structural analysis of the schemas before doing the meta-analysis that lets us choose the correct approach.
Schema C: Worker as Primary Object, Academic Appointments pulled through Academic Unit
This schema is the last of our Worker as primary schemas, and it reverses the order of the Academic Unit and Academic Appointments from Schema B. The actual Schema looks like this:
The first thing to notice about this schema is that we have to use a multiselect field to pull Academic Appointments through Academic Unit. Do you think this preserves or breaks the relation of Worker to an Academic Appointment? Let’s look at the report output:
Oh my. The answer, of course, is that the relationship breaks. You can see from this output that a single Worker is associated with a single Academic Unit but that said Academic Unit has 222 other associations. This is not the result we want. Here is the XML:
The Academic Appointment(s) for our Worker are in fact in this report in the Appointees column, but we can’t distinguish them from the other workers. Also, all 222 of these Academic Appointments are going to be repeated 222 times in the complete output, because each one will have a Worker who is in the left column and heads their own row that will in turn repeat the 222 appointees.
I do think this sort of output is useful if you are asking a question like “who does Worker X work with?” This of course, is not the same question as The Sample Problem.
Schema D: Academic Unit as Primary
Worker as a primary business object is always going to present a tradeoff between performance and restrictions on the data. I always make it a point to explore schemas that involve worker but with other objects as primary before making a final commitment. Let’s see what happens when we choose Academic Unit as the primary object. Here is the actual schema for Schema D:
And here is the output:
Once again, we have something that looks usable when seen in the UI. For each Academic Unit I get to see each Worker with their Academic Appointment. Except, that I’m not. This is another “Y”-shaped schema, and Academic Appointments and Workers may not correspond. We have the same problem we had with Schema A. It is entirely possible, especially with the large numbers here, that many of these Workers do not hold the Academic Appointment sitting on their row. Here is the XML to prove it:
It’s also worth noting here that the relationship from Academic Unit to Academic Appointments is only for current Academic Appointments. Consuming this data means managing effective dating so that you know what “current” actually means when you call for it. Finally, we’ll take this as another reminder that there is no relationship where there is no path. Let’s look at the next two Academic Unit schemas to see what we can learn about that.
Schema E: Academic Unit as Primary, Worker pulled through Academic Appointment
When we pull Worker through Academic Appointment, we are building a path from a Multiselect Field to a Single Select field. The is the secret to getting two columns on a path to relate to each other, and to get them to occupy their own line in the output. Thus:
Provides us with the following output:
This looks really close to the output we had with Schema D, but we know, because of the path, that academic Appointments correspond to Worker. Here is the XML:
That’s pretty clean. Each report entry has all the relationships we need.
Schema F: Academic Unit as Primary, Academic Appointment pulled through Worker
If we try reversing the Academic Appointment and Worker objects to reveal the schema for the second “L”-shape available to Academic Unit, we are presenting a path that ends in a Multiselect field. Note this in the actual Schema:
When we tried this with Schema C we ended up with something that wasn’t so useful, but this time, because Worker sits in the center of the relation, we get something a little more interesting:
Now the Multiselect field sitting at the end of the path is actually telling us something that is almost useful — namely, how many Academic Appointments each individual person has. We still have the repeating data problem we saw in Schema C, for example, the person in the output above with three appointments is going to appear two more times in the data under the other two Academic Units that she is appointed to, and each time all three of her appointments will be displayed when only one actually applies to the Academi Unit leading the row. Here is the XML:
I do want to weigh-in from the perspective of an integration developer. The repeat by Academic Unit would be an edge case that we would have to deal with, but this grouping, especially the fact that Worker and Academic Appointments are nicely grouped, makes this a very attractive schema. It wouldn’t be my favorite thing to try to write an XSLT transform for, but it would hash very nicely in a downstream system where hashing is possible.
Schema G: Academic Appointment as Primary
Schema G is a special case among the “Y”-shaped schemas, because it actually works. Here is the schema itself:
and here is the output:
Here, we know for sure that relationships hold across the entire row because there are no Multiselect Fields. There is only one Worker and one Academic Unit available for any one Academic Appointment. There are no repeats. Another way to look at this is that because each relationship is one to one, this is really a path, not a “Y”-shape. It doesn’t actually branch if there is no multiselect cardinality. Here is the XML:
Once again, it’s very clear that all of these are related, as long as you understand that there can never be more than one item in Academic Unit group. I’ve personally never been clear about why some fields like this, and like Worker for that matter, both of which are marked as Single Select fields in Workday, actually require grouping. Workday forces it when you build a report, but the cardinality and the field attributes behave as single select. For me, this makes Schema G a little harder to work with than Schema B. I have to carry that 1:1 assumption into my transforms and I can’t deduce it from the XML itself. I would want to build an XSD description that enforces this, or liberally comment my transform.
Schema H: Academic Appointment as Primary, Worker pulled through Academic Unit
Notice that in this schema we are traversing a 1:1 relationship followed by a 1:n relationship. We should expect this to behave like Schema C:
And, here is the output:
Yep, this is looking a lot like Schema C. We are still listing all the workers without actually associating them on the row. We have no idea which worker goes with the original appointment. The is the same repeating problem that was present in Schema C, but it’s worse here because there will be a repeat of all Workers for each academic unit for each Academic Appointment, rather than just for each worker. It swells the size of the output significantly. Here’s the XML:
Hmmm… No blue on this one. Where is Worker? Well, for some reason, Workday has pulled the Worker Name into an attribute of Academic Unit, instead of promoting the Worker object. You’ll recall that in our past schemas Workday chose to repeat the group element and use that as the container to house Worker with whatever it was matched to. Not here though, and I honestly can’t tell you why. The takeaway here is that it is important to examine the output of a chosen schema — sometimes the behavior will surprise you.
Schema I: Academic Appointment as Primary, Academic Unit pulled through Worker
Finally! We’ve reached our ninth and final schema. Again we have 1:1 to 1:n path just like above, where we should expect ambiguity in the Academic Unit. Here’s the schema:
And here is the output:
This is just as we predicted. A listing of every Academic Unit for a Worker with no match to the one belonging to the Academic Appointment. Here’s the XML:
Which once again confirms what we are seeing. No surprises here and nothing we haven’t seen already.
Meta Analysis: Choosing the correct Schema
It’s time to see what works best for The Sample Problem. As a reminder, The Sample Problem asks: Who are all of the Workers who have Academic Appointments and what Academic Unit is each appointment for?
Suitability Analysis
This is our first Heuristic. Before we think about performance or filters or any other aspects of engineering the retrieval of this data, we need to know if it can actually answer The Sample Problem. Here is how the Nine schemas fared:
As you can see, this narrows things down by a lot. We are left with Schema B, Schema E, and Schema G as the only three candidates that have all the relations and can answer The Sample Problem. Let’s move on to performance”
Performance Analysis
Well, this one is clear. The return time for Schema B is 123 seconds. The return time for Schema E is 14 seconds. The return time for Schema G is 4 seconds, over 30 times faster than Schema B! This brings us to a very important question: How is this report data going to be used? 123 seconds is too long for end users or synchronous activities, so if we had such an expectation we would clearly choose either Schema E or Schema G. If there is no need for speed, then 123 seconds is fine. For example, using Schema B in a typical Workday Studio integration that is simply shipping a result file somewhere in the middle of the night is well suited to Schema B.
Maintenance Analysis
For me, maintenance of reports and integrations comes down to a couple of important qualities: How well can someone else understand your work, and how much trouble is it to change? When we are thinking about data, the shape of that data contributes to the ease of maintenance in both of these areas.
Schema B exactly answers The Sample Problem. It shows Worker, Academic Appointment, and Academic Unit, in that order, with all relationships intact, and it does it for every worker. Any transformation into final output is linear, because the data are already in the correct shape.
Schema E returns Academic Unit first, then shows every Academic Appointment in the unit and the worker who goes with it. Workers are spread around the data set based on which Academic Unit they belong to. To answer The Sample Problem you will have to transform this data into the shape provided in Schema B. You will have to de-duplicate the multiple appearances of a Worker with more than one Academic Appointment, or coax them and their data onto individual rows for aggregation. You will need to make this clear in any transform or code. The maintenance cost is higher than Schema B.
Schema G returns Academic Appointment first and due to it’s structure guarantees one row per appointment. Technically it’s not in the correct shape, but it can be in the correct shape by switching the columns, something that is trivial in a transform.
So in terms of maintainability, I rank these from easiest to hardest as Schema B, Schema G, and Schema E.
Correct Data Analysis
Schema G returns exactly the number of instances of data that we want when answering The Sample Problem. Schema B returns extra data that we would need to filter away because it brings back every worker, whether or not they have an Academic Appointment. Schema E returns extra data because Workers have Academic Appointments that get repeated under Academic Units that they don’t belong to. This would be difficult to transform out of, and the transform that does it would be hard to decipher. I would only ever use this if I could hash the Academic Appointment to Worker relationship so that the duplicates disappear, and I would be dis-inclined to use it at all because of the need to do this.
Conclusion
I would choose Schema B for every scenario except for speed, and I would choose Schema G for speed. In choosing Schema B, I would also seek to make some enhancements. For example, switching a Workday Advanced Report from the “All Workers” datasource to the “Workers With Academic Appointments” datasource both reduces the output of Schema B to the exact set of data that we want and it brings the return time down to around 18 seconds. I think this puts it in the territory of acceptable for a Workday UI user. If I needed even more speed and could reach a compromise, say, of active workers only, I could take advantage of one of the indexed datasources on Worker and get returns in the sub-5-second territory.
Ok, let’s wrap this up. You must solve for schema when pulling data from Workday. This was an over-simplified example with only three business objects, but from this example we learn a few techniques:
- Always find a path from your primary object to your sub-object that keeps all of the components of the relationships you need in the data flow.
- Try to end your path with a Single Select Field. It’s the easiest way to get what’s on your path onto a unique row of output.
- Always look at the structured (XML or JSON) output of your report to determine whether Workday actually honors the relationship that you think you are seeing in the UI. Elements that are not grouped together are not related.
- Eliminate schemas that don’t have the relations you need before you worry about performance and filtering.
- Performing complex logic in a transform is a good sign that you have the wrong schema. Be open to taking another look if this is what you find yourself doing.
Happy Workday data wrangling, my friends!