Welcome, everyone. Great to see you all here. I'm Andy, the marketing manager at Paligo, and I'll be your audience concierge for today's talk. Josh, if we could skip ahead, please. Josh, Gershon, if you wouldn't mind introducing yourselves. Sure. I can begin. I'm Josh Anderson, and I'm one of the information architects here at Paligo, and I've been here almost three years. And I'm here with Gershon. Hi. Welcome. I'm Gershon Joseph. I'm a senior information architect and content strategist at Paligo. I've been here three and a half years. Welcome, you both. So for those of you watching, there's a Q&A window to your right where you can ask the speakers questions. We'll do our best to get to them all at the end of this discussion. But for any that we don't get to, we'll follow up directly with you over email in the next couple of days. And also, if you can't stay for the full discussion, for anyone who registered, we'll send you the link to this recording afterwards. Okay, Gershon, Josh, I shall hand the stage over to you two. Alright. Thanks, Andy. So I'm here today with Gershon, and I wanted to have a kind of a Q&A with him about the process of migration. I think a lot of people are in this situation where they have all kinds of legacy content in formats such as Word or PDF or Google Docs, and they're interested in moving to a CCMS and adopting a structured content methodology, but they don't really know how to make that happen. What is this process of migration like? Gershon has a lot of experience with this, so I have a lot of questions I want to ask him today. Maybe we can start, Gershon, if you can give a little bit of your background and let us know what kinds of roles you've had in the past and what kinds of industries and organizations you've worked with to do this kind of content migration work. Sure, Josh.
So the roles I've had include chief information architect and content strategist, that was at Mastercard, and chief enterprise architect, which was at Cisco Systems. I've had several roles where I've been a senior technical writer, I've been a tech pubs manager, and I've been a solutions engineer. In terms of the industries, I've covered almost every industry, but my main focus, or at least time spent, is in defense, aerospace, networking, telecommunications, cybersecurity, financial technology, and medical technology companies. Yeah. So that's a pretty big range of experience. And from that, you've come up with eight steps for migration. So maybe at a high level, you can give us an overview of what this process looks like, and then we can dive deeper and go through them one by one. Yeah, absolutely. So the eight steps are firstly to perform a content audit, then to analyze the content from the audit, then to consider content delivery: how do you want your users to consume your content? Now you have everything you need in order to start mapping your content from the unstructured legacy system to the structured to-be system. Then you plan the migration. And the next three steps are the migration itself: the pre-migration work you do on the legacy content to prepare it for migration, then the actual migration itself, importing into the CCMS. And lastly, once the content is in the CCMS, there's the post-migration cleanup, which is always required, because there's no automagical migration that just gets it in and it's ready to go. Definitely. I've seen that in my own experience as well. Well, let's start with the first step that you mentioned. You said that we have to start with a content audit. What does that really mean? The goal is to identify the content you actually need. So in order to do that, you need to list everything that you have currently.
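The audit inventory Gershon describes is essentially a spreadsheet of everything you have, classified by attributes. A minimal Python sketch of that idea, with invented document names and assumed column headings:

```python
# Hypothetical content-audit inventory: list every legacy document and
# classify it with attributes (type, product, audience) that later feed
# taxonomies, ontologies, and reuse analysis. All names are illustrative.
import csv
import io

inventory = [
    # (title, source format, content type, product, audience)
    ("Install Guide v4", "Word", "procedure", "Gateway", "admin"),
    ("Release Notes 4.2", "PDF", "reference", "Gateway", "all"),
    ("Quick Start", "Google Docs", "procedure", "Sensor", "end user"),
]

# Write the audit out as CSV, the spreadsheet form Gershon suggests.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["title", "format", "type", "product", "audience"])
writer.writerows(inventory)

# Grouping by content type surfaces early reuse candidates.
by_type = {}
for title, _, ctype, _, _ in inventory:
    by_type.setdefault(ctype, []).append(title)

print(by_type["procedure"])
```

The grouping at the end is the point: documents of the same type, product, or audience are the first places to look for shared structure.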
And you may want to add to that list the documentation, or types of documents or content, that you know you need. So very often there's what you have today, and then there's planned future stuff that you haven't gotten around to yet. This is very common with smaller startups that are still growing. If you're a legacy enterprise, then you've probably already got everything, if not most, of what you need. And what kinds of tools do you use for a content audit? Is it something that really anybody can do, or do you have to go out and find some specialized tool for this sort of thing? Actually, everybody has access to a spreadsheet, so you can just use a spreadsheet for that. What you want to do is list all of the legacy content that you have, and then classify it by assigning attributes: what type of documents or what type of content are they? And do they apply to specific products, specific audiences, specific market groups? Things like that help feed your taxonomies and ontologies. You can then identify similar content, which can open up the door for reuse in the future. And lastly, you want to identify structural components that you can reuse, things like notes, warnings, and steps, which are very often reused. That's good to hear, because I know a lot of people always wonder where they can find content that can be reused. So it sounds like this audit step early on is a good place for that, right? Yep. Perfect. Alright. So I know after the audit comes something else. Maybe you can tell us a little bit about this second step around analyzing the content. Sure, Josh. Step number two is content analysis. And here, primarily, you want to determine what you need and what you don't need. So you really only want to keep the content that you need. You don't want to bring a whole lot of legacy stuff over into your CCMS that you'll never touch. So I strongly recommend that you are as brutal as you can be in identifying things for, let's just call it, deletion.
Let them stay in the old system and only bring in what you really, really need. Of course, you can always migrate some of that stuff in the future if it turns out you do need it. But you should rather err on the side of not bringing content in than bringing in too much. Because remember, there's a lot of effort involved in migrating the content, as we'll see during this webinar. Then you want to identify opportunities for reuse. One of the main purposes of moving to a CCMS is to maximize your reuse. If you're in unstructured content like Word, the same thing can appear in a thousand Word documents, and they've been updated separately over time, so there are inconsistencies. Once you're in the CCMS, you can maximize your reuse. That significantly improves the consistency and accuracy of your content. You do need to take release management into account. So how do you release your content? Are you agile, and you update your content every few days? You make a fix, basically address each ticket, and then republish? Or are you more of a waterfall type of organization, where you may have an official release every six months? You're going to be working on your content all the time, but you're only going to do incremental releases on a somewhat irregular basis. And you also want to identify the features that you've got in the target system, because many of those features you can sort of map to at the analysis stage. You don't want to ignore the features of your CCMS, because then you'll bring the content in, and the unstructured content is now in the CCMS, but it's still unstructured essentially. It's just maybe in XML, but it's still sort of written and architected in an unstructured way. So understanding the various features that your CCMS has helps you already think, at this stage, about how to re-architect the content as it comes in. That's good to hear. So there's a lot of work that happens at this stage, it sounds like.
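One way to surface the reuse opportunities Gershon mentions, where the same text has been copied into many Word documents and drifted apart, is to normalize paragraphs and fingerprint them. A rough Python sketch, with invented documents and paragraphs:

```python
# Rough reuse-candidate detection for content analysis: hash normalized
# paragraphs so near-identical text repeated across documents groups
# together. The documents and paragraphs here are invented examples.
from collections import defaultdict
import hashlib

docs = {
    "gateway_install.docx": [
        "Warning: disconnect power before opening the case.",
        "Insert the module into slot A.",
    ],
    "sensor_install.docx": [
        "Warning:  disconnect power before opening the case.",  # extra space
        "Mount the sensor on the bracket.",
    ],
}

def fingerprint(paragraph: str) -> str:
    # Collapse whitespace and case so trivially drifted copies still match.
    normalized = " ".join(paragraph.lower().split())
    return hashlib.sha256(normalized.encode()).hexdigest()

seen = defaultdict(list)
for doc, paragraphs in docs.items():
    for p in paragraphs:
        seen[fingerprint(p)].append(doc)

# Any fingerprint appearing in more than one document is a reuse candidate.
reuse_candidates = [d for d in seen.values() if len(d) > 1]
print(reuse_candidates)
```

Exact hashing only catches whitespace-level drift; real analysis would add fuzzier matching, but even this crude pass flags shared warnings and steps.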
And I feel like there might be some people who want to speed things along and skip some steps, or get back to it later, perhaps. What would you say is the problem, if any, if we figure out our new information architecture later on, after the content analysis stage? I've seen this several times. In fact, when I was a consultant over the years, I was often brought into projects where they tried that and failed. So the main problem is your unstructured content, if you bring it into a CCMS, for all intents and purposes, still has the old legacy unstructured information architecture. And you can't do much with it. But what's even worse is that re-architecting the content once it's already in the CCMS is much more work than it would be if you create a core re-architected set of content and then build that out using your to-be information architecture. So it can be done, many people do do it, but at the end of the day, it slows down the time it takes to actually get to that structured source of truth. And it also costs you significantly more, you know, between two and five times the expense, once you've actually gotten it all done. Wow. Do you have any examples of other, I guess, clients you've had, or, you know, situations where you've done this, where you've been able to identify things at this analysis stage and they've successfully fed into the information architecture? Yeah, actually, one that comes to mind is an insurance company that we migrated to Paligo quite some time ago, but they're bringing in some additional business units right now, and they have a situation where they've got different policies across different insurance groups. And then within each of those groups, they have multiple markets that they cover. But there's a lot of shared content across the policy types, or policy groups, and the markets. Of course, they translate into twenty-plus languages, because they're a big company in the European market.
But what we did is we did the analysis upfront. And we actually migrated one of the policy types first. And then we brought in a second policy type, or insurance group, and we built on top of that first import. So we didn't actually import that second set of policies; once we'd re-architected everything after that first import, we then essentially built out the architecture within the CCMS, within Paligo in this case, to account for that second type of insurance policy. We were maximizing reuse, and I completed their POC a few weeks ago. And what we have is a true single source. So, one publication, one topic of each type, and they cover all of the insurance types, or groups, and all of their markets. So, they have this single set of content that they're going to publish to about twenty different documents for each type of policy. But it's all single-sourced. So, if they need to make an update to any of those clauses in the future, they make the update in one place, and they republish those twenty variants. And there's no chance of error, because they only updated it in one place, unless they mess up the one occurrence that they're updating, which is highly unlikely. So you've got consistency. They came from Word, or they're coming from Word, to be honest. So they've actually got thousands of Word documents, where an update has to be made across, you know, twenty-plus Word documents, and sometimes they don't update them all, or different people update different ones at different times, and then there are inconsistencies. Now that they're in Paligo, there are no inconsistencies, and their time to market is reduced as well. It used to take days to make these updates and get them out, you know, reviewed, approved, translated, and released. Today, it's not days. It's hours. Yeah. That's a real success story. They achieved single sourcing.
They established their structured source of truth. That's great to hear. So we've spoken a lot about content analysis. I'm wondering what comes after we have both inventoried and analyzed our content. What comes after that? The next step is content delivery. So, obviously, you need to take into account how you're going to deliver your content, because the CCMS doesn't exist in a bubble. The CCMS helps you author, organize, and manage the content, and in many cases also localize the content. But the whole reason you're doing all of this is to get content out and accessible to your users. So in this step, which is the third step, we dive into the content delivery. What are the output channels and formats that you need? Determine what you need to deliver for each of those outputs. And one thing that I must say is remember to include AI in this step, because more and more companies are feeding their content into AI, into agentic AI, so that the agents can be asked questions, chatbots and things like that. And you can't just ignore that; you want to take that into account as well. Now, features you should consider when it comes to content delivery. Do you need to have multiple versions accessible to your users, multiple versions of your software? Usually hardware just is what it is; the latest version is out in the market. But for software, if you're a cloud company, then usually you just have your latest and greatest out there, and that's very easy. If you're a legacy on-premise software house, then you probably have to maintain several releases over a period of time. You need to consider the languages and locales that you'll need, the user roles, market segmentation, personalization, any other filtering you need, and then legacy versus future formats. Many companies moving to a CCMS today have one output format, which they generate from Word or Google Docs: PDF.
Sometimes the PDFs are printed; if it's a hardware company, there's usually a printed manual in the box. If it's software, then there'll be PDFs available for download. But now that they're moving to the CCMS, in addition to the PDFs, they want to have some kind of online presence. If it's a software company, they may have a help desk that they publish to, so that's one of the channels. But in addition, they may want some kind of HTML5 help center. Maybe they want to start introducing personalization, so that the user logs in, and based on who they are and, you know, which features they've licensed, they only see what they use. Yeah. You used a couple of words. There was channels and there was formats. I hear these often, and it's kind of hard to know what the difference is. Could you help us understand the difference between channels and formats? Sure, Josh. Channels are the medium in which the content is delivered. So printed manuals are a channel. PDF for download is a channel. Help desks, if you deliver your content on a help desk platform, then that's another channel. Formats are the format of the content that sits on that channel. So in the case of printed content, that format would be PDF. I mean, it could be one of the older things like LaTeX or something, but these days it's usually PDF. Online formats are more and more HTML5, with CSS and JavaScript basically added into the mix. Sure. Alright. So once people have determined how they want to deliver their content, they've figured out the channels they need and the formats that they want, what would you say comes after that? The next step is mapping the content. And this really depends on the nature of your legacy content as to how much you can do. So I'm going to give you a few examples. If you're in Word, you can map things based on Word styles: you've got paragraph styles and character styles.
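The style-to-element mapping spreadsheet Gershon goes on to describe can be sketched as a simple lookup: legacy Word styles on the left, target content-model constructs on the right. A minimal Python sketch; the style names and the DocBook-flavored targets are illustrative assumptions, not a prescribed mapping:

```python
# Hypothetical mapping from legacy Word styles (paragraph and character)
# to target XML constructs. In practice this lives in a spreadsheet; the
# entries here are invented examples.
STYLE_MAP = {
    # paragraph styles
    "Heading 1": "title (section level 1)",
    "Heading 2": "title (section level 2)",
    "Body Text": "para",
    "Note": "note/para",
    "Warning": "warning/para",
    "List Bullet": "itemizedlist/listitem/para",
    # character styles
    "UI Element": "guilabel",
    "Code Char": "code",
}

def target_for(style: str) -> str:
    # Unmapped styles are flagged so the mapping spreadsheet gets updated
    # as surprises turn up in the source files.
    return STYLE_MAP.get(style, "UNMAPPED: " + style)

print(target_for("Warning"))
print(target_for("Fancy Quote"))
```

The `UNMAPPED` fallback mirrors the later advice: when pre-migration work uncovers a construct you hadn't considered, you go back and extend the mapping rather than let it come in flat.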
If you're in an HTML environment, for example, maybe Confluence or one of the help desks, that's HTML-based. Then you're limited as to what you can do, because you can't invent your own styles. You can add class attributes, so that gives you a bit more flexibility in terms of mapping from the source to the target. And of course, you've got the HTML, you know, h1, h2, those heading tags that you can map to the structure of your target content as well. So essentially, in this step, we want to map the legacy content constructs to those available in the target content model. At a higher level, you want to identify topics and topic types; you may have procedures, you may have reference content, and things like that. You determine the structure to which you want to migrate. In Word, or Confluence and those things, it's usually a flat document, really. Whereas once you're in a CCMS, you can burst those large articles into smaller sections or smaller topics, which gives you much more flexibility. Different users can work on different topics, for example, if you need multiple team members working on a particular project. Right. And also, the smaller each of your topics is, the more reusable it is, as is. The idea of a topic is that each topic should contain one idea, or speak to one idea. Whereas historically, you may have an article that's got some conceptual information, some procedural information, maybe some reference information, all in the same article, so it's not really possible to reuse bits and pieces from there in other areas, which is one of the advantages that a CCMS provides. What else do we have? Yes, you want to identify things like notes, warnings, and perhaps other reusable content that you can then map across to the target content model. Again, in Word, that would usually be based on paragraph styles. And of course character styles, if you have things like GUI elements, you know, buttons, menu items, and things like that.
It could be code, if there's API code, for example. All of those different things need to be mapped to the target content model. And by mapping, basically, to sum it up, it's like I look at what styles I'm using, maybe Heading 1 or Heading 2, and I'm deciding how that corresponds to the different elements that are available in the CCMS. Does that sound right? Yeah. That's correct. It goes beyond just the heading elements, but that's a great example. If you have styles for different types of notes or warnings, then it would apply to those as well. Sure. What kinds of tools do you use for this kind of mapping work? Is this another spreadsheet thing, or does it ever become more complex than that? A spreadsheet is good for this. You'd have the legacy construct on the left. So if you're coming from Word, then it's Word styles. If you're coming from some HTML format, it's going to be the HTML tags, or maybe something like a p tag with a class attribute value as well. You can use Word if you want, with a two-column table, but a spreadsheet is probably a little more manageable. Sure. And speaking of tables, I know those sometimes are something that's a little bit hard to map or to migrate. Are there any special considerations that you have, either with tables or with images or other things that need special attention at this stage? Yeah. Let's start with images, and then I'll get to tables. So images, it depends on the nature of your legacy content. I very often see people coming from Word, and they've got WordArt all over the show. And in addition to that, they've got their images in frames. Now, CCMSs don't actually have access to that. It's very difficult to get that out of the DOCX format. So one of the prerequisites is dealing with the WordArt. Well, WordArt itself, nobody really replicates outside of Microsoft.
So the WordArt needs to be recreated in an image creation tool, and then inserted into the Word document so that you can get it across. Or you just delete them, and then once you're in the CCMS, you can create them outside of the CCMS and import them later, if you prefer to take that approach. But even if you just have images that are sitting in frames, you need to remove the frames and embed the images inline with the text. Another thing we see often is that with images inserted in the Word file, over time, Word kind of corrupts itself, and the image quality goes down, especially if a file is duplicated multiple times. So you save as for the next project, and save as again for the next project. And after years and years of this, you know, document reuse, if you want to call it that, the file slowly corrupts itself, and the images lose quality. So I've had cases where we've imported from Word and the images have been horrendous. And the only solution is to either find the source images, if they even exist, and sometimes they don't exist anymore, in which case you have to recreate them. So that's images. But bear in mind, it's a one-time thing. So even if it's a bit of a pain and hassle to deal with these images, or if you have WordArt and now you've got to go and recreate everything in a graphics tool, it's a one-time thing. So yeah, it's a bit painful now, but, you know, you'll get over it, and it's worth the effort, right? It's worth the effort, exactly. And then tables. So tables are actually an anomaly when it comes to structured content, because the whole idea of structured content is to remove the format from the content. In an unstructured environment, the average writer spends forty to sixty percent of their time formatting and not writing. In a structured environment like Paligo, the author spends one hundred percent of their time writing. No formatting; that's handled later on, at publishing time.
But tables don't separate format from content, because tables are inherently format. You've got the need to merge and, you know, split cells; you may want different widths on your cells, on your columns. So it's a bit format-intensive, and you can't really completely separate format from content there. Paligo does provide ways where you can automate the look and feel of the tables, but there's always at least some amount of manhandling, or person-handling, that's required when working on the tables. Sure. That being said, if you have two-column tables, both DITA and DocBook provide a great alternative to tables for those. In DITA, it's called a definition list, or dl, and in DocBook, it's a variablelist. And this is a content model, or a construct, that I highly recommend you use. It makes the creation and updating of these two-column tables much easier, and you can still publish them as a table, if you so desire. Online, actually, tables really waste a lot of real estate if it's just a two-column table, and the out-of-the-box presentation, both with DITA and DocBook for online, is not a table; it's a much more efficient online viewing experience. Sure. Alright. Yeah, that's a lot of good in-depth knowledge about how to handle these different elements while we bring them over, how to map them properly. So let's say now that we've done the mapping process and we're ready. Is it time to migrate now, or is there anything else we need to do first? Well, there's one more step we need to do before we actually migrate, and that is to plan the migration. You don't want to roll up your sleeves and dive into the deep end if you haven't learned how to swim yet. So what we do here is, based on the first four steps that we've covered, you're now ready to plan the migration. And here is a strong recommendation of mine: build your new information architecture out upfront, as I explained for this insurance company.
If you don't do it now, it's going to be a much longer process, and a harder process, to actually get there. So maybe you just want to do a small POC at this point in time as you're preparing your migration and working on your migration plan. Or you may even do it with what I call minimal viable content, which is what I did with the insurance company: I brought in one set of content that was representative, something I could use to rework so that I'd get my initial information architecture, ready for me to build on top of. So, you know, build out that content, add to the topics. I just copy-pasted from the Word files as I was building that content out for the additional types of policies and stuff like that. So what you're saying is that you start with kind of a small piece and do that first, as opposed to trying to get everything in at once; there are real benefits to that piece-by-piece approach. Exactly. Exactly. Yeah. Perfect. And so with migration, and planning how this will be done, I think a lot of people will feel a bit intimidated by this, because maybe they've never done it before. And there is the wonder of, should they do it themselves and learn everything they need so they can do it, or should they try to go find somebody who can help them do it? What are your thoughts around if or when you should have somebody else help you with the migration, versus learning what you need so that you can do it yourself? So my recommendation is you always want to migrate some of the content yourself. That gets you up and running in the new CCMS, the new environment, working with the XML, working with the new architecture.
If you have a huge content set, and you really do need to bring everything in, like, you know, if you're a hardware company that's been in business for sixty years, and you just have a vast set of content that you're still updating all the time, then you may not want to bring it all in yourself, unless you have, you know, interns or something like that that you can leverage. You can certainly hire a third party to do the migration, but you should still keep back at least a small part of the content that you can migrate yourself. The experience and the learning is much better than if you farm everything off to a third party. And we do work with a number of partners that handle migrations for our clients. But even then, I always recommend that the customers hold back some of the content from the partner, so that they can migrate it themselves. Because otherwise, what happens, and I've seen this in my early days at Paligo, when I wasn't able to give that advice to incoming customers: I came in after the third-party consultant was already working on the migration. So they farmed out the entire migration, and when the migration was finished, it was handed over to the customer. The partner did some basic training on the information architecture of the content, and I had a few meetings and discussions with our customer when coming up with the information architecture. But when I started working with them, the information architecture wasn't really suitable. I updated their partner as well, and, you know, it was learning on both sides. It was one of their first Paligo migrations, to be fair. But what often happens is the IA may not be perfect or ideal. The consultant may not have taken everything into account; they usually do, but they only have the information based on what the customer tells them. Very often the customer doesn't know what to tell them, right?
So they're missing information, which leads to some gaps. But the main challenge that I saw from this particular one was the time it took them to learn how to use Paligo, how to get their heads around the information architecture, and, you know, just how to get into structured authoring. They went through the training and everything. But when it came to applying that to their content, their content was so different from what it was in the legacy system that they couldn't recognize anything. Whereas had they at least done some of the migrating themselves, they would have known where to start; you know, things would be familiar, because they would have made this move to the new information architecture themselves, not just had it handed over to them at the end. Yeah, it makes sense to definitely put in the effort at the beginning to learn what you need to recognize your content, so that once it's transformed, you're not totally lost. One thing I wonder is, sometimes we might at this stage think that the migration is so difficult and so complex. Are there ever times where you get the sense that it's easier to just recreate the content fresh from within the CCMS, as opposed to migrating? Like, how would you recognize those kinds of situations? My first recommendation to any customer is, if you can afford to start fresh and build a clean information architecture, that's always best. They often think they haven't got time to do that. I personally think it would take them less time to do that than to migrate and then have this kind of hybrid situation.
But in any case, be that as it may, if you can start with a clean sheet, and you've got access to, you know, a consultant who is a good senior information architect and content strategist, you're best off doing that, because you can recreate from scratch a far superior architecture than you can get by taking your legacy content, tweaking it a little bit, and then bringing it in. But that being said, there's the approach that we used for the insurance company, where you migrate a smallish, representative set of content, get it into the CCMS, and rework it. I mean, we did heavy lifting to rework it: insert variables, apply profiling, and get all of the variability and reuse set up. That wasn't like a sort of five-second thing, but it was time well invested. And then, basically, you can think of it as building or preparing the foundation of a house. Once you've got that foundation, you can then build the first floor, or ground floor, which is that representative content that you brought across and reworked. And the additional floors on the house, like the second, third, and fourth floors, if you have such a large building, those are the variants or variations of the content that you're building out on top of that foundation, rather than just copying everything in, where you've now got, like, four houses side by side. Right. Each with a different foundation, kind of thing. That makes sense. Well, let's say that the planning is now complete within this process. Are we ready to migrate now? What comes next, would you say? One last thing that I want to mention about planning before I get to that: obviously, during the planning, you want to determine the effort. So you should migrate one representative piece of content, or a small number of representative pieces of content if you have different sorts of groups, and then just multiply by the number of documents you've got.
That'll be way more time than it's ever going to take, but it gives you a thumb-suck estimate as to how long it's going to take. Okay, then the next step is actually the first of the three migration steps. That first one, which is step number six, is the pre-migration work on the legacy content. Now, if you're coming from Word, that's going to be dealing with images, like we said previously. If you've done local formatting in Word, where you don't use styles, you just make things bold, make things green, you know, instead of using Word styles, you need to go back and apply the Word styles, because you want to map from your Word styles, both paragraph and character styles, to your XML content model. Otherwise, it's going to come in fairly flat, and you're going to have a lot of work in the CCMS to then go and apply the correct XML elements. Yeah. So in other words, if all I've been doing is highlighting paragraphs and bumping up the font or putting on a color, but I didn't actually use the styles, that's kind of the situation you're talking about. Yes. Exactly. Now, if you're in an HTML-based unstructured environment, for example, Confluence or all those help desk platforms, most of them give you proprietary extensions that you can use. Now, CCMSs aren't coded for those proprietary extensions. So sometimes you need to take them out; you may need to change your HTML coding. For example, by replacing whatever those extensions were with standard HTML elements and class attributes to identify things like notes, warnings, and so on. And, dare I say, probably the most difficult format to migrate out of is InDesign, Adobe InDesign, because that's a highly design-centric environment. So if you're in InDesign, what you'll probably have to do is either publish to HTML, or publish to PDF and then convert the PDF into Word.
And then you can go and apply the styles like we said: either the HTML retagging or the Word style application. But that's probably about the worst situation you can come from. Because it's so visual. Right? There's no semantics in the way that you get with HTML so much, or Word. Right. So that's sort of the pre-migration work. While you're working out your pre-migration, you also want to document all of these cleanup things that you need to do. If you start working on content that you haven't actually migrated yet, you may find new things in the source files that you haven't considered. So if you find these things, you go back and figure out, okay, how do I map these, and then update your documentation accordingly, so that moving forward you'll take these surprises into account and there won't be surprises anymore. So I suppose after the pre-migration work, then we're ready for the migration work. Does that seem right? Yeah. Perfect. And so what kinds of considerations should we have at this stage, when we're actually finally doing the migration? Yeah. So the migration itself is usually fairly quick and straightforward. Most CCMSs provide an import function, so you just import that content that you have cleaned up, and it comes in. Now, there is a best practice here that I'd like to share, which is what I call just-in-time migration. I haven't mentioned this yet. Instead of planning to migrate your entire content set: we've already spoken about that, and let's say that half your content set won't be migrated; you've already addressed that at the audit phase. But you've now got this other block of content that does need to be migrated, and you probably don't need all of it migrated right now. So what I recommend is an approach called just-in-time migration.
So migrate what you need for the next month or two, because the learnings from that migration can then be applied to the next batch of migration. So rather than doing one huge migration effort of your entire content set that needs to come in, you migrate smaller blocks as you need them, and you can tweak your migration process further as you go through these iterations. So after you import the content, what would be the first thing that you do at that point? Once it's been imported, you want to check the logs. Most CCMSs, when they import content, log errors and warnings. You need to pay more attention to the errors; those identify things like content that's been dropped. And then depending on the CCMS, there may be errors or maybe warnings that identify cross-references that are broken and need to be fixed afterwards; the target may not have been part of this document, and therefore it can't be resolved. Now, some warnings can be ignored, but you should figure that out as you go, so that you can at least identify, okay, these warnings I don't care about. But the ones that identify lost or corrupt content coming in should be accounted for. Okay. So then I think at that point we're probably into the post-migration cleanup. Are we done at this point, or are there any other last considerations that you would have? Well, the post-migration cleanup, this is where you need to go and really re-architect, or finish architecting, your content. So if you're coming in from unstructured, you're not going to have things like variables and profiling attributes set up yet. So you want to go through your content. Also, in many cases, the mapping from the legacy unstructured to structured is a one-to-many mapping. Take, for example, notes: you may just have a note style in Word, but in DocBook and DITA, there are various types of admonitions.
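The one-to-many admonition mapping can be sketched like this: import everything as a generic note, then reclassify each one from its own opening word. This is only an illustration of the cleanup pattern, assuming English content, and the keyword list is not exhaustive:

```python
# One-to-many admonition cleanup: Word typically has a single "Note"
# style, while DocBook distinguishes note, tip, caution, warning, and
# important. Import everything as the most common type, then
# reclassify from the note's first word. Illustrative keywords only.
ADMONITION_KEYWORDS = {
    "warning":   "warning",
    "caution":   "caution",
    "tip":       "tip",
    "important": "important",
}

def reclassify(note_text: str) -> str:
    """Pick a DocBook admonition element from the note's first word."""
    first_word = note_text.strip().split()[0].rstrip(":!.").lower()
    return ADMONITION_KEYWORDS.get(first_word, "note")

print(reclassify("Warning: disconnect the power first."))  # warning
print(reclassify("Remember to save your work."))           # note
```

In practice a heuristic like this only gets you partway; anything ambiguous still has to be reviewed by hand, which is why the most-used type is mapped automatically and the rest are reworked in the CCMS.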
So you would map the most-used one, and then the others you would have to come and rework once you're in. So you need to go through and clean up the XML itself by applying the correct contextual elements to things. Very often you have to change buttons that may be bold into, you know, guibutton or guimenu, things like that. Right. You need to apply variables, you need to apply profiling, and things like that. Depending on the CCMS, there may be some additional things that you need to do. And again, as you go through this post-migration cleanup and you check your content, you're going to find surprises. So, you know, investigate the surprises and update your migration plan documentation accordingly. The last thing that I'll recommend here is that you should publish your content. And then check: are the tables of contents correct, or are there corruptions that have happened? Does the content appear to be there, or are there pieces that are missing? And if you're coming into Paligo, we have a nice feature where broken references, so cross-references that are broken, appear as three question marks in HTML5 output. So I would publish to HTML5, and you can just search that for three question marks, and you can identify broken links. That's an easy way to identify broken links. Now at this stage, are you thinking about translation or localization at all? How much does that factor into this migration at the end? Yeah, so we've got the eight steps that we've been through, and localization is an important consideration. And it applies across all eight steps; I don't have it as one of the eight steps because it applies throughout. You need to take your current localization workflows and pain points into account. You need to consider what your future localization needs are.
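An aside on the broken-reference tip just mentioned: the search for three question marks in the published HTML5 output is easy to script. The directory layout here is a hypothetical example, and the only thing assumed about the output is the "???" marker itself:

```python
# Scan a folder of published HTML5 files for "???", the marker that
# appears where a cross-reference could not be resolved.
from pathlib import Path

def find_broken_refs(output_dir: str) -> list[tuple[str, int]]:
    """Return (filename, line number) pairs containing the ??? marker."""
    hits = []
    for html_file in sorted(Path(output_dir).rglob("*.html")):
        lines = html_file.read_text(encoding="utf-8").splitlines()
        for lineno, line in enumerate(lines, start=1):
            if "???" in line:
                hits.append((html_file.name, lineno))
    return hits
```

A plain text search in an editor does the same job; scripting it just makes the check repeatable after every publish.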
Even if you don't localize today, you should at least have a thought about localization, because somebody may come knocking on your door and say, okay, we need German or we need Italian versions of our content. So you want to be ready for that. You don't want it to be, oh, I don't know anything about that. So you want to take it into account ahead of time. Now, once you're coming into the CCMS, you've got options that you didn't have before. In the olden days, the legacy days, you would probably just send your Word file out to a localization agency. They love that, because besides translating the text, obviously, they also have to redo the images, and they charge a fee for each image. But now that you're in XML in a CCMS, there are integrations with translation management systems, or built-in AI translation services. So depending on your needs, you may want to rethink your translation strategy. Now, if you're in a regulated industry, like, you know, pharmaceutical, you have to have human translation. But even then, you may want to do sort of AI translation first and then feed that into humans; many companies have started doing that. Or for content that's outside of the regulatory compliance area, you do AI translation, because that's quick and cheap, comparatively. And then for the compliance information that has to be accurate, where you don't want to get litigation around that content, you would use human translation with the full translation memory and, you know, translation management system as well. The CCMS integrates with multiple translation workflows, and you can really achieve some savings there as well. Definitely. So I think at this point, we're at the end of the migration. Right? There's nothing else that I can think of. No.
There's nothing else except to enjoy life as a structured content author, because now you can focus on the content. You don't have to focus on the formatting. That's awesome. Thanks so much, Gershon, for walking us through these eight steps to migrate your content from those legacy unstructured formats into something that's structured and ready to be the source of truth for AI in your enterprise. You're most welcome, Josh. Absolutely. Thank you, Gershon, and thank you, Josh, as well. So now we'll take a look at the questions submitted and try to get through them. Any we don't get to, we'll follow up on directly in the next couple of days. We're currently sitting on sixteen questions, so we'll do our best to power through. If there's some overlap between some, I'll consolidate them together. So we shall just start from the top and power our way down. First question, from Maggie, is a very specific one: is there an out-of-the-box migration path for importing content from Sphinx into Paligo? No, there isn't. That's not one of the formats that we support out of the box. Alright. Easily answered. Let's see. Next one, from Charles. Now, this was from earlier in the discussion: if you no longer have access to the system where your content is being migrated from, it's lost. So I would say migrate everything, then archive the things you don't need. So Gershon, how would you balance the debate here? Yeah. So I'm actually in the process currently of migrating content for a customer that is in this boat. They've actually got two legacy systems that they used before they came to Paligo. Now, what you want to do is export your content out of the system before your license expires, which is what they did. So for the first platform or solution they used, they exported the content, and it just sits on a file server on their intranet.
And whenever a project comes in where a legacy customer is moving to their current release, and all of their releases happen to be customized for each customer, it's that type of project, they will then migrate that ancient stuff. It is HTML-based, so from that perspective, it's okay. I've migrated two of them so far; the first one was okay, the second one was a nightmare, but because they archived it out of that system, it was available to them, and we could then migrate it into Paligo. The second system in their case is actually another CCMS, a competing CCMS, that they decided they want to leave, and they're going to be terminating their licenses with that vendor in the next few months. And in that case, they've made a business decision to migrate everything into Paligo, because those are all active customers. So it's quite a large project to migrate all of that content out, because we're not doing any reuse. For various reasons, mainly based on the architecture of the legacy system they're moving out of, their branching essentially just copies and creates a duplicate of everything. So we're bringing all those duplicates in. They have started doing some single-sourcing stuff as content has been coming into Paligo, so they're taking advantage of that. But I bring up that customer because it's a good example of, on the one hand, get everything out of the old system, because you certainly don't want to lose it, but then it sits there until you may or may not need it, and you can always bring it in just in time. But if you have content that you absolutely do need to update, because they're active customers, you know, active documents, then you can't leave anything in the system you're moving out of, because once your license expires, whatever the SLA is, thirty days later they basically blow up the instance, right?
So again, my recommendation is: get it out of the old system, because you're going to lose access to it, but don't bring it into the new system until you're sure you need it. Very balanced response. I will go on to the next one, which is, oh, excuse me, I had repeated that. From Miriam: we have over a thousand documents written in InDesign, and we have been publishing to PDF, which we import into Paligo. Are you saying it would be easier to publish an HTML file and then migrate? You may lose too much if you go to HTML. Our preferred route is to go to PDF and then Word, which is what you're doing. We have had customers publish to HTML and bring it in that way, and then I guess they have additional cleanup they do either in HTML or in Paligo. You may want to try it and see if it works for you. Oh, Josh, did you have a perspective on that one? No, I think I agree. It's not necessarily that one might be better, but one might just be more appropriate given your situation. So it's worth at least trying both, coming from Word or coming from HTML, when you import into Paligo, and just see what works best in your case. Yep. Absolutely. Alright. From Ralph: in the case of graphics in DOCX files, can you simply extract them by unzipping the Word file with a zip tool and searching for the graphics folder? Not that easy. But if you have access, or if you know VBA, or whatever it is they have these days, I guess it's just VB without the A. Back in the day of VBA, I used to write macros that would prepare the Word file for a customer or for me. When they moved to Visual Basic, I managed to do some VB development that did some of that stuff. I haven't worked in Word, thankfully, for more than a decade, so I've no idea what their VB capabilities are these days.
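For what it's worth, Ralph's premise does hold for the extraction half of the problem: a .docx file is an OOXML ZIP archive, and the embedded images sit under word/media/. A stdlib sketch of that extraction (the placement problem discussed here still remains):

```python
# Pull every embedded image out of a DOCX file. A .docx is a ZIP
# archive (OOXML), with images stored under word/media/.
import zipfile
from pathlib import Path

def extract_docx_images(docx_path: str, out_dir: str) -> list[str]:
    """Copy everything under word/media/ out of the archive; return names."""
    extracted = []
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(docx_path) as zf:
        for name in zf.namelist():
            if name.startswith("word/media/") and not name.endswith("/"):
                target = Path(out_dir) / Path(name).name
                target.write_bytes(zf.read(name))
                extracted.append(target.name)
    return extracted
```

What this cannot tell you is where each image belongs in the text, which is exactly the anchoring problem described in the answer; the relationships between images and their positions live in the document XML, not in the media folder.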
But assuming that you've got access to the constructs, you can write an automation in Visual Basic, or I suppose C# for that matter, that searches for each image, or searches for each frame and then checks what's inside the frame. If it's an image, there's code that essentially embeds the image. The problem with that is that it embeds the image in a weird location, because wherever the anchor is, is where the image ends up being. So you need some additional code that identifies where you probably want the image to be, some logic to try and put the image into the appropriate place. If you happen to have figure captions, then that logic is fairly straightforward. If you don't have figure captions, then you may just decide, you know, let me just get the image dumped in somewhere in the Word file, and after that's been done, you can go through the Word file manually and move each image to where you want it. Okay. Okay. Oh, from Tiliana: do you think it is a good idea to structure content topics in the CCMS around the main content types according to the DITA concept, task, and reference? I mean, applying a more semantic-based rather than product-centric approach. Thank you. Generally, the answer is yes. You don't have to use the DITA three, concept, reference, and task, though obviously there's nothing wrong with them. I've worked with a lot of DITA systems and DocBook systems, and I don't see them as competitive, by the way; I see them as two different approaches to solving a similar problem. The one thing I do want to say, though, is that you want your topics to talk to one idea, whatever that idea is. And then, depending on your content as well, you may require some additional topic types. So if I take the DITA definition of a topic type, you may have things like glossary terms, which you want to manage as a different kind of topic type or content type.
So each glossary term, and the acronyms and all the other stuff that goes with it, will be treated as a different type of topic. You may have, if any of you are aware of the Precision Content writing methodology, they separate out tasks from processes. So a task is something that a user, a human, does in order to achieve something, whereas a process describes how a system works. So those are also two different topic types. In that Precision Content environment, even when it's being applied in Paligo, you'd have task topics and process topics. So you can have any number of these types of topics, depending on your writing methodology, depending on your corporate styles or rules and guidelines, and sometimes even certain types of industries have additional types of topics that they just need because of the nature of the industry. Yeah. I think it's good to realize that the structure of your content can go beyond just which tags are allowed in which place; it speaks to the way it's actually written at a fundamental level. So that kind of structure, I think, is always useful no matter which CCMS you're using. Yep. From Gabriel, or Gabrielle: what experience do you have with migrating Adobe InDesign documents? So it's broad, but we can do it, I believe. Yeah, so what experience? Awful experience, that was just a joke. So we have partners who specialize in InDesign migration. But to be more serious, it really depends on the nature of your InDesign documents. I've supported a number of customers over the past couple of years now, since I joined Paligo, that came from InDesign. Some of them had used InDesign for their user guides, and those were not much different from user guides that are developed in Microsoft Word, so those are much easier to migrate. Then I had a different customer that had, what do you call them, brochures, including specifications of their products; it's a hardware company.
So their spec sheets were also done in InDesign with a very high design. It was like high design, low content. And we were unable to replicate those in Paligo, but we came pretty close. The migration, however, they basically did by copy-paste, over a period of several months, as they needed to migrate those data sheets over. So, you know, the short answer is: if you have a lot of InDesign content, we've got partners that specialize in migrating from InDesign to a CCMS, not just Paligo, CCMSs in general. We do also sometimes do InDesign migrations ourselves; our professional services team have sometimes taken on InDesign projects. We also have customers that do it themselves. So, you know, happy to have a conversation. I'd really need to see your InDesign documents to see exactly how complex they are. Yep. So I just wanted to check, are both of you guys okay to answer a couple more questions, or shall we wrap up? Yep. Alright. We'll keep going while we're here. There's nothing like getting a live answer. Okay. From Wendy: can you use alt text to map images? With out-of-the-box migration from formats that support alt text, the alt text comes in as the alt text in Paligo. So it's actually attached to the image as the image description, and at publishing time that becomes the alt text in the published HTML. Can you remap that to do something else? In a custom migration, it's potentially possible. But you may not want to do that, because you don't want to lose the alt text; if it's currently alt text, you want to preserve it for accessibility purposes. Absolutely. Okay, so next from Birgit: what system do you recommend for structuring documents in small topics, for example, Diátaxis or iiRDS? I'm not familiar with Diátaxis, but iiRDS is more a metadata or taxonomy model. I actually have a working iiRDS implementation in my Paligo instance, which I implemented myself. I created the content.
Actually, the content was migrated from a reference model of iiRDS. That's in Word; I imported the Word, and then I reworked everything in Paligo to apply the appropriate metadata and taxonomies to the content. So it wasn't really a system that did the restructuring for me. I imported it from Word into Paligo using the standard import, and then I re-architected the content inside Paligo. Alright. So I think this one is repeated by another person, but I'll double-check. The content analysis to find reusable content across thousands of Word docs, for example, would be a huge manual effort. What tools does Paligo have to enable this? So, finding opportunities for content reuse, I would say. Yeah, so actually, one of our fellow IAs has had good success with using AI, for example, ChatGPT. Now, you may want to use something that is, you know, a business account, not something that you're sharing your content publicly with. But if you feed your content into something like ChatGPT, Claude, or even Gemini, and you give it the appropriate context and you ask the appropriate question, we've seen good results come out of these initiatives. And this is the part where we say: unless you don't have permission to use such tools. Right. But if you do have access, yes, very, very useful. So that was very similar to, I'm just going to briefly put it on stage, from Miriam: identifying reusable content and avoiding duplicates. Same answer, really. Yeah, I will give a plug here. You may have noticed that Paligo is doing a lot of investment currently around AI. And one of the things that we have planned, it's not currently in development, but it's something that is planned, is to crawl your content in the Paligo repository and suggest opportunities for reuse, and even essentially guide you through those changes. So you don't need to go and say, oh, okay, this is what you found, now I'm going to make the change myself.
Paligo is actually going to make the changes, or suggest the changes, which you can then tweak or approve. So that's once the content has already come into Paligo. I think a natural extension of that at some point in time would be to perform that analysis outside of Paligo, before the content comes in, and then have a sort of AI-powered import tool that would apply those findings at the time of import. That import one is more sort of pie in the sky; I have no idea what the chances are of that being done. But the one that retroactively, periodically looks at the repo, that is something that we've got planned. How can AI be effectively involved in a Microsoft Word to Paligo migration to minimize manual effort? A rather broad question. Yeah. But this is a nice question, too, and actually, this may be a good one to try with your InDesign content as well. He did a follow-up comment; excuse me, I'll just give the full context. He's considering using AI to simplify MS Word documents, convert them to Markdown, and then import them to Paligo, and is that a good approach? Sorry, Candida. Yeah, okay. By converting them to Markdown, you're probably going to lose some richness that you already have in the Word, not that Word is that rich. I'm assuming that you're applying, you know, extensive paragraph and character styles, assuming you've got custom templates in Word. If that's the case, then you may lose too much going into Markdown. We have a very good Word migration tool anyway, and we can do custom Word migrations where we can map custom styles into DocBook elements and attributes. But what I would recommend, if you want to use AI, is that you can actually use AI to convert your Microsoft Word files into DocBook 5.1 or 5.2, or even DocBook 5.0 XML. And then you can import the DocBook directly into Paligo. I can chime in a little bit.
I did some things very similar to this, not with Word necessarily, but with PDF. I just gave it to ChatGPT and said, turn this into DocBook 5, and imported that. And it worked very well. I was surprised, so I'm sure it would also work well with Word. Yeah. Thanks, Josh. It should work with Word, and I'd even suggest maybe trying that with InDesign, for the question earlier that was asking how to get InDesign files in. You may need to go through a few rounds, because it's InDesign, and it may actually create some invalid DocBook; you'd want to validate that DocBook before you feed it into Paligo. So you may need to tell it, don't do this, do that; you may need to do some handholding. But that's a route that may actually work out fairly well from InDesign. I found it worked well with finding the admonitions. If I did the normal Word import, a note might just come in as a regular paragraph with the word Note at the beginning. But when I went the AI-to-DocBook route, it actually put it into the proper element, so it saved a little bit of effort. Yeah. That's cool, because that saves requiring a custom professional services import project to map those styles into DocBook. So, you know, ChatGPT saved the need for a custom import and gave it to you without too much effort. Thank you very much. So we're going to have the wrap-up here, simply because we're running out of time, and then for the remaining questions we'll follow up over email. So just from both of you, a couple of light closing thoughts: when it comes to migration, share some real-life examples of dos and don'ts. Josh, you want to go first on this? Sure. Yeah. Let me think. I think maybe the mistake people will make is to assume that it will happen in just one pass, or that it'll be very easy from the get-go, when probably there will be a need for multiple iterations of things.
You'll import something and realize maybe there are further cleanups I need to make at the source to make this better, and then you'll go back and iterate on it. So I think that would be a good approach to take that I've learned from experience. Cool. So one do that I'll share is: the more preparation work and the more analysis work you do upfront, the better the outcome. And a don't, and I have to say this one: don't bring your content into the CCMS as is and then plan to re-architect in the future, because that future never comes. We have a customer that was already in Paligo when I joined. We did an IA workshop with them, and they knew that they would have to make all these changes down the line. And it's been several years; they have made some changes, but they have to basically block, like, three to four weeks of, okay, now we're doing this re-architecture work on all the content. And then after that, they'll go back to business as usual until, like, the next summer, when they can close down business as usual for a month to do the next step. And that's a good customer, because they have actually found a rainy day that they can do this on. At many customers, and I've seen this at other companies I've worked with as well, the rainy day never comes. True. Once you get on the bicycle, it's hard to get off. Then Josh will get the closing slide up, if you don't mind. I just want to thank everybody for joining us, and I hope you enjoyed this session. You'll see the recording in the next couple of days. Don't forget to go to paligo dot net to learn more about us and how Paligo will become the platform for structured truth for enterprise AI. Goodbye, and take care for the rest of the week. And thank you, Josh and Gershon. Thank you. Thanks, everyone.
Apr 29, 2026
