
Content Management Options for IBM Watson Conversation Service

The Reality

Things change very quickly in the cognitive world, so the information here is only as current as the last update date. The options provided here are some of the possibilities, not an exhaustive list.

The Purpose

This post provides some options for managing your content in the Watson Conversation service.

The Considerations

Many factors, such as the user scenario, use cases, requirements, and data, should be used to determine which approach to take. Some examples of questions to ask about your data: “Is the answer data structured or unstructured?”, “Are the answers dynamic and context-dependent?”, “How large is the answer data (number of units)?”

And now, here are some options…

Option 1 – WCS Provided User Interface

This option is to use the out-of-the-box tooling provided with the Watson Conversation Service (WCS) for all your content updates.

Pros:

  • You can begin making content updates immediately (no additional development effort for a content management solution)
  • No additional tooling required
  • No additional cost
  • No additional components – all content managed within WCS

Cons:

  • Can become tedious to manage, depending on the deployment type

Option 2 – Leverage WCS APIs

This option is to use a simple data source, such as a spreadsheet, to manage and house your content, and then use the WCS APIs to push that content into your WCS workspace.

Pros:

  • Easier to manage content updates
  • No additional orchestration of answer retrieval needed
  • No complex database needed

Cons:

  • Requires development of additional component (custom code & spreadsheet configuration)
  • Maintenance – Like anything with custom code, you will need to maintain the code any time the leveraged APIs change
  • Error prone when using a spreadsheet (identifying errors may be challenging)

Note: There are various ways this solution could be implemented, such as using a web application, that I will not dive into here.
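For illustration, the sketch below (a minimal sketch, not a definitive implementation) reads intent names and example questions from a CSV spreadsheet export and pushes them into a workspace using the WCS v1 ‘updateWorkspace’ REST endpoint. The file layout, service URL, version date, and credentials are assumptions you would replace for your own instance.

    # Sketch: push intents from a spreadsheet (CSV export) into a WCS workspace.
    # Assumes a CSV with two columns: intent name, example question.
    import csv
    from collections import defaultdict

    import requests

    WCS_URL = "https://gateway.watsonplatform.net/conversation/api"  # assumption: your instance URL
    AUTH = ("your-service-username", "your-service-password")        # assumption: service credentials
    WORKSPACE_ID = "your-workspace-id"                               # assumption
    VERSION = "2017-05-26"                                           # assumption: API version date

    # Group example questions by intent name.
    examples_by_intent = defaultdict(list)
    with open("intents.csv", newline="") as f:
        for intent_name, example_text in csv.reader(f):
            examples_by_intent[intent_name].append({"text": example_text})

    payload = {
        "intents": [
            {"intent": name, "examples": examples}
            for name, examples in examples_by_intent.items()
        ]
    }

    # updateWorkspace: POST /v1/workspaces/{workspace_id}
    resp = requests.post(
        "{}/v1/workspaces/{}".format(WCS_URL, WORKSPACE_ID),
        params={"version": VERSION},
        auth=AUTH,
        json=payload,
    )
    resp.raise_for_status()
    print("Workspace updated; training runs in the background.")

The same pattern could extend to entities or other spreadsheet columns; the point is simply that the spreadsheet stays the system of record and a script keeps the workspace in sync.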

Option 3 – Leverage a Database

This option involves creating a custom web application to manage content updates, backed by a custom database that stores all the WCS data, including answer data. Making this work requires some additional components, such as a service orchestration engine and a UI.

Pros:

  • Easier to manage content updates
  • Less error prone than using a spreadsheet like Excel (input is controlled through the web app)
  • Changes to the WCS APIs would not affect the custom web app & database which manage the answers, as those are independent of the WCS APIs (however, standard changes in the application orchestration layer still apply)

Cons:

  • Requires cost, development & management of additional components (Web App, Orchestrator & DB)

DB Considerations (some examples):

  • Type
  • Sizing
  • Regional hosting options
  • Cost

Note: There are various ways this solution could be implemented, but this should provide some context for different ideas.
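To make the answer-retrieval step a bit more concrete, here is a minimal sketch (the table name and schema are assumptions, and SQLite is used purely for illustration): WCS classifies the utterance and returns an intent name, and the orchestration layer looks up the corresponding answer text that the content web app maintains in the database.

    # Sketch: the orchestration layer resolves the intent returned by WCS to
    # answer text stored in a database (SQLite used here purely for illustration).
    import sqlite3

    def get_answer(db_path, intent_name):
        """Look up the answer text managed by the content web app for a given intent."""
        conn = sqlite3.connect(db_path)
        try:
            row = conn.execute(
                "SELECT answer_text FROM answers WHERE intent = ?", (intent_name,)
            ).fetchone()
        finally:
            conn.close()
        return row[0] if row else "Sorry, I do not have an answer for that yet."

    # Example: WCS classified the user utterance as a hypothetical 'opening_hours' intent.
    print(get_answer("answers.db", "opening_hours"))

Because the answer text lives in your own database, content editors can update it through the web app without touching the workspace at all.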

The Conclusion

There is no single solution that is the best fit for every scenario. Each project will require careful consideration of the items discussed in this blog post, along with other factors that may come into play that were not mentioned here. There are likely other options out there, but hopefully this has helped get you thinking about the possibilities. The key is to weigh all the pros and cons, along with the hosting environment, team skills, and more, to make the best decision.

How to Migrate Watson Conversation service from Test to Production

You’ve created some intents, entities, answers, and dialog flows. You have hopefully gone through some iterative testing cycles to validate them. Now you want to promote those changes from your Test to your Production environment, so what do you do?

Promoting Your Entire Workspace

Your Watson Conversation Service (WCS) configuration (your intents, entities, dialogs, etc.) is stored in a workspace. When you want to do a complete migration or promotion from your Test to your Production environment, update your Production workspace using the WCS ‘updateWorkspace’ API. This takes your JSON export from the Test workspace and updates your Production workspace with everything it contains.

The benefit of using this method is that the ‘workspace_id’ remains the same, so any external component relying on that ‘workspace_id’ will not require changes.
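Here is a minimal Python sketch of that promotion flow, assuming the v1 REST endpoints with basic-auth service credentials (the URL, version date, and workspace IDs below are placeholders): export the Test workspace with a GET (export=true), strip the Test-specific identifiers, and push the result into the Production workspace with ‘updateWorkspace’ so the Production ‘workspace_id’ is preserved.

    # Sketch: promote a Test workspace into an existing Production workspace
    # without changing the Production workspace_id.
    import requests

    WCS_URL = "https://gateway.watsonplatform.net/conversation/api"  # assumption: your instance URL
    AUTH = ("your-service-username", "your-service-password")        # assumption: service credentials
    VERSION = "2017-05-26"                                           # assumption: API version date
    TEST_WS = "test-workspace-id"
    PROD_WS = "prod-workspace-id"

    # 1. Export the full Test workspace (intents, entities, dialog nodes, ...).
    export = requests.get(
        "{}/v1/workspaces/{}".format(WCS_URL, TEST_WS),
        params={"version": VERSION, "export": "true"},
        auth=AUTH,
    ).json()

    # 2. Drop fields that belong to the Test workspace before reusing the JSON.
    for key in ("workspace_id", "created", "updated", "status"):
        export.pop(key, None)

    # 3. Update the Production workspace in place (updateWorkspace).
    resp = requests.post(
        "{}/v1/workspaces/{}".format(WCS_URL, PROD_WS),
        params={"version": VERSION},
        auth=AUTH,
        json=export,
    )
    resp.raise_for_status()
    print("Production workspace updated; allow time for training to complete.")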

Other Options to Move Workspace Data

Another option for moving or sharing workspaces is to use the WCS tooling to export your Test workspace and import it into your Production environment. The challenge with this is that WCS will assign the import a new/different ‘workspace_id’, so any existing external components relying on the old ID will need to be updated. So be careful here.

If you only made changes to your intents or entities and you only want to promote those specific changes, you can leverage the respective intent and entity APIs in the WCS API documentation to do so.

Whichever of these approaches you use, note that you will still need to allow some time for training on the workspace updates to complete.

How to Optimize Your Watson Chat Bot Application in Production

You’ve gone live into Production with Watson, great! So, the question is, what do you do now? How do you measure your deployment? How do you improve it? I’d like to talk about some best practices and recommendations on how to optimize your Watson deployment once it’s in production. These best practices are based on my experience with these Watson solutions: WEA (Watson Engagement Advisor), Dialog/NLC (Natural Language Classifier) and the Watson Conversation service available on IBM Bluemix. However, these principles can be applied to other Watson services as well.

Qualitative Analysis

I just want to be clear that the approach I describe below is one measurement that is specific to improving and optimizing quality at the question-to-answer level (e.g., fine-tuning intents and seeing which questions are or are not doing well). To determine the true value of any cognitive chat solution, you need to look at the overall conversation level to measure whether the chat provided value (to keep this focused, short and sweet, I will not go into conversation-level detail here).

One of the keys to optimizing Watson is consistent qualitative analysis (in this case, at the question-to-answer level). This means validating the quality of Watson’s performance on a regular and consistent basis. You can set up the frequency and numbers that work for your organization based on the maturity of the deployment, deployment size, volume of activity (e.g., questions being asked), and resources available, but I will provide an example.

For your first pass, pull roughly 500 questions from your production deployment’s conversation logs and analyze the questions asked along with the responses provided. (You will repeat this on a regular basis, such as every two weeks or less, depending on sample size and available headcount.) You should create a rating scale for how you judge the correctness of the responses. For example:

4 – Perfect answer
3 – Acceptable answer
2 – Not an acceptable answer
1 – Completely unacceptable or ridiculous answer

Your goal is to assess which answers were acceptable/correct and which were not, which in turn tells you which questions require attention. Once you’ve rated all 500 of those questions/responses, add up the number of 3s and 4s as “correct” and the 1s and 2s as “not correct”; the percentage rated correct becomes your baseline number for correctness. Keep this baseline number handy, as you will need it. Now you’re ready to take some action.
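As a quick illustration of that baseline calculation, here is a small Python sketch (the ratings list is made-up sample data): ratings of 3 or 4 count as correct, and the resulting percentage is your baseline.

    # Sketch: compute the correctness baseline from a list of 1-4 ratings.
    def correctness_baseline(ratings):
        """Percentage of responses rated 3 ('acceptable') or 4 ('perfect')."""
        correct = sum(1 for r in ratings if r >= 3)
        return 100.0 * correct / len(ratings)

    # Example with made-up ratings for a 500-question sample:
    sample_ratings = [4] * 260 + [3] * 120 + [2] * 80 + [1] * 40
    print("Baseline correctness: {:.1f}%".format(correctness_baseline(sample_ratings)))
    # -> Baseline correctness: 76.0%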

Take Action

I highly suggest that you look for trends of question intents where Watson is not performing well. I would put much less emphasis on “one-off” types of questions that Watson did not respond well to (unless it’s a critical question to the business or a liability topic). You want to focus on what the majority of your end users are asking to have the largest impact, so again, look for trends.

Once you identify a trend, you can begin to look at ways to enhance the question intent. There are many corrective actions you could take, but it all depends on your analysis of the intent. Maybe you need to add more variations to the intent to expand Watson’s understanding of how this question could be asked or phrased? Maybe you need to adjust your intent groups so that Watson can better differentiate between them? I have another blog post on Best Practices for Training Watson on Intents that I suggest you read for more help.

So up until now, you’ve taken a question sampling, rated that sampling on correctness, analyzed for trends and taken some corrective action. So what is next?

Iterate!

Once it’s been two weeks (or whatever frequency you decided on) since your initial sampling, pull another sample of 500 questions (how often you iterate will depend on your question sample size and available headcount, so this can vary). It is important to pull the same number of questions, analyze/rate them in exactly the same way, and have the same person(s) doing the work; this ensures consistency and continuity across each iterative cycle. It is also important to leave enough time between cycles, after the latest update has been pushed to production, to allow a variety of questions to be received, which helps identify trends.

Now once you’ve run through the cycle again (and each time thereafter), there are several key things you should be looking at:

1) Compare the correctness rating with the previous cycle. Are you doing better, worse or the same?
Note: You are looking for incremental improvement, so a consistent incline of even half a percent is positive progress.

2) Compare the trends of intents you identified as not doing well with the previous cycle to see whether you still have room for improvement on the same intent. Are you performing poorly on the same intent(s) over and over? If so, then you need to seek expert advice to look into what is going wrong.

Evaluate

There is no “one size fits all” guideline as to how long one should iterate on the qualitative analysis cycles to improve their Watson deployment. There are many factors that vary from deployment to deployment such as:

  • Deployment size – The number of intents/topics covered in the deployment
  • If new intents/topics are added to the deployment
  • If requirements change
  • If new user groups/organizations are provided access to the deployment

As you evaluate Watson’s performance over time, you should compare your rating numbers and observe the trend. Eventually, you will hit the point of diminishing returns (unless you continue to add new intents/topics into the solution). You will need to make the judgement call of when you hit this point.

Feedback

One other helpful aspect you could analyze is end user feedback. I have often seen chat applications implement a thumbs up/thumbs down rating for each content response provided by Watson. This feedback can be misleading: an end user could have given a response a thumbs down not because it was the wrong answer, but because they simply did not like the answer. I highly suggest looking for trends of questions whose answers are rated thumbs down. If a group of people are all rating a specific question intent’s response as thumbs down, then I suggest having a closer look to see what corrective action may be needed. Again, focus on trends versus a “one-off”.

The End

Optimizing Watson is not always an easy task, but consistently following the qualitative analysis prescription described here should make the task a bit easier.

Best Practices & Lessons Learned for Training Watson on Intents using Conversation Service (or Natural Language Classifier)

How to properly train Watson to understand intents is a very popular topic. This can be done using the new Watson Conversation Service (or the Natural Language Classifier service) on the IBM Bluemix platform. I have been asked on a weekly, and sometimes daily, basis how this is best done. In an effort to share my experience and expertise from working with Watson over the last few years, I will provide some guidance for beginning and improving your training of intents. The guidance below assumes you are training a conversational bot or chat assistant use case, but most of the concepts can be applied to other use cases.

Reality

The fact is that there are several general best practices one could follow, but every Watson deployment is unique and will require some unique tweaking. Some practices you may not need to follow, some you may need more of, and some things you will do differently. This is cognitive, and this is reality. You will need to think innovatively about how you can take these core practices and leverage them in your deployment.

Best Practices

Use representative end user questions – Assuming a conversational assistant scenario, these are questions captured from end users themselves, in their own words and terminology. This may be more challenging for some organizations to obtain. Some organizations have an existing web chat or online support service that they can mine for representative questions to use as a starting point. For organizations that do not have this type of data available, there are other processes that can be implemented to obtain end user questions, such as polling real end users through an existing system or a new one.

How many variations to train each intent? – Usually, the more diverse the representative question variations you obtain, the more thoroughly you can train your intent. There is no magic number. To properly train Watson on an intent, some intents require 5 variations, some may require more than 50. For optimal results, I would aim to start with 10 variations and then assess based on your test results (a later topic). I’m not saying you need 300 variations per intent, which seems quite overboard, but 10 is a general guideline from my experience that has been a fair starting point.

This will also depend on how many intents you have in your deployment. In cases where you have a low number of intents, such as 12 intents total, you may not need to provide more than 10 variations to achieve high confidence if the intents are very distinct.

Variations – this one has been a huge topic when training Watson. Using Watson Engagement Advisor in the days before the Conversation service/NLC, the guidance I always followed was to use representative questions as variations, and that approach proved successful for that technology.

The same approach should be followed for the Conversation service/NLC. The best practice is to collect representative questions from the actual end users, identify primary intents and use the remaining like-intent questions as variations to train Watson.

I’ve been asked, “Can I make up some of my own variations to add, or do those all have to come from the actual end users?” The usual guidance is that we should not be self-creating variations, specifically in conversational assistant scenarios. In working with the Conversation service/NLC, I’ve found that there may be, at times, an exception to this rule. In some rare circumstances, I have not been able to get Watson to understand an intent with a high confidence level, even after adding many representative question variations. Folks, this is cognitive, and there very well may be clear justification somewhere deep in Watson’s algorithmic mind, but it may not be obvious to the human eye/mind. There have been cases where I was left no choice but to provide a self-generated variation to help Watson learn better.

I want to be crystal clear that this approach should be used as a last resort and handled with care, especially when dealing with conversational bots. I do not recommend starting with it or overusing it. I stand by the approach that everyone should be using representative end user questions as the foundation for their intents and variations. Intents should never be fabricated, ever. Variations can be supplemented with self-created ones on a very rare and careful basis. If you self-create all of your training data, you will likely experience sub-par performance because you are trying to fabricate how end users speak. If you find that you constantly need to add self-created variations to train properly, then you need to revisit your collected representative questions, because something is seriously wrong with them.

Structuring Intents – There are multiple ways you could structure your intents, but I’ll talk about two common methods. It is very common to use a combination of both in a deployment.

One way is to group questions into intents that do not have different entities embedded in them. Here is an example of some questions for an intent designed to ask about general team performance:

Can you tell me how my team is performing?
How is my team doing?
Are my team members performing well?

The other method for structuring your intents is to group questions that include multiple entities embedded within them. This is when you have one high-level intent with smaller sub-intents inside it. You group them all together as one intent to give Watson a stronger confidence level on the high-level intent, and then in Conversation/Dialog, you use entity extraction to pull out the sub-intent. Here is an example:

Can you tell me how my team is performing?
Can you show me performance across all teams?
How is the team performing in the US?

As you can see, all three of these questions are asking about team performance but focus on three different aspects of it. Depending on your case, you could train Watson on all of these as one intent called “Team_Performance” and then use entity extraction to examine the user utterance for the sub-intents (e.g., “across all teams” or “in the US”).

*Technically, there is another possibility where you could identify sub-intents by using multiple classifiers/intent services but that would require additional effort and configuration at the orchestration layer that I will not go into here.
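As a rough illustration of this second method, here is a Python sketch of what the orchestration-layer check might look like: the Conversation ‘message’ call returns both the top intent and any detected entities, and the code routes to a sub-answer based on a hypothetical ‘scope’ entity. The entity name and values, the service URL, and the credentials are all assumptions for illustration.

    # Sketch: route a single 'Team_Performance' intent to sub-answers using
    # the entities returned by the Conversation 'message' call.
    import requests

    WCS_URL = "https://gateway.watsonplatform.net/conversation/api"  # assumption: your instance URL
    AUTH = ("your-service-username", "your-service-password")        # assumption: service credentials
    VERSION = "2017-05-26"                                           # assumption: API version date
    WORKSPACE_ID = "your-workspace-id"

    def classify(utterance):
        """Send one utterance to the workspace and return the raw message response."""
        resp = requests.post(
            "{}/v1/workspaces/{}/message".format(WCS_URL, WORKSPACE_ID),
            params={"version": VERSION},
            auth=AUTH,
            json={"input": {"text": utterance}},
        )
        resp.raise_for_status()
        return resp.json()

    result = classify("How is the team performing in the US?")
    top_intent = result["intents"][0]["intent"] if result["intents"] else None
    entities = {e["entity"]: e["value"] for e in result["entities"]}

    if top_intent == "Team_Performance":
        # 'scope' is a hypothetical entity used here to capture the sub-intent.
        scope = entities.get("scope", "my_team")
        print("Answering a team-performance question scoped to:", scope)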

Iterative Training & Testing Cycles – This is key to the Watson learning process. The sooner you can put your Watson deployment in front of real end users, the higher quality your deployment will be when you go live. Again, in some organizations, this may be easier to achieve, such as if your end users are internal employees. However, this may be more challenging for organizations where their end users are external customers, but still achievable.

I would use the same number of questions per test session and craft the sessions around the same topics (to ensure the test questions are targeted against the use case intents). Once you’ve achieved satisfactory performance, you can move on to other topics. After your first test session, capture your results, including the percentage of questions that returned the proper intent. This becomes your baseline. Once you have your baseline, thoroughly review the results and look for trends of questions that require improvement. Start with the most poorly performing intents, whether they are existing or new intents. Your calibration approach could include refining intent groups, removing questions, adding questions, splitting intent groups, or adding new intent groups.
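Here is a minimal sketch of one way such a test session could be scored, using the same ‘message’ endpoint and placeholder credentials as in the earlier sketches; the test pairs are made up for illustration. Each test question is sent to the workspace, the returned top intent is compared against the expected intent, and the hit rate becomes the percentage you track as your baseline.

    # Sketch: score a test session by comparing the returned top intent with
    # the expected intent for each test question.
    import requests

    WCS_URL = "https://gateway.watsonplatform.net/conversation/api"  # assumption: your instance URL
    AUTH = ("your-service-username", "your-service-password")        # assumption: service credentials
    VERSION = "2017-05-26"                                           # assumption: API version date
    WORKSPACE_ID = "your-workspace-id"

    # Made-up test pairs of (question, expected intent).
    test_set = [
        ("How is my team doing?", "Team_Performance"),
        ("Can you show me performance across all teams?", "Team_Performance"),
    ]

    def top_intent(utterance):
        resp = requests.post(
            "{}/v1/workspaces/{}/message".format(WCS_URL, WORKSPACE_ID),
            params={"version": VERSION},
            auth=AUTH,
            json={"input": {"text": utterance}},
        )
        intents = resp.json().get("intents", [])
        return intents[0]["intent"] if intents else None

    hits = sum(1 for question, expected in test_set if top_intent(question) == expected)
    print("Correct intent for {:.0f}% of test questions".format(100.0 * hits / len(test_set)))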

Conclusion

I hope this helps provide you additional guidance to model, train, test and calibrate your intents using the Watson Conversation service (or the Natural Language Classifier service). I will continue to update this blog post to share any new information that I may learn as we continue on through the cognitive journey.