How to Optimize Your Watson Chat Bot Application in Production

You’ve gone live in production with Watson, great! So what do you do now? How do you measure your deployment? How do you improve it? I’d like to share some best practices and recommendations for optimizing your Watson deployment once it’s in production. These best practices are based on my experience with the following Watson solutions: WEA (Watson Engagement Advisor), Dialog/NLC (Natural Language Classifier) and the Watson Conversation service available on IBM Bluemix. However, the principles can be applied to other Watson services as well.

Qualitative Analysis

I want to be clear that the approach described below is one measurement, specific to improving and optimizing quality at the question-to-answer level (e.g. fine-tuning intents and seeing which questions are doing well or not well). To determine the true value of any cognitive chat solution, you also need to look at the overall conversation level to measure whether the chat provided value. To keep this focused, short and sweet, I will not go into conversation-level detail here.

One of the keys to optimizing Watson is consistent qualitative analysis (in this case, at the question-to-answer level). This means validating the quality of Watson’s performance on a regular and consistent basis. You can set up the frequency and sample sizes that work for your organization based on the maturity of the deployment, deployment size, volume of activity (e.g. questions being asked) and resources available, but I will provide an example.

For your first pass, you can pull ~500 questions from your production deployment’s conversation logs and analyze the questions asked along with the responses provided. You would then repeat this on a regular basis, such as every two weeks or less, depending on sample size and available headcount. You should create a rating scale for scoring the responses on correctness. For example:

4 – Perfect answer
3 – Acceptable answer
2 – Not an acceptable answer
1 – Completely unacceptable or ridiculous answer

Your goal is to assess which answers were acceptable/correct and which were not, which in turn tells you which questions require attention. Once you’ve rated all 500 of those questions/responses, count the 3s and 4s as “correct” and the 1s and 2s as “not correct”. The share of “correct” responses out of the total rated becomes your baseline percentage for correctness. Keep this baseline number handy, as you will need it. Now you’re ready to take some action.
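To make the arithmetic concrete, here is a minimal sketch of the baseline calculation. The function name and the sample rating counts are made up for illustration. It only assumes you have your ratings as a list of integers from 1 to 4.

```python
from collections import Counter

def correctness_baseline(ratings):
    """Return the percentage of responses rated 3 or 4 ("correct")."""
    if not ratings:
        raise ValueError("ratings must not be empty")
    counts = Counter(ratings)
    correct = counts[3] + counts[4]  # 3s and 4s count as correct
    return 100.0 * correct / len(ratings)

# Hypothetical sample of 500 rated responses (made-up numbers)
sample_ratings = [4] * 210 + [3] * 140 + [2] * 100 + [1] * 50
baseline = correctness_baseline(sample_ratings)
print(f"Baseline correctness: {baseline:.1f}%")
```

With these made-up counts, 350 of the 500 responses are rated 3 or 4, so the baseline is 70%.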

Take Action

I highly suggest that you look for trends of question intents where Watson is not performing well. I would put much less emphasis on “one-off” types of questions that Watson did not respond well to (unless it’s a critical question to the business or a liability topic). You want to focus on what the majority of your end users are asking to have the largest impact, so again, look for trends.
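One way to surface those trends is to group your rated sample by intent and flag the intents that are both frequently asked and frequently wrong. The sketch below assumes each rated row is an (intent, rating) pair. The thresholds `min_questions` and `max_correct_pct` are arbitrary placeholders that you would tune for your own deployment.

```python
from collections import defaultdict

def weak_intents(rated_rows, min_questions=5, max_correct_pct=60.0):
    """Flag intents asked at least min_questions times whose share of
    correct (rated 3 or 4) responses is at or below max_correct_pct."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for intent, rating in rated_rows:
        totals[intent] += 1
        if rating >= 3:
            correct[intent] += 1

    flagged = []
    for intent, total in totals.items():
        pct = 100.0 * correct[intent] / total
        # Skip "one-off" intents that appear too rarely to be a trend
        if total >= min_questions and pct <= max_correct_pct:
            flagged.append((intent, total, pct))
    # Most frequently asked weak intents first, for the largest impact
    return sorted(flagged, key=lambda row: row[1], reverse=True)

# Hypothetical rated rows: (intent, rating)
rows = (
    [("reset_password", 2)] * 6
    + [("reset_password", 4)] * 2
    + [("store_hours", 4)] * 10
    + [("one_off", 1)]
)
for intent, total, pct in weak_intents(rows):
    print(f"{intent}: {total} questions, {pct:.0f}% correct")
```

A critical business or liability question would still deserve attention even if it falls below `min_questions`, so treat the output as a starting point rather than a complete list.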

Once you identify a trend, you can begin to look at ways to enhance the question intent. There are many corrective actions that you could take but it all depends on your analysis of the intent. Maybe you need to add more variations to the intent to expand Watson’s understanding of how this question could be asked or phrased? Maybe you need to adjust your intent groups so that Watson can better delineate between them? I have another blog post that talks about Best Practices for Training Watson on Intents that I suggest you read to help you.

So up until now, you’ve taken a question sampling, rated that sampling on correctness, analyzed for trends and taken some corrective action. So what is next?


Once it’s been two weeks (or whatever your chosen frequency is) since your initial sampling, pull another sample of 500 questions. How often you iterate will depend on your sample size and available headcount, so this can vary. It is important to pull the same number of questions each time and to analyze and rate them in exactly the same way. You also want the same person(s) doing this work. This ensures consistency and continuity across each iterative cycle. It is also important to allow enough time between cycles, after the latest update has been pushed to production, for a variety of questions to come in, which helps identify trends.
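If you pull the sample programmatically, a random draw of a fixed size keeps each cycle comparable. This is a minimal sketch using Python’s standard `random` module. The function name, the in-memory list of log entries, and the seed are assumptions for illustration, since how you export conversation logs depends on your deployment.

```python
import random

def pull_sample(log_entries, sample_size=500, seed=None):
    """Draw a random sample of up to sample_size entries without replacement."""
    entries = list(log_entries)
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    k = min(sample_size, len(entries))
    return rng.sample(entries, k)

# Hypothetical log export: 2,000 questions asked since the last update
logs = [f"question {i}" for i in range(2000)]
cycle_sample = pull_sample(logs, sample_size=500, seed=1)
print(len(cycle_sample))
```

Restricting `log_entries` to conversations that happened after your latest production update keeps the sample from mixing old and new behavior.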

Now once you’ve run through the cycle again (and each time thereafter), there are several key things you should be looking at:

1) Compare the correctness rating with the previous cycle. Are you doing better, worse or the same?
Note: You are looking for incremental improvement, so a consistent increase of even half a percentage point per cycle is positive progress.

2) Compare the trends of intents you identified that are not doing well with the previous cycle to see if you still have room for improvement on the same intent. Are you not performing well in the same intent(s) over and over? If so, then you need to seek Expert Advice to look into what is going wrong.
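Both comparisons can be sketched in a few lines. The example below assumes you have kept each cycle’s overall correctness percentage and the set of intents you flagged as weak. All names and numbers are hypothetical.

```python
def compare_cycles(prev_pct, curr_pct, prev_weak, curr_weak):
    """Compare two cycles: overall correctness change and recurring weak intents."""
    delta = curr_pct - prev_pct
    if delta > 0:
        direction = "better"
    elif delta < 0:
        direction = "worse"
    else:
        direction = "same"
    # Intents that were weak in both cycles are candidates for expert advice
    recurring = sorted(set(prev_weak) & set(curr_weak))
    return {
        "delta": delta,
        "direction": direction,
        "recurring_weak_intents": recurring,
    }

# Hypothetical results from two consecutive cycles
report = compare_cycles(
    prev_pct=70.0,
    curr_pct=70.5,
    prev_weak={"reset_password", "billing_dispute"},
    curr_weak={"billing_dispute", "cancel_order"},
)
print(report)
```

In this made-up example, correctness rose by half a point, and `billing_dispute` shows up as weak in both cycles, which is the kind of repeat offender worth a closer look.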


There is no “one size fits all” guideline as to how long one should iterate on the qualitative analysis cycles to improve their Watson deployment. There are many factors that vary from deployment to deployment such as:

  • Deployment size – The number of intents/topics covered in the deployment
  • If new intents/topics are added to the deployment
  • If requirements change
  • If new user groups/organizations are provided access to the deployment

As you evaluate Watson’s performance over time, you should compare your rating numbers and observe the trend. Eventually, you will hit the point of diminishing returns (unless you continue to add new intents/topics into the solution). You will need to make the judgement call of when you hit this point.
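Deciding when you have hit that point is a judgement call, but a simple check over your history of correctness percentages can help inform it. The sketch below flags a plateau when every gain over the last few cycles is below a threshold. The `window` and `min_gain` defaults are assumptions of mine, not a rule.

```python
def has_plateaued(history, window=3, min_gain=0.5):
    """Return True if each of the last `window` cycle-over-cycle gains
    is smaller than `min_gain` percentage points."""
    if len(history) < window + 1:
        return False  # not enough cycles to judge
    recent = history[-(window + 1):]
    gains = [later - earlier for earlier, later in zip(recent, recent[1:])]
    return all(gain < min_gain for gain in gains)

# Hypothetical correctness percentages, one per cycle
history = [70.0, 74.0, 77.0, 77.25, 77.5, 77.5]
print(has_plateaued(history))
```

If you have recently added new intents or opened the deployment to new user groups, a flat trend may not mean you are done, so read the result alongside what has changed in the deployment.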


One other helpful aspect you could analyze is end user feedback. I have often seen chat applications implement a thumbs up/thumbs down rating for each response provided by Watson. This feedback can be misleading, as the end user may have given a response a thumbs down not because it was the wrong answer, but because they simply did not like the answer. I highly suggest looking for trends of questions with answers rated thumbs down. If there is a group of people all rating a specific question intent’s response as thumbs down, then I suggest taking a closer look to see what corrective action may be needed. Again, focus on trends rather than a “one-off”.
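As a rough sketch, you could count how many distinct users gave a thumbs down per intent, so that one unhappy user clicking repeatedly does not look like a trend. The record layout (user id, intent, thumbs up flag) and the `min_users` threshold are assumptions for illustration.

```python
from collections import defaultdict

def thumbs_down_trends(feedback, min_users=3):
    """Return {intent: distinct thumbs-down users} for intents that at
    least min_users different users rated thumbs down."""
    users_by_intent = defaultdict(set)
    for user_id, intent, thumbs_up in feedback:
        if not thumbs_up:
            users_by_intent[intent].add(user_id)
    trends = {
        intent: len(users)
        for intent, users in users_by_intent.items()
        if len(users) >= min_users
    }
    # Intents with the most unhappy users first
    return dict(sorted(trends.items(), key=lambda kv: kv[1], reverse=True))

# Hypothetical feedback records: (user_id, intent, thumbs_up)
feedback = [
    ("u1", "billing", False),
    ("u2", "billing", False),
    ("u3", "billing", False),
    ("u1", "billing", False),  # same user again, counted once
    ("u4", "store_hours", False),
    ("u5", "store_hours", True),
]
print(thumbs_down_trends(feedback))
```

An intent flagged here still needs a human look at the actual responses, since a thumbs down tells you users were unhappy, not why.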

The End

Optimizing Watson is not always an easy task, but following this qualitative analysis prescription consistently as described should help provide some guidance in making the task a bit easier.