Best Practices & Lessons Learned for Training Watson on Intents using Conversation Service (or Natural Language Classifier)

How to properly train Watson to understand intents is a very popular topic. This can be done using the new Watson Conversation Service (or the Natural Language Classifier service) on the IBM Bluemix platform. I am asked on a weekly, and sometimes daily, basis how this is best done. To share my experience and expertise from working with Watson over the last few years, I will provide some guidance for beginning and improving your training of intents. The guidance below is written with a conversational bot or chat assistant use case in mind, but most of the concepts can be applied to other use cases.

Reality

The fact is that there are several general best practices that one could follow, but every Watson deployment is unique and will require some tweaking of its own. Some practices you may not need to follow, some you may need more of, and some things you will do differently. This is cognitive and this is reality. You will need to think innovatively about how you can take these core practices and leverage them in your deployment.

Best Practices

Use representative end user questions – Assuming a conversational assistant scenario, these are questions that are captured from end users themselves, in their own words and terminology. This may be a bit more challenging for some organizations to obtain. Some organizations have an existing web chat or online support service that they can scrub representative questions from to use as a starting point. For those organizations that do not have this type of data available, there are other processes that can be implemented to obtain end user questions, such as polling real end users through an existing system or through a new one.

Use at least 10 variations to train each intent – the more diverse the representative question variations you obtain, the more thoroughly you can train your intent. For optimal results, I would aim for more than 10 variations, but this depends on how well your testing goes (a later topic). I’m not saying you need 300 variations per intent, which seems excessive, but in my experience 10 is the minimum as a general guideline.

This will also depend on how many intents you have in your deployment. In cases where you have a low number of intents, such as 12 intents total, you may not need to provide more than 10 variations to achieve high confidence if the intents are very distinct.
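The 10-variation guideline can be checked mechanically before each training run. The following is a minimal sketch in Python, not part of any Watson SDK: it assumes your training data is available as (intent, question) pairs, and the function name, the sample data and the intent names are hypothetical.

```python
from collections import Counter

MIN_VARIATIONS = 10  # general guideline from this post, not a service limit


def find_undertrained_intents(examples, minimum=MIN_VARIATIONS):
    """Return {intent: count} for intents with fewer than `minimum` distinct variations.

    `examples` is an iterable of (intent, question) pairs. Questions that differ
    only in case or surrounding whitespace are counted once.
    """
    distinct = {(intent, text.strip().lower()) for intent, text in examples}
    counts = Counter(intent for intent, _ in distinct)
    return {intent: n for intent, n in counts.items() if n < minimum}


# Hypothetical sample data for illustration only.
sample = [
    ("Team_Performance", "How is my team doing?"),
    ("Team_Performance", "how is my team doing? "),  # duplicate after normalizing
    ("Reset_Password", "I forgot my password"),
]

if __name__ == "__main__":
    for intent, n in find_undertrained_intents(sample).items():
        print(f"{intent}: only {n} distinct variation(s)")
```

Treat the flagged intents as candidates for collecting more representative end user questions, not as a prompt to invent variations yourself.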

Variations – this one has been a huge topic when training Watson. When using Watson Engagement Advisor in the days before the Conversation service/NLC, the guidance I followed was always to use representative questions as variations. I always followed that approach for that technology and it proved successful.

The same approach should be followed for the Conversation service/NLC. The best practice is to collect representative questions from the actual end users, identify primary intents and use the remaining like-intent questions as variations to train Watson.

I’ve been asked “Can I make up some of my own variations to add or do those all have to come from the actual end users?” The usual guidance is that we should not be self-creating variations, specifically in conversational assistant scenarios. In working with the Conversation service/NLC, I’ve found that there may be, at times, an exception to this case. In some rare circumstances, I have not been able to get Watson to understand an intent with a high confidence level, even after adding many representative question variations. Folks, this is cognitive and there very well may be clear justification somewhere deep in Watson’s algorithmic mind, but it may not be obvious to the human eye/mind. There have been cases where I was left no choice but to provide a self-generated variation to help Watson learn better.

I want to be crystal clear that this approach should be used as a last resort and handled with care, especially when dealing with conversational bots. I do not recommend starting with this approach or overusing it. I stand by the approach that everyone should be using representative end user questions as their foundation for intents and variations. Intents should never be fabricated, ever. Variations can be supplemented with self-created variations on a very rare and careful basis. If you self-create all your intents, you will likely experience sub-par performance because you are trying to fabricate the way end users speak. If you find that you constantly need to add self-created variations to train properly, then you need to revisit your collected representative questions, as something is seriously wrong with them.

Structuring Intents – There are multiple ways you could structure your intents, but I’ll talk about two common methods. It is very common to use a combination of these in your deployment.

One method is to group questions that do not have different entities embedded in them. Here is an example of some questions for an intent designed to ask about general team performance:

Can you tell me how my team is performing?
How is my team doing?
Are my team members performing well?

The other method of structuring your intents is to group questions that have multiple entities embedded in them. This is when you have one high-level intent with smaller sub-intents within it. You would group them all together as one intent to give Watson a stronger confidence level for the high-level intent, and then in Conversation/Dialog, you would use entity extraction to pull out what the sub-intent is. Here is an example:

Can you tell me how my team is performing?
Can you show me performance across all teams?
How is the team performing in the US?

As you can see, all three of these questions are asking about team performance but focusing on three different aspects of it. Depending on your case, you could train Watson on all of these as one intent called “Team_Performance” and then use entity extraction to examine the user utterance for the sub-intents (e.g., “across all teams” or “in the US”).

*Technically, there is another possibility where you could identify sub-intents by using multiple classifiers/intent services but that would require additional effort and configuration at the orchestration layer that I will not go into here.
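To make the single-intent-plus-entities method more concrete, here is a minimal Python sketch of what the entity extraction step does conceptually. It is a stand-in, not the Conversation service's own entity feature: the entity names ("scope", "region"), their values and the synonym lists are all assumptions made up for this illustration, and matching is a simple whole-phrase search.

```python
import re

# Hypothetical entity definitions: entity -> value -> list of synonyms.
ENTITIES = {
    "scope": {"all_teams": ["across all teams", "all teams"]},
    "region": {"US": ["in the us", "united states"]},
}


def extract_entities(utterance, entities=ENTITIES):
    """Return {entity: value} for every entity whose synonym appears in the utterance."""
    text = utterance.lower()
    found = {}
    for entity, values in entities.items():
        for value, synonyms in values.items():
            # Word boundaries keep "in the us" from matching "in the usual".
            if any(re.search(r"\b" + re.escape(s) + r"\b", text) for s in synonyms):
                found[entity] = value
    return found


if __name__ == "__main__":
    # All three questions would be trained under the one Team_Performance intent.
    for question in (
        "Can you tell me how my team is performing?",
        "Can you show me performance across all teams?",
        "How is the team performing in the US?",
    ):
        print(question, "->", extract_entities(question))
```

An empty result corresponds to the general team performance question, while a detected entity tells the dialog which sub-intent to respond to.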

Iterative Training & Testing Cycles – This is key to the Watson learning process. The sooner you can put your Watson deployment in front of real end users, the higher quality your deployment will be when you go live. Again, in some organizations, this may be easier to achieve, such as if your end users are internal employees. However, this may be more challenging for organizations where their end users are external customers, but still achievable.

I would use the same number of questions per test session and craft the sessions around the same topics (to ensure the test questions are targeted at the use case intents). Once you’ve achieved satisfactory performance, you can move on to other topics. After your first test session, capture your results, including the percentage of questions that returned the proper intent. This will become your baseline. Once you’ve got your baseline, thoroughly review the results and look for trends in the questions that require improvement. You should start with the most poorly performing intents, whether they are existing intents or new intents. Your calibration could involve refining intent groups, removing questions, adding questions, splitting intent groups, or adding new intent groups.
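The baseline calculation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it assumes each test result has been recorded as an (expected intent, returned intent) pair, and the function name and the intent names in the sample are hypothetical.

```python
from collections import defaultdict


def score_session(results):
    """Score one test session.

    `results` is a list of (expected_intent, returned_intent) pairs.
    Returns (overall_percent_correct, ranked), where `ranked` is a list of
    (intent, percent_correct) sorted from worst to best performing.
    """
    tallies = defaultdict(lambda: [0, 0])  # intent -> [correct, total]
    for expected, returned in results:
        tallies[expected][1] += 1
        if expected == returned:
            tallies[expected][0] += 1

    total = sum(n for _, n in tallies.values())
    correct = sum(c for c, _ in tallies.values())
    overall = 100.0 * correct / total if total else 0.0
    ranked = sorted(
        ((intent, 100.0 * c / n) for intent, (c, n) in tallies.items()),
        key=lambda pair: pair[1],
    )
    return overall, ranked


if __name__ == "__main__":
    # Hypothetical results from a first test session.
    session = [
        ("Team_Performance", "Team_Performance"),
        ("Team_Performance", "Reset_Password"),
        ("Reset_Password", "Reset_Password"),
        ("Reset_Password", "Reset_Password"),
    ]
    baseline, worst_first = score_session(session)
    print(f"Baseline: {baseline:.1f}% correct")
    for intent, pct in worst_first:
        print(f"  {intent}: {pct:.1f}%")
```

Keeping the same number of questions and the same topics in each session makes the overall percentage comparable from one session to the next, and the worst-first ordering points you at the intents to calibrate first.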

Conclusion

I hope this provides you with additional guidance to model, train, test and calibrate your intents using the Watson Conversation service (or the Natural Language Classifier service). I will continue to update this blog post to share any new information that I learn as we continue on the cognitive journey.