
How to Check Your API Usage for IBM Bluemix Services

Ever wondered how to check the number of API calls you have consumed with your IBM Bluemix services? If so, that information is right at your fingertips. Bluemix has a usage tracker that is easy to use and conveniently provides your API usage for the last 12 months. Note that some services do not report usage here and bill separately. You also need to make sure you have the appropriate privileges to access this information.

To access the Usage Dashboard, select the “Manage” menu at the top right of your display, hover over “Billing and Usage”, and then select “Usage”.

You will now be looking at your Usage Dashboard, where you will see a bar chart showing your overall usage charges for the last 12 months. The current month will be highlighted.

[Screenshot: usage charges bar chart]

Now scroll down to the “Services Charges” section, where you will see a list of your services. Once you’ve located the service of interest, click the twisty (little triangle) to the right to expand it, and you will see the number of API calls and the associated costs.


[Screenshot: Conversation service API usage]

If you want to see usage for previous months, just scroll back up to the bar chart and select the month of interest; the services data below will refresh.

That’s all; it’s that simple!


How to Optimize Your Watson Chat Bot Application in Production

You’ve gone live in production with Watson. Great! So, the question is, what do you do now? How do you measure your deployment? How do you improve it? I’d like to share some best practices and recommendations for optimizing your Watson deployment once it’s in production. These best practices are based on my experience with these Watson solutions: WEA (Watson Engagement Advisor), Dialog/NLC (Natural Language Classifier), and the Watson Conversation service available on IBM Bluemix. However, these principles can be applied to other Watson services as well.

Qualitative Analysis

I just want to be clear that the approach described below is one measurement, specific to improving and optimizing quality at the question-to-answer level (e.g., fine-tuning intents and seeing which questions are doing well or poorly). To determine the true value of any cognitive chat solution, you also need to look at the overall conversation level to measure whether the chat provided value (to keep this focused, short, and sweet, I will not go into conversation-level detail here).

One of the keys to optimizing Watson is consistent qualitative analysis (in this case, at the question-to-answer level): validating the quality of Watson’s performance on a regular, consistent basis. You can set up the frequency and numbers that work for your organization based on the maturity of the deployment, its size, the volume of activity (e.g., questions being asked), and the resources available, but I will provide an example.

For your first pass, pull roughly 500 questions from your production deployment’s conversation logs (you will repeat this on a regular basis, such as every two weeks, depending on sample size and available headcount) and analyze the questions asked along with the responses provided. Create a rating scale for scoring the responses on correctness. For example:

4 – Perfect answer
3 – Acceptable answer
2 – Not an acceptable answer
1 – Completely unacceptable or ridiculous answer

Your goal is to assess which answers were acceptable/correct and which were not, which in turn tells you which questions require attention. Once you’ve rated all 500 of those questions/responses, add up the 3s and 4s as “correct” and the 1s and 2s as “not correct”; the resulting percentage of correct responses becomes your baseline for correctness. Keep this baseline number handy, as you will need it. Now you’re ready to take some action.
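To make the arithmetic concrete, here is a minimal Python sketch that turns a list of reviewer ratings into the baseline correctness percentage. The ratings list and its format are assumptions for illustration; use whatever export your review process actually produces.

    # A minimal sketch, assuming the reviewer ratings are exported as a
    # simple list of integers (1-4), one per question/response pair reviewed.
    from collections import Counter

    ratings = [4, 3, 2, 4, 1, 3, 3, 4, 2, 4]  # hypothetical reviewer scores

    counts = Counter(ratings)
    correct = counts[3] + counts[4]      # 3s and 4s count as "correct"
    not_correct = counts[1] + counts[2]  # 1s and 2s count as "not correct"

    baseline = correct / len(ratings) * 100  # baseline correctness percentage
    print(f"Correct: {correct}, Not correct: {not_correct}, Baseline: {baseline:.1f}%")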

Take Action

I highly suggest that you look for trends of question intents where Watson is not performing well. I would put much less emphasis on “one-off” types of questions that Watson did not respond well to (unless it’s a critical question to the business or a liability topic). You want to focus on what the majority of your end users are asking to have the largest impact, so again, look for trends.

Once you identify a trend, you can begin to look at ways to enhance the question intent. There are many corrective actions you could take, but it all depends on your analysis of the intent. Maybe you need to add more variations to the intent to expand Watson’s understanding of how the question could be asked or phrased. Maybe you need to adjust your intent groups so that Watson can better delineate between them. I have another blog post on Best Practices for Training Watson on Intents that I suggest you read for more help.

Up to this point, you’ve taken a question sample, rated it for correctness, analyzed it for trends, and taken some corrective action. So what’s next?

Iterate!

Once it’s been two weeks (or whatever frequency you decided on) since your initial sampling, pull another sample of 500 questions; how often you iterate will depend on your question sample size and available headcount, so this can vary. It is important to pull the same number of questions and to analyze/rate them in exactly the same way, with the same person(s) doing the work, to ensure consistency and continuity across each iterative cycle. It is also important to allow enough time between cycles, after the latest update has been pushed to production, for a variety of questions to come in, which helps you identify trends.

Now once you’ve run through the cycle again (and each time thereafter), there are several key things you should be looking at:

1) Compare the correctness rating with the previous cycle. Are you doing better, worse, or the same?
Note: You are looking for incremental improvement, so a consistent gain of even half a percentage point is positive progress.

2) Compare the trends of underperforming intents against the previous cycle to see whether the same intents still have room for improvement. Are the same intent(s) performing poorly over and over? If so, you need to seek expert advice to look into what is going wrong.
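If it helps to make the bookkeeping concrete, here is a minimal Python sketch of this cycle-over-cycle comparison. The intent names, per-intent correctness numbers, and the threshold are assumptions purely for illustration; track whatever metrics your own review process produces.

    # A minimal sketch of the cycle-over-cycle comparison. The intent names,
    # correctness numbers, and the 70% threshold are illustrative assumptions.
    previous = {"overall": 71.4, "Team_Performance": 62.0, "Reset_Password": 80.0}
    current = {"overall": 72.1, "Team_Performance": 61.5, "Reset_Password": 84.0}

    delta = current["overall"] - previous["overall"]
    print(f"Overall correctness moved {delta:+.1f} points this cycle")

    # Flag intents that stayed below the acceptable threshold two cycles in a
    # row; these recurring trends are the ones worth escalating.
    THRESHOLD = 70.0
    recurring = [name for name in current
                 if name != "overall"
                 and current[name] < THRESHOLD
                 and previous.get(name, 100.0) < THRESHOLD]
    print("Intents underperforming in consecutive cycles:", recurring)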

Evaluate

There is no “one size fits all” guideline for how long you should iterate on these qualitative analysis cycles to improve your Watson deployment. There are many factors that vary from deployment to deployment, such as:

  • Deployment size – The number of intents/topics covered in the deployment
  • If new intents/topics are added to the deployment
  • If requirements change
  • If new user groups/organizations are provided access to the deployment

As you evaluate Watson’s performance over time, you should compare your rating numbers and observe the trend. Eventually, you will hit the point of diminishing returns (unless you continue to add new intents/topics into the solution). You will need to make the judgement call of when you hit this point.

Feedback

One other helpful thing to analyze is end user feedback. I have seen many chat applications implement a thumbs-up/thumbs-down rating for each content response provided by Watson. This feedback can be misleading: an end user may give a response a thumbs down not because it was the wrong answer, but because they simply did not like the answer. I highly suggest looking for trends of questions whose answers are rated thumbs down. If a group of people are all rating a specific question intent’s response as thumbs down, take a closer look to see what corrective action may be needed. Again, focus on trends versus one-offs.
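As a rough illustration, a few lines of Python can surface the thumbs-down trends worth reviewing. The feedback records, intent names, and the 50% threshold below are assumptions for the sketch, not part of any Watson API; substitute your own application’s feedback log.

    # A minimal sketch for spotting thumbs-down trends. The feedback records,
    # intent names, and the 50% review threshold are illustrative assumptions.
    from collections import Counter

    feedback = [
        ("Team_Performance", "down"),
        ("Team_Performance", "down"),
        ("Team_Performance", "up"),
        ("Reset_Password", "up"),
        ("Reset_Password", "up"),
        ("Reset_Password", "down"),
    ]

    downs = Counter(intent for intent, rating in feedback if rating == "down")
    totals = Counter(intent for intent, _ in feedback)

    # Surface intents where a notable share of responses were rated thumbs down.
    for intent, total in totals.items():
        if downs[intent] / total >= 0.5:
            print(f"{intent}: {downs[intent]}/{total} thumbs down -- worth a closer look")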

The End

Optimizing Watson is not always an easy task, but consistently following the qualitative analysis prescription described here should make the task a bit easier.

Best Practices & Lessons Learned for Training Watson on Intents using Conversation Service (or Natural Language Classifier)

How to properly train Watson to understand intents is a very popular topic. This can be done using the new Watson Conversation service (or the Natural Language Classifier service) on the IBM Bluemix platform. I have been asked on a weekly, and sometimes daily, basis how this is best done. In an effort to share my experience from working with Watson over the last few years, I will provide some guidance for beginning and improving your training of intents. The guidance below is presented in the context of training a conversational bot or chat assistant use case, but most of the concepts can be applied to other use cases.

Reality

The fact is that there are several general best practices anyone can follow, but every Watson deployment is unique and will require some unique tweaking. Some practices you may not need to follow, some you may need more of, and some things you will do differently. This is cognitive, and this is reality. You will need to think innovatively about how you can take these core practices and leverage them in your deployment.

Best Practices

Use representative end user questions – Assuming a conversational assistant scenario, these are questions captured from end users themselves, in their own words and terminology. This may be challenging for some organizations to obtain. Some organizations have an existing web chat or online support service from which they can scrub representative questions to use as a starting point. For organizations that do not have this type of data available, there are other processes that can be put in place to obtain end user questions, such as polling real end users through an existing system or a new one.

How many variations to train each intent? – Usually, the more diverse the representative question variations you obtain, the more thoroughly you can train your intent. There is no magic number: some intents require 5 variations, some may require more than 50. For optimal results, I would target starting with 10 variations and then assess based on your test results (a later topic). I’m not saying you need 300 variations per intent, which seems quite overboard, but 10 is a general guideline from my experience that has proven a fair starting point.

This will also depend on how many intents you have in your deployment. In cases where you have a low number of intents, such as 12 intents total, you may not need to provide more than 10 variations to achieve high confidence if the intents are very distinct.

Variations – This has been a huge topic when training Watson. With Watson Engagement Advisor, in the days before the Conversation service/NLC, the guidance I followed was always to use representative questions as variations, and that approach proved successful.

The same approach should be followed for the Conversation service/NLC. The best practice is to collect representative questions from actual end users, identify the primary intents, and use the remaining like-intent questions as variations to train Watson.

I’ve been asked, “Can I make up some of my own variations, or do they all have to come from actual end users?” The usual guidance is not to self-create variations, specifically in conversational assistant scenarios. In working with the Conversation service/NLC, however, I’ve found that there can occasionally be an exception. In some rare circumstances, I have not been able to get Watson to understand an intent with a high confidence level, even after adding many representative question variations. Folks, this is cognitive, and there may well be a clear justification somewhere deep in Watson’s algorithmic mind, but it may not be obvious to the human eye. There have been cases where I was left with no choice but to provide a self-generated variation to help Watson learn better.

I want to be crystal clear that this approach should be used as a last resort and handled with care, especially when dealing with conversational bots. I do not recommend starting with it or overusing it. I stand by the approach that everyone should use representative end user questions as the foundation for intents and variations. Intents should never be fabricated, ever. Variations can be supplemented with self-created ones on a very rare and careful basis. If you self-create all of your variations, you will likely experience sub-par performance, because you are trying to fabricate end user speak. If you find that you constantly need to add self-created variations to train properly, then you need to revisit your collected representative questions, because something is really wrong.

Structuring Intents – There are multiple ways you could structure your intents, but I’ll talk about two common methods. It is very common to use a combination of both in a single deployment.

One method is to group questions into an intent that does not have different entities embedded in it. Here is an example of some questions for an intent designed to ask about general team performance:

Can you tell me how my team is performing?
How is my team doing?
Are my team members performing well?

The other method is to group questions that include multiple entities embedded in them. This is when you have one high-level intent with smaller sub-intents within it. You group them all together as one intent to give Watson a stronger confidence level on the high-level intent, and then in Conversation/Dialog you use entity extraction to pull out the sub-intent. Here is an example:

Can you tell me how my team is performing?
Can you show me performance across all teams?
How is the team performing in the US?

As you can see, all three of these questions are asking about team performance but focusing on three different aspects of it. Depending on your case, you could train Watson on all of these as one intent called “Team_Performance” and then use entity extraction to examine the user utterance for the sub-intents (e.g., “across all teams” or “in the US”).
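To make this second pattern concrete, here is a minimal Python sketch of what the orchestration layer might do once the high-level intent has been identified. The phrase-to-sub-intent mapping and the default are assumptions for illustration only; in practice you would rely on the Conversation service’s entity detection rather than simple substring matching.

    # A minimal sketch of orchestration-layer logic, assuming the classifier
    # has already returned the high-level intent "Team_Performance". The entity
    # phrases and sub-intent names are illustrative assumptions.
    SCOPE_PHRASES = {
        "across all teams": "all_teams",
        "in the us": "region_us",
        "my team": "my_team",
    }

    def extract_scope(utterance: str) -> str:
        """Naive phrase spotting to decide which sub-intent of Team_Performance applies."""
        text = utterance.lower()
        for phrase, scope in SCOPE_PHRASES.items():
            if phrase in text:
                return scope
        return "my_team"  # reasonable default when no entity phrase is found

    print(extract_scope("Can you show me performance across all teams?"))  # all_teams
    print(extract_scope("How is the team performing in the US?"))          # region_us
    print(extract_scope("How is my team doing?"))                          # my_team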

*Technically, there is another possibility where you could identify sub-intents by using multiple classifiers/intent services but that would require additional effort and configuration at the orchestration layer that I will not go into here.

Iterative Training & Testing Cycles – This is key to the Watson learning process. The sooner you can put your Watson deployment in front of real end users, the higher quality your deployment will be when you go live. Again, in some organizations this may be easier to achieve, such as when your end users are internal employees. It may be more challenging for organizations whose end users are external customers, but it is still achievable.

I would use the same number of questions per test session and craft the sessions around the same topics (to ensure the test questions are targeted against the use case intents). Once you’ve achieved satisfactory performance, you can move on to other topics. After your first test session, capture your results, including the percentage of questions that returned the proper intent; this becomes your baseline. Once you’ve got your baseline, thoroughly review the results and look for trends of questions that require improvement. Start with the most poorly performing intents, whether they are existing intents or new ones. Your calibration approach could involve refining intent groups, removing questions, adding questions, splitting intent groups, or adding new intent groups.
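If you want to automate part of each test session, here is a minimal Python sketch that replays a set of test questions against the Conversation service’s message API and reports the percentage that returned the expected intent. The endpoint URL, version date, workspace ID, credentials, test questions, and intent names are all placeholders/assumptions; check your own service credentials and the current API documentation before relying on it.

    # A minimal test-harness sketch against the Conversation service message API.
    # The endpoint URL, version date, workspace ID, and credentials are
    # placeholders -- take the real values from your Bluemix service credentials.
    import requests

    URL = ("https://gateway.watsonplatform.net/conversation/api/v1/"
           "workspaces/YOUR_WORKSPACE_ID/message?version=2017-05-26")
    AUTH = ("YOUR_USERNAME", "YOUR_PASSWORD")

    # Each test question is paired with the intent you expect back (hypothetical).
    test_set = [
        ("How is my team doing?", "Team_Performance"),
        ("I forgot my password", "Reset_Password"),
    ]

    hits = 0
    for question, expected in test_set:
        resp = requests.post(URL, auth=AUTH, json={"input": {"text": question}})
        resp.raise_for_status()
        intents = resp.json().get("intents", [])
        top = intents[0]["intent"] if intents else None
        if top == expected:
            hits += 1
        else:
            print(f"MISS: '{question}' -> {top} (expected {expected})")

    rate = hits / len(test_set) * 100
    print(f"Proper intent returned for {hits}/{len(test_set)} questions ({rate:.1f}%)")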

Conclusion

I hope this provides you with additional guidance to model, train, test, and calibrate your intents using the Watson Conversation service (or the Natural Language Classifier service). I will continue to update this blog post to share any new information I learn as we continue through the cognitive journey.