Introduction
Natural language generation lies at the core of generative dialogue systems and conversational agents. This paper studies two statistical generators based on Long Short-Term Memory (LSTM) neural networks. Both networks learn from unaligned data and jointly perform sentence planning and surface realization.
This is a promising new approach to natural language generation, since it eliminates the need for hard-coded heuristic rules and hence generalizes easily to new domains. Finally, post-processing is performed on the output of the neural network architecture to make the utterances more readable and natural.
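To make the post-processing stage concrete, below is a minimal sketch in Python of the kind of surface-level clean-up we have in mind; the specific rules (whitespace and punctuation normalization, sentence capitalization, final punctuation) are illustrative assumptions rather than our exact pipeline.

```python
import re

def postprocess(utterance: str) -> str:
    """Apply simple surface-level clean-up to a raw generated utterance.

    Hypothetical rules for illustration: collapse repeated whitespace,
    fix spacing before punctuation, capitalize sentence starts, and
    make sure the utterance ends with a period.
    """
    text = re.sub(r"\s+", " ", utterance).strip()       # collapse whitespace
    text = re.sub(r"\s+([.,!?])", r"\1", text)           # no space before punctuation
    # Capitalize the first letter of every sentence.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    sentences = [s[:1].upper() + s[1:] for s in sentences if s]
    text = " ".join(sentences)
    if not text.endswith((".", "!", "?")):                # ensure final punctuation
        text += "."
    return text

print(postprocess("there is a cheap restaurant  near the river ,  called the mill"))
# -> "There is a cheap restaurant near the river, called the mill."
```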
Approach
With the continuous effort to develop general artificial intelligence, a substantial amount of research has recently been done in the fields of natural language processing (NLP) and generation (NLG). This has given rise to popular digital personal assistants on smartphones, such as Siri or Cortana, as well as dedicated devices whose sole purpose is to offer such assistant services through interactive communication, such as Amazon Echo (powered by the Alexa assistant) or Google Home. The capabilities of these conversational agents are still fairly limited and lacking in various aspects, ranging from distinguishing the sentiment of human utterances to producing utterances with human-like coherence and naturalness.
In our work we focus on the NLG module in task-oriented dialogue systems, in particular on generation of textual utterances from structured meaning representations (MRs). An MR describes a single dialogue act in the form of a list of pieces of information that need to be conveyed to the other party in the dialogue (typically a human user). Each piece of information is represented by a slot-value pair, where the slot identifies the type of information and the value is the corresponding content (see image on the left). Dialogue acts (DAs) can be of different types, depending on the intent of the dialogue manager component. They can range from simple ones, such as a goodbye DA with no slots at all, to complex ones, such as an inform DA containing multiple slots with various types of values.
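As a concrete illustration, an MR can be given a simple textual encoding and parsed into its DA type and slot-value pairs. The encoding `da_type(slot1[value1], slot2[value2], ...)` and the helper below are assumptions made for this example, not necessarily the exact representation used in our data.

```python
import re
from typing import Dict, Tuple

def parse_mr(mr: str) -> Tuple[str, Dict[str, str]]:
    """Parse a textual MR into its dialogue-act type and slot-value pairs.

    Assumes a hypothetical encoding of the form
        da_type(slot1[value1], slot2[value2], ...)
    e.g. "inform(name[The Mill], food[Italian], area[riverside])".
    """
    da_type, _, body = mr.partition("(")
    slots = dict(re.findall(r"(\w+)\[([^\]]*)\]", body))
    return da_type.strip(), slots

da, slots = parse_mr("inform(name[The Mill], food[Italian], area[riverside])")
print(da)     # inform
print(slots)  # {'name': 'The Mill', 'food': 'Italian', 'area': 'riverside'}
```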
Key points:
- We have implemented an encoder-decoder LSTM model and augmented its capabilities with a variety of post-processing steps (a minimal sketch of such a model follows this list).
- The model is capable of generating reasonable utterances, but there is still considerable room for improvement.
- This is where the attention mechanism comes to the rescue (see latest project/publication).
- Another interesting task is to translate the entire training utterance into a corresponding slot-value representation. We hypothesize that this technique will yield better transitions between slot values from a given meaning representation and serve as an automatic method for aligning our currently unaligned data.
- We also plan on incorporating beam search into our decoder in order to efficiently identify the most probable output sequences (see the sketch after this list).
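As a reference for the first point above, here is a minimal sketch of an encoder-decoder LSTM in Keras. The vocabulary sizes, embedding and hidden dimensions, and training setup are placeholder assumptions and do not reflect our actual configuration.

```python
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense
from tensorflow.keras.models import Model

# Placeholder hyperparameters; the real values would be tuned on the data.
SRC_VOCAB, TGT_VOCAB, EMB_DIM, HID_DIM = 500, 2000, 64, 128

# Encoder: reads the MR token sequence and returns its final state.
enc_inputs = Input(shape=(None,), name="mr_tokens")
enc_emb = Embedding(SRC_VOCAB, EMB_DIM)(enc_inputs)
_, state_h, state_c = LSTM(HID_DIM, return_state=True)(enc_emb)

# Decoder: generates the utterance token by token, conditioned on the encoder state.
dec_inputs = Input(shape=(None,), name="utterance_tokens")
dec_emb = Embedding(TGT_VOCAB, EMB_DIM)(dec_inputs)
dec_outputs, _, _ = LSTM(HID_DIM, return_sequences=True, return_state=True)(
    dec_emb, initial_state=[state_h, state_c]
)
probs = Dense(TGT_VOCAB, activation="softmax")(dec_outputs)

model = Model([enc_inputs, dec_inputs], probs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```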
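And for the last point, a schematic version of beam search over a step-wise decoder. The `decode_step` callable is a hypothetical stand-in for the trained LSTM decoder; it is assumed to return candidate next tokens with their log-probabilities.

```python
import math
from typing import Callable, List, Tuple

def beam_search(decode_step: Callable, start_id: int, end_id: int,
                beam_width: int = 3, max_len: int = 30) -> List[int]:
    """Return the most probable output sequence under a step-wise decoder.

    `decode_step(prefix)` is assumed to return a list of (token_id, log_prob)
    pairs for the next token given the current prefix; it stands in for the
    trained LSTM decoder.
    """
    beams: List[Tuple[List[int], float]] = [([start_id], 0.0)]  # (sequence, score)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == end_id:                 # finished hypotheses are kept as-is
                candidates.append((seq, score))
                continue
            for tok, logp in decode_step(seq):    # expand with each candidate next token
                candidates.append((seq + [tok], score + logp))
        # Keep only the `beam_width` best partial hypotheses.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(seq[-1] == end_id for seq, _ in beams):
            break
    return beams[0][0]
```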