Amazon will release a conversational and knowledge data set of more than 4 million words

Amazon plans to make available a massive number of data samples targeting natural language processing research.

No, this isn’t an April Fools’ Day prank: Amazon plans to release a large set of data samples for natural language processing research. The Seattle company said today that in September 2019 it will release the Topical Chat data set, a corpus of crowdsourced human conversations provided to teams competing in the annual Alexa Prize Socialbot Grand Challenge.

The Topical Chat data set consists of more than 210,000 utterances, or over 4.1 million words, Amazon says, making it one of the largest public social conversation and knowledge data sets. Each conversation and conversation turn in the corpus is linked to the knowledge provided to crowd workers, which is collected from a range of “unstructured” and “loosely structured” text resources relating to a set of entities.