Problem

Given a fragment of typed text, predict the most likely next word.

Solution

N-grams

  • Deconstruct the text documents into n-grams of 2-6 words.
  • Treat the last word of each n-gram as the next_word and tally the frequency of the remaining words, the gram, as a group.
  • Sort and filter the tallies, keeping only the top entries required to cover >50% of all cases observed (see the sketch after this list).
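
A rough sketch of this tally-building step for trigrams, assuming the corpus lives in a tibble docs with a text column; the column names gram and next_word are illustrative, and interpreting the >50% cutoff per gram is my assumption, not the project's actual code:

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Hypothetical two-document corpus; the real project reads raw text files.
docs <- tibble(text = c("sometimes there is a match",
                        "sometimes there is not a match"))

trigram_tally <- docs %>%
  unnest_tokens(ngram, text, token = "ngrams", n = 3) %>%        # 3-word grams
  separate(ngram, into = c("w1", "w2", "next_word"), sep = " ") %>%
  unite(gram, w1, w2, sep = " ") %>%                             # first 2 words are the key
  count(gram, next_word, sort = TRUE) %>%
  group_by(gram) %>%
  arrange(desc(n), .by_group = TRUE) %>%
  mutate(coverage = cumsum(n) / sum(n)) %>%
  # keep just the top entries needed to cover >50% of observations per gram
  filter(lag(coverage, default = 0) < 0.5) %>%
  ungroup()
```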

Back-off

  • Use a Katz back-off model to work backwards from the most specific prediction gram (i.e. 5 words) to the least specific (i.e. 1 word).
  • For every gram that doesn’t produce a match in our lookup, remove the first word from the gram and search again (i.e. “sometimes there isn’t a match” -> “there isn’t a match”).
  • If we get all the way down to a single word and still can’t produce a guess from the lookup, fall back to the most common words in the lookup (see the sketch after this list).
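
The back-off loop might look like the following, assuming a single long-format lookup table like the tally above (columns gram, next_word, n); best_guess() here is a stand-in for illustration, not the project’s best_guessN():

```r
library(dplyr)

# Back off from the full input to shorter and shorter grams until a match
# is found; if nothing matches at any depth, return the most common words.
best_guess <- function(input, lookup, n_guesses = 3) {
  words <- strsplit(trimws(tolower(input)), "\\s+")[[1]]
  while (length(words) > 0) {
    key <- paste(words, collapse = " ")
    hits <- lookup %>% filter(gram == key) %>% arrange(desc(n))
    if (nrow(hits) > 0) return(head(hits$next_word, n_guesses))
    words <- words[-1]  # drop the leading word and search again
  }
  # no match even for a single word: most frequent next words overall
  lookup %>%
    count(next_word, wt = n, sort = TRUE) %>%
    head(n_guesses) %>%
    pull(next_word)
}
```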

Code

ngram_algorithim.R

Reads in raw data, cleans it with library(tidytext), and constructs a large list object, ngrams, holding the 1-5 word grams and the tallies of the most frequent next words. Also includes best_guessN(), the function that performs the back-off approximation for the user input.
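
The ngrams object and its on-disk form might be assembled along these lines; this is a guess at the structure rather than the actual script, build_tally() generalizes the trigram sketch above, and corpus.txt is a hypothetical file path:

```r
library(dplyr)
library(tidyr)
library(tidytext)

docs <- tibble(text = readLines("corpus.txt"))  # hypothetical raw data

# Tally next-word frequencies keyed by (n-1)-word grams.
build_tally <- function(docs, n) {
  docs %>%
    unnest_tokens(ngram, text, token = "ngrams", n = n) %>%
    filter(!is.na(ngram)) %>%
    separate(ngram, into = c(paste0("w", seq_len(n - 1)), "next_word"), sep = " ") %>%
    unite(gram, all_of(paste0("w", seq_len(n - 1))), sep = " ") %>%
    count(gram, next_word, sort = TRUE)
}

# One tally per gram length: 1-5 word grams come from 2-6 word n-grams.
ngrams <- lapply(2:6, build_tally, docs = docs)
names(ngrams) <- paste0("gram_len_", 1:5)
saveRDS(ngrams, "ngrams.rds")  # server.R reads this back with readRDS()
```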

ui.R

Simple sidebar layout, themed with “United”.
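
A minimal guess at the layout; the input/output IDs are illustrative, and I’m assuming the “United” theme comes from library(shinythemes):

```r
library(shiny)
library(shinythemes)

ui <- fluidPage(
  theme = shinytheme("united"),
  titlePanel("Text Prediction"),
  sidebarLayout(
    sidebarPanel(
      textInput("user_text", "Type a phrase:")
    ),
    mainPanel(
      plotOutput("guess_plot")  # bar plot of the predicted next words
    )
  )
)
```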

server.R

Uses readRDS() to read in the stored lookup list from ngram_algorithim.R. Then uses reactive objects to respond to user input and trigger best_guessN() to calculate the most probable next words. The output plot is in turn triggered to rebuild with library(ggplot2), using geom_col() + geom_label().
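
Put together, the server logic might look like this sketch; the IDs match the ui.R sketch above, and the assumption that best_guessN() returns a data frame with next_word and n columns is mine:

```r
library(shiny)
library(ggplot2)

ngrams <- readRDS("ngrams.rds")  # lookup list saved by ngram_algorithim.R

server <- function(input, output) {
  # re-run the back-off lookup whenever the text input changes
  guesses <- reactive({
    req(input$user_text)
    best_guessN(input$user_text, ngrams)
  })

  # rebuild the bar plot whenever guesses() invalidates
  output$guess_plot <- renderPlot({
    ggplot(guesses(), aes(reorder(next_word, n), n, label = next_word)) +
      geom_col() +
      geom_label() +
      coord_flip() +
      labs(x = NULL, y = "observed frequency")
  })
}
```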

Shiny

https://nathanday.shinyapps.io/TextPred_Shiny/


Built with Rmd. Hosted on GitHub. Maintained by me. Creative Commons License