## Problem

• Create a lightweight text prediction application that anticipates the next word in a phrase.

• Given three large (~200 MB) text files scraped from blogs, news sites and Twitter, construct a codebase for cleaning and crunching the text and showing the best guesses for the next word.

• Use n-grams and a Katz back-off model to estimate the next word based on observed frequencies.

## Solution

### N-grams

• Deconstruct the text documents into n-grams of 2 to 6 words in length.
• Key the last word of each n-gram as next_word, and tally how often it follows the remaining words, which are kept together as a group called gram.
• Sort and filter the tallies to keep the top entries required to cover >50% of all cases observed.
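
The three steps above can be sketched in base R. This is a minimal illustration, not the code in ngram_algorithim.R (which uses tidytext): the toy corpus, the helper names make_ngrams() and tally_ngrams(), and the layout of the resulting list are all assumptions made for the example.

```r
# Hypothetical toy corpus standing in for the blog, news and Twitter files.
corpus <- c("the cat sat on the mat",
            "the cat sat on the sofa",
            "the dog sat on the mat")

# Split one line of text into all of its n-grams of length n.
make_ngrams <- function(line, n) {
  words <- strsplit(tolower(line), "[^a-z']+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts, function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Tally (gram, next_word) pairs, then keep the most frequent rows needed
# to cover more than `coverage` of all observed cases.
tally_ngrams <- function(lines, n, coverage = 0.5) {
  grams <- unlist(lapply(lines, make_ngrams, n = n))
  if (length(grams) == 0) {
    return(data.frame(gram = character(0), next_word = character(0),
                      n = integer(0), stringsAsFactors = FALSE))
  }
  counts <- table(grams)
  full <- names(counts)
  tally <- data.frame(
    gram      = sub(" [^ ]+$", "", full),  # everything except the last word
    next_word = sub("^.* ", "", full),     # the last word only
    n         = as.integer(counts),
    stringsAsFactors = FALSE
  )
  # Most frequent first; ties broken alphabetically for a stable order.
  tally <- tally[order(-tally$n, full), ]
  cum_share <- cumsum(tally$n) / sum(tally$n)
  tally[seq_len(which(cum_share > coverage)[1]), ]
}

# One tally per n-gram length (2 to 6 words), named by gram length (1 to 5).
ngrams <- lapply(2:6, function(n) tally_ngrams(corpus, n))
names(ngrams) <- as.character(1:5)
```

On the toy corpus, `ngrams[["1"]]` holds the three most frequent single-word grams with their next words, which together cover just over half of the 15 bigrams observed.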

### Back-off

• Use a Katz back-off model to work backwards from the most specific prediction gram (i.e. 5 words) to the least specific (i.e. 1 word).
• For every gram that doesn’t produce a match from our lookup, remove the first word from the gram and search again (i.e. “sometimes there isn’t a match” -> “there isn’t a match”).
• If we get all the way back to a single word and still can’t produce a guess from our lookup, use the most common words in the lookup.
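
A minimal sketch of that back-off loop follows. It is not the project's best_guessN(); the function name backoff_guess(), the shape of the lookup (a named list of next-word tallies) and the toy values are assumptions for illustration. It also only performs the word-dropping search described above, whereas a full Katz model additionally discounts probability mass when backing off.

```r
# Hypothetical lookup: each gram maps to a named vector of next-word tallies.
lookup <- list(
  "isn't a match" = c(yet = 3, here = 1),
  "match"         = c(point = 5, made = 2)
)

# Hypothetical fallback: the most common words overall.
most_common <- c(the = 10, to = 8, and = 7)

backoff_guess <- function(text, lookup, fallback, max_words = 5) {
  words <- strsplit(tolower(trimws(text)), "\\s+")[[1]]
  words <- words[nzchar(words)]
  # Start from the most specific gram: the last `max_words` words typed.
  words <- utils::tail(words, max_words)
  while (length(words) > 0) {
    gram <- paste(words, collapse = " ")
    hit <- lookup[[gram]]
    if (!is.null(hit)) return(list(gram = gram, guesses = hit))
    words <- words[-1]  # no match: drop the first word and search again
  }
  # Nothing matched, even for a single word: use the most common words.
  list(gram = "", guesses = fallback)
}

result <- backoff_guess("sometimes there isn't a match", lookup, most_common)
```

Here the five-word and four-word grams miss, so the search settles on "isn't a match". Input with no known words at all falls through to `most_common`.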

## Code

### ngram_algorithim.R

Reads in the raw data, cleans it using library(tidytext) and constructs a large list object, ngrams, holding the 1-5 word grams and the tallies of their most frequent next words. Also includes best_guessN(), the function that performs the back-off approximation for the user input.

### ui.R

Simple sidebar layout, themed with “United”.

### server.R

Reads in the stored lookup list from ngram_algorithim.R as an RDS file. It then uses reactive objects to respond to user input and trigger best_guessN() to calculate the most probable next words. The output plot is in turn triggered to build with library(ggplot2) using geom_col() + geom_label().

## Shiny

• The Shiny server waits until the user starts typing before doing any computing. All calculations are reactive, so no button presses are required (although it takes a second to complete all of the output).

• The sidebar panel holds the raw text input box and displays the current parsed input text being used for lookup.

• The main panel shows the most popular next words and their relative percentage of occurrences based on the current parsed input text.
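
As a rough sketch of how the relative percentages could be derived from the tallies for a matched gram: the helper name to_percent() and the toy tallies are hypothetical, and taking the share over all tallies for the gram (rather than only the words displayed) is an assumption.

```r
# Hypothetical next-word tallies for the current parsed input.
tallies <- c(made = 3, point = 5, up = 2)

# Convert tallies to percentages and keep the top_n most popular words.
to_percent <- function(tallies, top_n = 3) {
  top <- utils::head(sort(tallies, decreasing = TRUE), top_n)
  round(100 * top / sum(tallies), 1)
}

shares <- to_percent(tallies)
```

For these tallies, `shares` works out to 50, 30 and 20 percent for "point", "made" and "up" respectively.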

https://nathanday.shinyapps.io/TextPred_Shiny/

Built with Rmd. Hosted on GitHub. Maintained by me.