• Deconstruct the text documents into n-grams between 2 and 6 words in length.
  • Key the last word of each n-gram as the next_word and tally the frequency of the remaining words, the gram, as a group.
  • Sort and filter the tallies, keeping only the top entries required to cover >50% of all cases observed.
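The tallying steps above can be sketched with tidytext and dplyr; the function name and column names here are illustrative, not the actual code:

```r
library(tidytext)
library(dplyr)
library(tidyr)

# Build a next-word tally for one n-gram size (names are illustrative).
build_ngram_tally <- function(text_df, n) {
  text_df %>%
    unnest_tokens(ngram, text, token = "ngrams", n = n) %>%
    # split the last word off as next_word; the rest is the lookup gram
    separate(ngram, into = c("gram", "next_word"), sep = " (?=[^ ]+$)") %>%
    count(gram, next_word, sort = TRUE) %>%
    # keep only the top entries needed to cover >50% of observations per gram
    group_by(gram) %>%
    arrange(desc(n), .by_group = TRUE) %>%
    filter(lag(cumsum(n) / sum(n), default = 0) < 0.5) %>%
    ungroup()
}
```

The `filter(lag(...) < 0.5)` line keeps rows until cumulative frequency first crosses 50%, including the row that crosses it.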


  • Use a Katz back-off model to work backwards from the most specific prediction gram (i.e. 5 words) to the least specific (i.e. 1 word).
  • For every gram that doesn’t produce a match from our lookup, remove the first word from the gram and search again (e.g. “sometimes there isn’t a match” -> “there isn’t a match”).
  • If we get all the way back to a single word and still can’t produce a guess from our lookup, fall back to the most common words in the lookup.
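A hedged sketch of that back-off loop, assuming `ngrams` is a list of tally data frames (columns `gram`, `next_word`, `n`) indexed by gram length, plus a table of most common words under a hypothetical `"top_words"` key:

```r
# Not the actual best_guessN() implementation, just the back-off idea.
best_guess_sketch <- function(phrase, ngrams, top = 3) {
  words <- strsplit(trimws(tolower(phrase)), "\\s+")[[1]]
  for (len in seq(min(length(words), 5), 1)) {       # 5-word gram down to 1
    gram <- paste(tail(words, len), collapse = " ")
    hits <- ngrams[[len]][ngrams[[len]]$gram == gram, ]
    if (nrow(hits) > 0)                              # match found: stop backing off
      return(head(hits[order(-hits$n), "next_word"], top))
  }
  head(ngrams[["top_words"]]$word, top)              # final fallback
}
```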



Reads in the raw data, cleans it using library(tidytext), and constructs a large list object, ngrams, holding the 1-5 word grams with tallies of the most frequent next words. Also includes best_guessN(), the function that performs the back-off approximation on the user input.
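The overall flow of the script might look like the following outline; the file paths and the `count_next_words()` helper are placeholders, not the real code:

```r
library(readr)

raw     <- read_lines("data/corpus.txt")            # hypothetical path
text_df <- tibble::tibble(text = tolower(raw))      # basic cleaning step

# one tally per context length, assuming a helper that counts
# next-word frequencies for n-grams of length n + 1
ngrams <- lapply(1:5, function(n) count_next_words(text_df, n))

saveRDS(ngrams, "ngrams.rds")                       # consumed by the Shiny app
```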


Simple sidebar layout, themed with “United”.
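A minimal sketch of that layout, assuming the theme comes from the shinythemes package and with illustrative input/output IDs:

```r
library(shiny)
library(shinythemes)

ui <- fluidPage(
  theme = shinytheme("united"),                  # the "United" theme
  sidebarLayout(
    sidebarPanel(textInput("phrase", "Type a phrase:")),
    mainPanel(plotOutput("predictions"))
  )
)
```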


Uses readRDS() to read in the stored lookup list from ngram_algorithm.R, then uses reactive objects to respond to user input and trigger best_guessN() to calculate the most probable next words. The output plot is in turn triggered to rebuild with library(ggplot2), using geom_col() + geom_label().
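That reactive chain can be sketched as follows; the input ID, file name, and the columns of the best_guessN() result (`next_word`, `n`) are assumptions:

```r
library(shiny)
library(ggplot2)

server <- function(input, output) {
  ngrams <- readRDS("ngrams.rds")                 # lookup list from ngram_algorithm.R

  # recomputes whenever the user edits the input phrase
  guesses <- reactive(best_guessN(input$phrase, ngrams))

  # the plot invalidates and rebuilds whenever guesses() changes
  output$predictions <- renderPlot({
    ggplot(guesses(), aes(x = next_word, y = n)) +
      geom_col() +
      geom_label(aes(label = next_word))
  })
}
```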


Built with R Markdown. Hosted on GitHub. Maintained by me. Creative Commons License.