With so many online reviews spread across social media and e-commerce sites, it is hard for companies to keep track of their online reputation. Businesses can benefit immensely if they understand the general trends in what their customers are saying online. A common method for quickly surfacing the topics being discussed in a corpus of text is Latent Dirichlet Allocation (LDA).
Latent Dirichlet Allocation:
LDA assumes each document consists of a mixture of topics, and each topic consists of a mixture of words. It then approximates the probability distribution of topics in a given document and of words in a given topic. For example, a single review might be 70% about 'shipping' and 30% about 'print quality', while the 'shipping' topic assigns high probability to words like 'box', 'package', and 'arrive'.
LDA is a type of Bayesian inference model. It assumes that topics are generated before documents, and infers the topics that could have generated a corpus of documents (here, each review is a document). The generative process for each document w in corpus D is as follows:
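Following the original formulation by Blei, Ng, and Jordan (2003):
1. Choose the document length $N \sim Poisson(\xi)$.
2. Choose the document's topic mixture $\theta \sim Dir(\alpha)$.
3. For each of the N words $w_n$:
(a) Choose a topic $z_n \sim Multinomial(\theta)$.
(b) Choose the word $w_n$ from $p(w_n \mid z_n, \beta)$, a multinomial probability conditioned on the topic $z_n$.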
Learning:
A common learning algorithm for LDA is collapsed Gibbs sampling, which repeatedly resamples the topic assignment of each word conditioned on all the other assignments until the assignments stabilize. (Note that gensim's LdaModel, used later in this post, instead fits LDA with online variational Bayes, but the goal of inferring the topic and word distributions is the same.)
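Following Griffiths and Steyvers (2004), the conditional distribution that each resampling step draws from is:

$ P(z_i = k \mid \mathbf{z}_{-i}, \mathbf{w}) \propto (n_{d,k}^{-i} + \alpha) \cdot \frac{n_{k,w_i}^{-i} + \beta}{n_{k}^{-i} + V\beta} $

where $n_{d,k}^{-i}$ is the number of words in document d assigned to topic k, $n_{k,w_i}^{-i}$ is the number of times word $w_i$ is assigned to topic k across the corpus, $n_{k}^{-i}$ is the total number of words assigned to topic k, V is the vocabulary size, and the $-i$ superscript indicates that the current word is excluded from the counts.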
The dimensionality K of the Dirichlet distribution (i.e., the number of topics) is assumed to be known and fixed.
We will try LDA and explore techniques to optimize interpretability using Amazon Office Product reviews. The dataset can be found here: http://jmcauley.ucsd.edu/data/amazon/
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
import spacy
import nltk
import re
import string
import pandas as pd
import numpy as np
from stop_word_list import *
from clean_text import *
import gensim
from gensim import corpora
import pyLDAvis.gensim
import matplotlib.pyplot as plt
import json
%matplotlib inline
# Load the data
data = []
with open('reviews_Office_Products_5.json', 'r') as f:
    for line in f:
        data.append(json.loads(line))
df = pd.DataFrame(data)
df.head()
# Extract only reviews text
reviews = pd.DataFrame(df.reviewText)
clean_reviews = clean_all(reviews, 'reviewText')
clean_reviews.head()
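The clean_all function comes from the local clean_text module, which is not shown here. A minimal sketch of the kind of cleaning it presumably performs, assuming lemmatization with spaCy and removal of non-letter characters (the actual helper may differ):

# Hypothetical sketch of a clean_all-style helper; the real clean_text
# module is not shown, and its exact steps may differ.
import re
import spacy
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

def clean_all_sketch(df, col):
    out = df.copy()
    # Lemmatize with spaCy (in spaCy 2.x, pronoun lemmas come back as '-PRON-')
    out[col] = out[col].fillna('').map(lambda t: ' '.join(tok.lemma_ for tok in nlp(t)))
    # Strip anything that is not a letter or whitespace; '-PRON-' becomes the
    # bare token 'PRON', which the n-gram filters below screen out
    out[col] = out[col].map(lambda t: re.sub(r'[^A-Za-z\s]', '', t))
    out[col] = out[col].map(lambda t: re.sub(r'\s+', ' ', t).strip())
    return out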
We want to identify bigrams and trigrams so we can concatenate them and treat them as single words. Bigrams are two-word phrases, e.g. 'social media', where 'social' and 'media' are more likely to co-occur than to appear separately. Likewise, trigrams are three-word phrases that tend to co-occur, e.g. 'Procter and Gamble'. We use the pointwise mutual information (PMI) score to identify significant bigrams and trigrams worth concatenating. We also filter n-grams by part-of-speech pattern, keeping (noun/adj, noun) bigrams and (noun/adj, any, noun/adj) trigrams, since these structures typically mark noun phrases. This helps the LDA model form cleaner topic clusters.
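For reference, the PMI of a word pair measures how much more often the two words co-occur than they would if they were independent:

$ PMI(x, y) = \log \frac{p(x, y)}{p(x)\,p(y)} $

High-PMI pairs such as 'social media' co-occur far more often than chance would predict, while pairs of words that are merely frequent on their own score low.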
bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = nltk.collocations.BigramCollocationFinder.from_documents([comment.split() for comment in clean_reviews.reviewText])
# Filter only those that occur at least 50 times
finder.apply_freq_filter(50)
bigram_scores = finder.score_ngrams(bigram_measures.pmi)
trigram_measures = nltk.collocations.TrigramAssocMeasures()
finder = nltk.collocations.TrigramCollocationFinder.from_documents([comment.split() for comment in clean_reviews.reviewText])
# Filter only those that occur at least 50 times
finder.apply_freq_filter(50)
trigram_scores = finder.score_ngrams(trigram_measures.pmi)
bigram_pmi = pd.DataFrame(bigram_scores)
bigram_pmi.columns = ['bigram', 'pmi']
bigram_pmi.sort_values(by='pmi', axis = 0, ascending = False, inplace = True)
trigram_pmi = pd.DataFrame(trigram_scores)
trigram_pmi.columns = ['trigram', 'pmi']
trigram_pmi.sort_values(by='pmi', axis = 0, ascending = False, inplace = True)
# Filter for bigrams with only noun-type structures
def bigram_filter(bigram):
    tag = nltk.pos_tag(bigram)
    # Keep only (noun/adj, noun) patterns
    if tag[0][1] not in ['JJ', 'NN'] and tag[1][1] not in ['NN']:
        return False
    # Drop bigrams containing stop words
    if bigram[0] in stop_word_list or bigram[1] in stop_word_list:
        return False
    # Drop leftover contraction fragments such as a bare 'n' or 't'
    if 'n' in bigram or 't' in bigram:
        return False
    # Drop spaCy's pronoun placeholder
    if 'PRON' in bigram:
        return False
    return True
# Filter for trigrams with only noun-type structures
def trigram_filter(trigram):
    tag = nltk.pos_tag(trigram)
    # Keep only (noun/adj, any, noun/adj) patterns: check the first and last tokens
    if tag[0][1] not in ['JJ', 'NN'] and tag[2][1] not in ['JJ', 'NN']:
        return False
    # Drop trigrams containing stop words
    if trigram[0] in stop_word_list or trigram[1] in stop_word_list or trigram[-1] in stop_word_list:
        return False
    # Drop leftover contraction fragments such as a bare 'n' or 't'
    if 'n' in trigram or 't' in trigram:
        return False
    # Drop spaCy's pronoun placeholder
    if 'PRON' in trigram:
        return False
    return True
# Set the PMI threshold to whatever makes sense - eyeball the ranked list and
# pick the point where the n-grams stop making sense.
# Here we keep the top 500 noun-like n-grams with PMI above 5.
filtered_bigram = bigram_pmi[bigram_pmi.apply(
    lambda row: bigram_filter(row['bigram']) and row.pmi > 5, axis=1)][:500]
filtered_trigram = trigram_pmi[trigram_pmi.apply(
    lambda row: trigram_filter(row['trigram']) and row.pmi > 5, axis=1)][:500]
# Keep n-grams in which at least one word is longer than 2 characters
bigrams = [' '.join(x) for x in filtered_bigram.bigram.values
           if len(x[0]) > 2 or len(x[1]) > 2]
trigrams = [' '.join(x) for x in filtered_trigram.trigram.values
            if len(x[0]) > 2 or len(x[1]) > 2 or len(x[2]) > 2]
# examples of bigrams
bigrams[:10]
# examples of trigrams
trigrams[:10]
# Concatenate n-grams into single tokens
def replace_ngram(x):
    # Replace trigrams first so that bigram replacement cannot split them
    for gram in trigrams:
        x = x.replace(gram, '_'.join(gram.split()))
    for gram in bigrams:
        x = x.replace(gram, '_'.join(gram.split()))
    return x
reviews_w_ngrams = clean_reviews.copy()
reviews_w_ngrams.reviewText = reviews_w_ngrams.reviewText.map(lambda x: replace_ngram(x))
# Tokenize reviews, remove stop words, remove names, and remove words with 2 or fewer characters
reviews_w_ngrams = reviews_w_ngrams.reviewText.map(
    lambda x: [word for word in x.split()
               if word not in stop_word_list
               and word not in english_names
               and len(word) > 2])
reviews_w_ngrams.head()
Nouns are the most likely indicators of a topic. For example, in the sentence 'The store is nice', we know the sentence is about the 'store'; the other words provide context and commentary about that topic. Filtering for nouns therefore keeps the words that are most interpretable in the topic model.
# Filter for only nouns
def noun_only(x):
    pos_comment = nltk.pos_tag(x)
    filtered = [word[0] for word in pos_comment if word[1] in ['NN']]
    # To filter for both nouns and verbs instead:
    # filtered = [word[0] for word in pos_comment
    #             if word[1] in ['NN', 'VB', 'VBD', 'VBG', 'VBN', 'VBZ']]
    return filtered
final_reviews = reviews_w_ngrams.map(noun_only)
dictionary = corpora.Dictionary(final_reviews)
doc_term_matrix = [dictionary.doc2bow(doc) for doc in final_reviews]
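For reference, doc2bow converts each tokenized review into a sparse bag-of-words: a list of (token_id, count) pairs over the fitted dictionary. A quick illustrative check (the ids shown in the comment are made up; actual ids depend on the corpus):

# Illustrative only: token ids depend on the fitted dictionary
example_bow = dictionary.doc2bow(['printer', 'paper', 'paper'])
# e.g. [(87, 2), (1042, 1)], meaning 'paper' appears twice and 'printer' once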
LDA requires that we specify the number of topics that exist in a corpus of text. There are several common measures that can be optimized, such as predictive likelihood, perplexity, and coherence. Much of the literature indicates that maximizing coherence, particularly the measure named Cv (https://svn.aksw.org/papers/2015/WSDM_Topic_Evaluation/public.pdf), leads to better human interpretability. This measure assesses how interpretable a topic is given the set of words it generates, so it is the measure we will optimize.
Since eval_every controls how often the perplexity metric is estimated, we can set it to None to save time, as we will use a different metric, Cv.
Try 5 to 24 topics:
coherence = []
for k in range(5, 25):
    print('Round: ' + str(k))
    Lda = gensim.models.ldamodel.LdaModel
    ldamodel = Lda(doc_term_matrix, num_topics=k, id2word=dictionary, passes=40,
                   iterations=200, chunksize=10000, eval_every=None)
    cm = gensim.models.coherencemodel.CoherenceModel(model=ldamodel, texts=final_reviews,
                                                     dictionary=dictionary, coherence='c_v')
    coherence.append((k, cm.get_coherence()))
x_val = [x[0] for x in coherence]
y_val = [x[1] for x in coherence]
plt.plot(x_val,y_val)
plt.scatter(x_val,y_val)
plt.title('Number of Topics vs. Coherence')
plt.xlabel('Number of Topics')
plt.ylabel('Coherence')
plt.xticks(x_val)
plt.show()
Coherence stops improving significantly after 15 topics. The best number of topics is not always where Cv peaks, so it is worth trying a couple of candidates and comparing the results; we'll try 15 and 23 here. Adding topics can help reveal further subtopics, but if the same words start to appear across multiple topics, the number of topics is too high.
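If you want a programmatic shortlist rather than eyeballing the plot, the k with the highest Cv can be read straight off the coherence list:

# k with the highest coherence score (a starting point, not the final answer)
best_k = max(coherence, key=lambda pair: pair[1])[0]
print(best_k)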
Lda = gensim.models.ldamodel.LdaModel
ldamodel = Lda(doc_term_matrix, num_topics=15, id2word=dictionary, passes=40,
               iterations=200, chunksize=10000, eval_every=None, random_state=0)
Lda2 = gensim.models.ldamodel.LdaModel
ldamodel2 = Lda2(doc_term_matrix, num_topics=23, id2word=dictionary, passes=40,
                 iterations=200, chunksize=10000, eval_every=None, random_state=0)
Some explanation of the parameters that can be tuned:
passes: the number of full passes through the entire corpus during training.
iterations: the maximum number of iterations over each document when inferring its topic distribution.
chunksize: the number of documents processed in each training chunk.
random_state: fixes the random seed so that results are reproducible.
Further explanation of the parameters: https://radimrehurek.com/gensim/models/ldamodel.html
# Show the initial topics (before relevance reranking) for the 15-topic model
ldamodel.show_topics(15, num_words=10, formatted=False)
# Show the initial topics (before relevance reranking) for the 23-topic model
ldamodel2.show_topics(23, num_words=10, formatted=False)
23 topics yielded clearer results, so we'll go with this...
Sometimes words rank highly for a given topic simply because they are globally frequent across the corpus. The relevance score helps prioritize terms that belong more exclusively to a given topic, which can increase interpretability even further. The relevance of term w to topic k is defined as:
$ r(w, k \mid \lambda) = \lambda \log(\phi_{kw}) + (1 - \lambda) \log\left(\frac{\phi_{kw}}{p_{w}}\right) $
where $\phi_{kw}$ is the probability of term w in topic k and $\frac{\phi_{kw}}{p_{w}}$ is the lift: the ratio of the term's probability within the topic to its marginal probability $p_{w}$ across the corpus (this helps discard globally frequent terms). A lower lambda gives more weight to the second term, and therefore to topic exclusivity. We can use Python's pyLDAvis to explore this: as lambda is lowered in the interface below, the terms ranked highest for each topic shift toward words that are more exclusive to that topic.
Paper behind this tool: https://nlp.stanford.edu/events/illvi2014/papers/sievert-illvi2014.pdf
topic_data = pyLDAvis.gensim.prepare(ldamodel2, doc_term_matrix, dictionary, mds = 'pcoa')
pyLDAvis.display(topic_data)
The pyLDAvis tool also gives two other important pieces of information. The circles represent the topics, and the distance between circles visualizes how related the topics are to each other; the plot above shows that our topics are quite distinct. The dimensionality reduction method can be chosen via the mds parameter (e.g. PCoA, MMDS, or t-SNE); the example above uses PCoA. Additionally, the size of each circle represents how prevalent that topic is across the corpus of reviews. Topic 1, for example, is the most prevalent, constituting 17.1% of the tokens.
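The visualization can also be saved as a standalone HTML file to share outside a notebook (the filename here is just an example):

# Write the interactive visualization to a shareable HTML file
pyLDAvis.save_html(topic_data, 'lda_office_products.html')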
To extract the words for a given lambda:
all_topics = {}
num_terms = 10  # Adjust number of words to represent each topic
lambd = 0.6  # Adjust this accordingly based on tuning above
for i in range(1, 24):  # Adjust this to reflect the number of topics in the final model
    topic = topic_data.topic_info[topic_data.topic_info.Category == 'Topic' + str(i)].copy()
    topic['relevance'] = topic['loglift'] * (1 - lambd) + topic['logprob'] * lambd
    all_topics['Topic ' + str(i)] = topic.sort_values(by='relevance', ascending=False).Term[:num_terms].values
pd.DataFrame(all_topics).T
We can see here that most topics are quite clear: they cluster around the different products being talked about in the reviews, with several topics mapping onto individual product categories.
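To connect this back to tracking what customers talk about, each review can also be tagged with its dominant topic. A small sketch using gensim's get_document_topics (the dominant_topic helper and column name are just examples):

# Tag each review with its highest-probability topic
def dominant_topic(bow):
    # get_document_topics returns (topic_id, probability) pairs
    topics = ldamodel2.get_document_topics(bow, minimum_probability=0.0)
    return max(topics, key=lambda pair: pair[1])[0]

df['dominant_topic'] = [dominant_topic(bow) for bow in doc_term_matrix]
df[['reviewText', 'dominant_topic']].head()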
Other models:
https://www.cs.cmu.edu/~epxing/Class/10708-15/slides/LDA_SC.pdf
https://en.wikipedia.org/wiki/Hierarchical_Dirichlet_process