Optimizing LDA Topic Model for Interpretability

With so many online reviews across many social media websites, it is hard for companies to keep track of their online reputation. Businesses can benefit immensely if they can understand general trends of what their customers are talking about online. A common method to quickly understand trends in topics being discussed in a corpus of text is Latent Dirichlet Allocation (LDA).

Latent Dirichlet Allocation:

LDA assumes each document consists of a combination of topics, and each topic consists of a combination of words. It then approximates probability distributions of topics in a given document and of words in a given topic. For example:

  • Document 1: Topic1 = 0.33, Topic2 = 0.33, Topic3 = 0.33
  • Topic 1: Product = 0.39, Payment = 0.32, Store = 0.29

LDA is a type of Bayesian inference model. It assumes that topics are generated before documents, and infers the topics that could have generated a corpus of documents (here, one review = one document). The generative process for each document w in corpus D is as follows (a minimal sketch is given after the steps):

  1. Choose $ N \sim \mathrm{Poisson}(\xi) $ to represent the document length distribution (other distributions can be used)
  2. Choose $ \theta \sim \mathrm{Dir}(\alpha) $, the parameter vector of a multinomial distribution representing the mixture of K topics in a document
    • For example, a document may consist of 1/3 product and 2/3 store
  3. For each of N words $ w_{n}$:
    • Choose a topic from the distribution $ z_{n} \sim \mathrm{Multinomial}(\theta) $
    • Choose a word $ w_{n} $ from $ p(w_{n}|z_{n},\beta) $, a multinomial probability conditioned on the topic $ z_{n} $
      • E.g., if the food topic is selected, generate the word 'broccoli' with 30% probability and 'banana' with 15% probability
      • Repeating this for all N words generates the document, e.g. if the length is 5, a set of 5 words drawn from the chosen topics' word distributions, such as "broccoli eat food animal meat"
  4. Find set of topics that are most likely to have generated the collection of documents
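To make this generative story concrete, here is a minimal numpy sketch. The topic count, vocabulary, and prior values below are made up purely for illustration and are not the settings used later in this notebook.

import numpy as np

rng = np.random.default_rng(0)

K = 3                                                    # number of topics (assumed fixed)
vocab = ['broccoli', 'banana', 'meat', 'store', 'payment']
alpha = np.full(K, 0.5)                                  # Dirichlet prior over topic mixtures
beta = rng.dirichlet(np.full(len(vocab), 0.5), size=K)   # word distribution for each topic

def generate_document(mean_length=5):
    n_words = max(1, rng.poisson(mean_length))           # 1. document length ~ Poisson
    theta = rng.dirichlet(alpha)                         # 2. topic mixture ~ Dir(alpha)
    words = []
    for _ in range(n_words):
        z = rng.choice(K, p=theta)                       # 3a. draw a topic for this word
        w = rng.choice(len(vocab), p=beta[z])            # 3b. draw a word given that topic
        words.append(vocab[w])
    return words

print(generate_document())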

Learning:

The algorithm uses collapsed Gibbs sampling (a toy sketch follows the steps below):

  1. Given a set of documents, go through each document, and randomly assign each word in the document to one of the K topics
    • This gives random assignment of topic representations of all the documents and word distributions of all the topics
    • Because this is random, it will not be good
  2. To improve, for each document D, go through each word in D:
    • For each topic t, compute:
      • $ p(topic_t | document_d)$ = proportion of words in document d that are assigned to topic t
      • $ p(word_w | topic_t)$ = proportion of assignments to topic t over all documents that come from word w (how many w in all documents' words are assigned to t)
      • Reassign with a new topic, where we choose topic t with probability $p(topic_t|document_d)*p(word_w|topic_t)$, which is essentially the probability that topic t generated word w
    • Update assignment of current word in the loop assuming topic assignment distributions for the rest of the corpus are correct
  3. Keep iterating until assignments reach a steady state
    • Use these assignments to estimate the topic mixture of each document (% of words in that document assigned to each topic) and the words associated with each topic (% of words assigned to each topic overall)
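Below is a toy collapsed Gibbs sampler on a made-up three-document corpus, just to illustrate the counting and resampling steps above. Note that the gensim LdaModel used later in this notebook is actually trained with online variational Bayes rather than Gibbs sampling.

import numpy as np

rng = np.random.default_rng(0)

# Made-up mini corpus; the real notebook uses the Amazon reviews
docs = [['paper', 'printer', 'ink'],
        ['chair', 'desk', 'office'],
        ['ink', 'printer', 'paper', 'desk']]
vocab = sorted({w for d in docs for w in d})
w2id = {w: i for i, w in enumerate(vocab)}
K, V, alpha, eta = 2, len(vocab), 0.1, 0.01      # topics, vocab size, Dirichlet priors

# Step 1: random topic assignment for every word, plus count tables
z = [[int(rng.integers(K)) for _ in doc] for doc in docs]
ndk = np.zeros((len(docs), K))   # words in doc d assigned to topic k
nkw = np.zeros((K, V))           # times word w is assigned to topic k across all docs
nk = np.zeros(K)                 # total words assigned to topic k
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        k, wid = z[d][i], w2id[w]
        ndk[d, k] += 1; nkw[k, wid] += 1; nk[k] += 1

# Steps 2-3: resample each word's topic given all other assignments
for _ in range(200):
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k, wid = z[d][i], w2id[w]
            ndk[d, k] -= 1; nkw[k, wid] -= 1; nk[k] -= 1              # remove current assignment
            p = (ndk[d] + alpha) * (nkw[:, wid] + eta) / (nk + V * eta)  # ~ p(topic|doc) * p(word|topic)
            k = int(rng.choice(K, p=p / p.sum()))                     # reassign topic
            z[d][i] = k
            ndk[d, k] += 1; nkw[k, wid] += 1; nk[k] += 1

# Estimated word distribution per topic (rows sum to 1)
print((nkw + eta) / (nkw + eta).sum(axis=1, keepdims=True))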

The dimensionality K of Dirichlet distribution (aka # of topics) is assumed to be known and fixed.

We will apply LDA to Amazon Office Products reviews and explore techniques to optimize interpretability. The dataset can be found here: http://jmcauley.ucsd.edu/data/amazon/

In [2]:
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning) 
In [1]:
import spacy
import nltk
import re
import string
import pandas as pd
import numpy as np
from stop_word_list import *
from clean_text import *
import gensim
from gensim import corpora
import pyLDAvis.gensim
import matplotlib.pyplot as plt
import json
%matplotlib inline

Load and Clean Text

In [3]:
# Load the data
data=[]
for line in open('reviews_Office_Products_5.json', 'r'):
    data.append(json.loads(line))
In [4]:
df = pd.DataFrame(data)
In [5]:
df.head()
Out[5]:
reviewerID asin reviewerName helpful reviewText overall summary unixReviewTime reviewTime
0 A32T2H8150OJLU B00000JBLH ARH [3, 4] I bought my first HP12C in about 1984 or so, a... 5.0 A solid performer, and long time friend 1094169600 09 3, 2004
1 A3MAFS04ZABRGO B00000JBLH Let it Be "Alan" [7, 9] WHY THIS BELATED REVIEW? I feel very obliged t... 5.0 Price of GOLD is up, so don't bury the golden ... 1197676800 12 15, 2007
2 A1F1A0QQP2XVH5 B00000JBLH Mark B [3, 3] I have an HP 48GX that has been kicking for mo... 2.0 Good functionality, but not durable like old HPs 1293840000 01 1, 2011
3 A49R5DBXXQDE5 B00000JBLH R. D Johnson [7, 8] I've started doing more finance stuff recently... 5.0 One of the last of an almost extinct species 1145404800 04 19, 2006
4 A2XRMQA6PJ5ZJ8 B00000JBLH Roger J. Buffington [0, 0] For simple calculations and discounted cash fl... 5.0 Still the best 1375574400 08 4, 2013
In [6]:
# Extract only reviews text
reviews = pd.DataFrame(df.reviewText)
In [7]:
clean_reviews = clean_all(reviews, 'reviewText')
In [9]:
clean_reviews.head()
Out[9]:
reviewText
0 b i buy PRON first hp c in about or so and PRO...
1 b why this belate review i feel very obliged t...
2 b i have an hp gx that have be kick for more t...
3 b i ve start do more finance stuff recently an...
4 b for simple calculation and discount cash flo...

Form Bigrams & Trigrams

We want to identify bigrams and trigrams so we can concatenate them and treat each one as a single token. Bigrams are two-word phrases, e.g. 'social media', where 'social' and 'media' are more likely to co-occur than to appear separately. Likewise, trigrams are three-word phrases that tend to co-occur, e.g. 'Procter and Gamble'. We use the Pointwise Mutual Information (PMI) score to identify significant bigrams and trigrams to concatenate. We also keep only n-grams matching the part-of-speech patterns (noun/adj, noun) and (noun/adj, any, noun/adj), since these are common structures for noun phrases. This helps the LDA model form cleaner topic clusters.
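For reference, the PMI score returned by the nltk collocation finders below is the standard pointwise mutual information (NLTK reports it in log base 2):

$ PMI(w_1, w_2) = \log \frac{p(w_1, w_2)}{p(w_1)\, p(w_2)} $

where the probabilities are estimated from corpus counts. A high PMI means the pair co-occurs far more often than it would if the two words were independent.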

In [10]:
bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = nltk.collocations.BigramCollocationFinder.from_documents([comment.split() for comment in clean_reviews.reviewText])
# Filter only those that occur at least 50 times
finder.apply_freq_filter(50)
bigram_scores = finder.score_ngrams(bigram_measures.pmi)
In [11]:
trigram_measures = nltk.collocations.TrigramAssocMeasures()
finder = nltk.collocations.TrigramCollocationFinder.from_documents([comment.split() for comment in clean_reviews.reviewText])
# Filter only those that occur at least 50 times
finder.apply_freq_filter(50)
trigram_scores = finder.score_ngrams(trigram_measures.pmi)
In [12]:
bigram_pmi = pd.DataFrame(bigram_scores)
bigram_pmi.columns = ['bigram', 'pmi']
bigram_pmi.sort_values(by='pmi', axis = 0, ascending = False, inplace = True)
In [13]:
trigram_pmi = pd.DataFrame(trigram_scores)
trigram_pmi.columns = ['trigram', 'pmi']
trigram_pmi.sort_values(by='pmi', axis = 0, ascending = False, inplace = True)
In [47]:
# Filter for bigrams with only noun-type structures
def bigram_filter(bigram):
    tag = nltk.pos_tag(bigram)
    if tag[0][1] not in ['JJ', 'NN'] and tag[1][1] not in ['NN']:
        return False
    if bigram[0] in stop_word_list or bigram[1] in stop_word_list:
        return False
    if 'n' in bigram or 't' in bigram:
        return False
    if 'PRON' in bigram:
        return False
    return True
In [48]:
# Filter for trigrams with only noun-type structures
def trigram_filter(trigram):
    tag = nltk.pos_tag(trigram)
    if tag[0][1] not in ['JJ', 'NN'] and tag[1][1] not in ['JJ','NN']:
        return False
    if trigram[0] in stop_word_list or trigram[-1] in stop_word_list or trigram[1] in stop_word_list:
        return False
    if 'n' in trigram or 't' in trigram:
         return False
    if 'PRON' in trigram:
        return False
    return True 
In [49]:
# Can set pmi threshold to whatever makes sense - eyeball through and select threshold where n-grams stop making sense
# choose top 500 ngrams in this case ranked by PMI that have noun like structures
filtered_bigram = bigram_pmi[bigram_pmi.apply(lambda bigram:\
                                              bigram_filter(bigram['bigram'])\
                                              and bigram.pmi > 5, axis = 1)][:500]

filtered_trigram = trigram_pmi[trigram_pmi.apply(lambda trigram: \
                                                 trigram_filter(trigram['trigram'])\
                                                 and trigram.pmi > 5, axis = 1)][:500]


bigrams = [' '.join(x) for x in filtered_bigram.bigram.values if len(x[0]) > 2 or len(x[1]) > 2]
trigrams = [' '.join(x) for x in filtered_trigram.trigram.values if len(x[0]) > 2 or len(x[1]) > 2 and len(x[2]) > 2]
In [54]:
# examples of bigrams
bigrams[:10]
Out[54]:
['hewlett packard',
 'wal mart',
 'akt le',
 'http www',
 'carpal tunnel',
 'snow leopard',
 'allen wrench',
 'trapper keeper',
 'stanley bostitch',
 'recommended cfh']
In [52]:
# examples of trigrams
trigrams[:10]
Out[52]:
['http www amazon',
 'epson labelworks lw',
 'www amazon com',
 'cyan magenta yellow',
 'thermal laminating pouch',
 'adhesive dot roller',
 'wilson jones heavy',
 'black cyan magenta',
 'automatic document feeder',
 'amazon vine program']
In [55]:
# Concatenate n-grams
def replace_ngram(x):
    for gram in trigrams:
        x = x.replace(gram, '_'.join(gram.split()))
    for gram in bigrams:
        x = x.replace(gram, '_'.join(gram.split()))
    return x
In [57]:
reviews_w_ngrams = clean_reviews.copy()
In [58]:
reviews_w_ngrams.reviewText = reviews_w_ngrams.reviewText.map(lambda x: replace_ngram(x))
In [59]:
# tokenize reviews + remove stop words + remove names + remove words with less than 2 characters
reviews_w_ngrams = reviews_w_ngrams.reviewText.map(lambda x: [word for word in x.split()\
                                                 if word not in stop_word_list\
                                                              and word not in english_names\
                                                              and len(word) > 2])
In [60]:
reviews_w_ngrams.head()
Out[60]:
0    [buy, PRON, PRON, serve, PRON, faithfully, los...
1    [belate, review, feel, obliged, share, PRON, v...
2    [kick, year, year, old, flawless, month, numbe...
3    [start, finance, stuff, recently, look, good, ...
4    [simple, calculation, discount, cash, flow, go...
Name: reviewText, dtype: object

Filter for only nouns

Nouns are the most likely indicators of a topic. For example, in the sentence 'The store is nice', the sentence is about the 'store'; the other words provide context and description around that topic. Filtering for nouns therefore keeps the words that are most interpretable in the topic model.

In [62]:
# Filter for only nouns
def noun_only(x):
    pos_comment = nltk.pos_tag(x)
    filtered = [word[0] for word in pos_comment if word[1] in ['NN']]
    # to filter both noun and verbs
    #filtered = [word[0] for word in pos_comment if word[1] in ['NN','VB', 'VBD', 'VBG', 'VBN', 'VBZ']]
    return filtered
In [63]:
final_reviews = reviews_w_ngrams.map(noun_only)

LDA Model

In [67]:
dictionary = corpora.Dictionary(final_reviews)
In [68]:
doc_term_matrix = [dictionary.doc2bow(doc) for doc in final_reviews]

Optimize # of k topics

LDA requires that we specify the number of topics that exist in the corpus of text. There are several common measures that can be optimized, such as predictive likelihood, perplexity, and coherence. Much of the literature indicates that maximizing coherence, particularly a measure named Cv (https://svn.aksw.org/papers/2015/WSDM_Topic_Evaluation/public.pdf), leads to better human interpretability. This measure assesses the interpretability of topics given the words in each generated topic, so we will optimize it here.

Since eval_every only controls the (relatively expensive) perplexity evaluation, we can set it to None to save time, as we will use a different metric, Cv, instead.

Try 5-24 topics...

In [87]:
coherence = []
for k in range(5,25):
    print('Round: '+str(k))
    Lda = gensim.models.ldamodel.LdaModel
    ldamodel = Lda(doc_term_matrix, num_topics=k, id2word = dictionary, passes=40,\
                   iterations=200, chunksize = 10000, eval_every = None)
    
    cm = gensim.models.coherencemodel.CoherenceModel(model=ldamodel, texts=final_reviews,\
                                                     dictionary=dictionary, coherence='c_v')
    coherence.append((k,cm.get_coherence()))
Round: 5
Round: 6
Round: 7
Round: 8
Round: 9
Round: 10
Round: 11
Round: 12
Round: 13
Round: 14
Round: 15
Round: 16
Round: 17
Round: 18
Round: 19
Round: 20
Round: 21
Round: 22
Round: 23
Round: 24
In [88]:
x_val = [x[0] for x in coherence]
y_val = [x[1] for x in coherence]
In [89]:
plt.plot(x_val,y_val)
plt.scatter(x_val,y_val)
plt.title('Number of Topics vs. Coherence')
plt.xlabel('Number of Topics')
plt.ylabel('Coherence')
plt.xticks(x_val)
plt.show()

Coherence stops improving significantly after 15 topics. The best number of topics is not always where Cv is highest, so we can try a couple of values and see which gives the best result; we'll try 15 and 23 here. Adding topics can help reveal further subtopics, but if the same words start to appear across multiple topics, the number of topics is too high.

In [113]:
Lda = gensim.models.ldamodel.LdaModel
ldamodel = Lda(doc_term_matrix, num_topics=15, id2word = dictionary, passes=40,\
               iterations=200,  chunksize = 10000, eval_every = None, random_state=0)
In [155]:
Lda2 = gensim.models.ldamodel.LdaModel
ldamodel2 = Lda2(doc_term_matrix, num_topics=23, id2word = dictionary, passes=40,\
               iterations=200,  chunksize = 10000, eval_every = None, random_state=0)

Some explanation of the parameters that can be tuned:

  • Passes: The number of times the model iterates through the whole corpus during training
  • Iterations: The maximum number of inference iterations performed over each document within a pass
  • Chunk size: The number of documents used in each training chunk

Other explanation of parameters: https://radimrehurek.com/gensim/models/ldamodel.html

In [142]:
# To show initial topics
ldamodel.show_topics(15, num_words=10, formatted=False)
Out[142]:
[(0,
  [('note', 0.05554076),
   ('color', 0.051286757),
   ('use', 0.048626028),
   ('post', 0.04068646),
   ('marker', 0.033643845),
   ('board', 0.028574942),
   ('work', 0.0190825),
   ('magnet', 0.01893561),
   ('pad', 0.015846577),
   ('stick', 0.01408235)]),
 (1,
  [('folder', 0.08729111),
   ('tab', 0.085292436),
   ('chair', 0.06593559),
   ('file', 0.06276117),
   ('use', 0.02441812),
   ('file_folder', 0.022120677),
   ('office', 0.021025151),
   ('color', 0.018232403),
   ('seat', 0.014802855),
   ('product', 0.013404625)]),
 (2,
  [('binder', 0.09963777),
   ('pocket', 0.045993555),
   ('paper', 0.043687668),
   ('cover', 0.03973441),
   ('use', 0.028761666),
   ('notebook', 0.02356954),
   ('plastic', 0.019488329),
   ('divider', 0.0129221305),
   ('sheet', 0.010436761),
   ('look', 0.009696761)]),
 (3,
  [('pencil', 0.046214968),
   ('use', 0.039241314),
   ('work', 0.017876107),
   ('hand', 0.017292611),
   ('point', 0.015187497),
   ('mouse', 0.012842767),
   ('feel', 0.012571061),
   ('time', 0.011546981),
   ('gel', 0.010329655),
   ('sharpener', 0.010041249)]),
 (4,
  [('paper', 0.29295966),
   ('sheet', 0.057533994),
   ('cut', 0.027506119),
   ('line', 0.0265265),
   ('use', 0.014041655),
   ('edge', 0.009124642),
   ('time', 0.008825456),
   ('scissor', 0.008782924),
   ('cutter', 0.008229269),
   ('try', 0.007084295)]),
 (5,
  [('label', 0.1723103),
   ('use', 0.053069115),
   ('print', 0.042996094),
   ('avery', 0.02533661),
   ('product', 0.023779038),
   ('template', 0.019389177),
   ('printer', 0.016582226),
   ('sheet', 0.016408889),
   ('work', 0.014924869),
   ('design', 0.013208105)]),
 (6,
  [('box', 0.06719012),
   ('use', 0.043874178),
   ('stapler', 0.028905349),
   ('laminator', 0.020024901),
   ('work', 0.019376948),
   ('staple', 0.017313404),
   ('size', 0.016116722),
   ('time', 0.014754222),
   ('sheet', 0.01393789),
   ('machine', 0.013884576)]),
 (7,
  [('printer', 0.090115316),
   ('print', 0.0597321),
   ('use', 0.029410886),
   ('paper', 0.018815335),
   ('photo', 0.018737754),
   ('color', 0.016101355),
   ('epson', 0.015446978),
   ('quality', 0.014941996),
   ('work', 0.014857389),
   ('scanner', 0.014041543)]),
 (8,
  [('desk', 0.034735076),
   ('use', 0.022344066),
   ('wall', 0.017067196),
   ('work', 0.016676415),
   ('place', 0.015774151),
   ('look', 0.014846028),
   ('product', 0.012962775),
   ('holder', 0.012792616),
   ('mount', 0.012480653),
   ('thing', 0.012047083)]),
 (9,
  [('device', 0.04520504),
   ('button', 0.044987217),
   ('power', 0.03331916),
   ('use', 0.025703644),
   ('battery', 0.023691626),
   ('unit', 0.021187603),
   ('iphone', 0.017915428),
   ('wifi', 0.017026613),
   ('plug', 0.016983017),
   ('ipad', 0.016728291)]),
 (10,
  [('card', 0.05332803),
   ('envelope', 0.04079895),
   ('use', 0.032934155),
   ('look', 0.024511479),
   ('letter', 0.022425689),
   ('gift', 0.022383856),
   ('product', 0.019586928),
   ('mail', 0.01848597),
   ('design', 0.014794249),
   ('coupon', 0.014493796)]),
 (11,
  [('tape', 0.1346477),
   ('use', 0.045706317),
   ('dispenser', 0.029944325),
   ('roll', 0.02497533),
   ('scotch', 0.019148348),
   ('package', 0.018629175),
   ('product', 0.017392516),
   ('box', 0.014107367),
   ('work', 0.013217546),
   ('stick', 0.012427195)]),
 (12,
  [('ink', 0.0934071),
   ('price', 0.069271475),
   ('cartridge', 0.043713454),
   ('quality', 0.029901318),
   ('cost', 0.027726078),
   ('color', 0.023341969),
   ('use', 0.023152564),
   ('product', 0.021041602),
   ('work', 0.02068136),
   ('purchase', 0.018761719)]),
 (13,
  [('phone', 0.08554327),
   ('number', 0.026325107),
   ('unit', 0.0260776),
   ('use', 0.025917714),
   ('feature', 0.020520078),
   ('work', 0.017577784),
   ('scale', 0.015635148),
   ('time', 0.014621188),
   ('service', 0.014241095),
   ('base', 0.014212798)]),
 (14,
  [('use', 0.043373954),
   ('thing', 0.02428259),
   ('book', 0.023880767),
   ('work', 0.022635672),
   ('year', 0.020149278),
   ('shredder', 0.019987088),
   ('time', 0.018705543),
   ('school', 0.017381074),
   ('day', 0.013657781),
   ('home', 0.012003456)])]
In [144]:
# To show initial topics
ldamodel2.show_topics(23, num_words=10, formatted=False)
Out[144]:
[(0,
  [('color', 0.13042007),
   ('use', 0.04939971),
   ('marker', 0.049138594),
   ('ink', 0.045652367),
   ('write', 0.025547048),
   ('tip', 0.02120447),
   ('work', 0.01799116),
   ('sharpie', 0.015334195),
   ('line', 0.013950256),
   ('cap', 0.01391556)]),
 (1,
  [('tab', 0.12561414),
   ('folder', 0.105585515),
   ('file', 0.07379385),
   ('use', 0.044929538),
   ('file_folder', 0.028123518),
   ('color', 0.027827552),
   ('document', 0.018135134),
   ('divider', 0.017267503),
   ('product', 0.014797443),
   ('work', 0.012500131)]),
 (2,
  [('binder', 0.13120762),
   ('pocket', 0.060566276),
   ('cover', 0.051677737),
   ('paper', 0.03549023),
   ('notebook', 0.029100694),
   ('use', 0.024918718),
   ('plastic', 0.022741886),
   ('spine', 0.01185665),
   ('look', 0.010828309),
   ('hold', 0.0097193085)]),
 (3,
  [('scanner', 0.037928365),
   ('software', 0.032020744),
   ('use', 0.02661672),
   ('device', 0.019739488),
   ('work', 0.018324248),
   ('document', 0.01643745),
   ('image', 0.013871932),
   ('option', 0.010788011),
   ('time', 0.010738234),
   ('app', 0.009616761)]),
 (4,
  [('line', 0.04810502),
   ('list', 0.032569233),
   ('cut', 0.025306059),
   ('section', 0.024392368),
   ('space', 0.01914207),
   ('use', 0.013752885),
   ('help', 0.012654899),
   ('word', 0.012498034),
   ('scissor', 0.011455204),
   ('time', 0.011337658)]),
 (5,
  [('label', 0.09545807),
   ('use', 0.038240943),
   ('label_maker', 0.0367063),
   ('size', 0.032777734),
   ('print', 0.02810787),
   ('option', 0.020425789),
   ('brother', 0.016577503),
   ('tape', 0.0145291425),
   ('dymo', 0.014353474),
   ('font', 0.013756974)]),
 (6,
  [('use', 0.055573765),
   ('stapler', 0.046579275),
   ('laminator', 0.03226891),
   ('work', 0.029730076),
   ('staple', 0.02789948),
   ('machine', 0.022236329),
   ('sheet', 0.020912863),
   ('time', 0.019748287),
   ('laminate', 0.018641733),
   ('office', 0.012587159)]),
 (7,
  [('printer', 0.11308775),
   ('print', 0.072835185),
   ('use', 0.03140155),
   ('color', 0.021372763),
   ('photo', 0.020456113),
   ('cartridge', 0.018776828),
   ('ink', 0.018743694),
   ('quality', 0.01753501),
   ('epson', 0.016801588),
   ('work', 0.014786961)]),
 (8,
  [('board', 0.057055254),
   ('magnet', 0.036343493),
   ('use', 0.03401388),
   ('wall', 0.029806193),
   ('work', 0.021534385),
   ('hang', 0.019245533),
   ('surface', 0.017176444),
   ('fridge', 0.015672252),
   ('size', 0.015249013),
   ('frame', 0.014354303)]),
 (9,
  [('button', 0.048002463),
   ('display', 0.04251696),
   ('battery', 0.04007704),
   ('power', 0.030529827),
   ('function', 0.022540586),
   ('use', 0.020540902),
   ('unit', 0.020411884),
   ('calculator', 0.019436345),
   ('press', 0.01764168),
   ('light', 0.015444978)]),
 (10,
  [('card', 0.07395634),
   ('picture', 0.033476386),
   ('fun', 0.030240938),
   ('gift', 0.030221147),
   ('use', 0.029920796),
   ('iphone', 0.029108062),
   ('look', 0.025734916),
   ('love', 0.024445744),
   ('design', 0.015442862),
   ('cute', 0.012419679)]),
 (11,
  [('tape', 0.17635933),
   ('use', 0.049484335),
   ('dispenser', 0.040408727),
   ('roll', 0.031698085),
   ('scotch', 0.023662435),
   ('package', 0.019743662),
   ('product', 0.016282355),
   ('work', 0.014740866),
   ('hand', 0.013856983),
   ('time', 0.011419797)]),
 (12,
  [('price', 0.08725204),
   ('ink', 0.04019026),
   ('quality', 0.03321029),
   ('product', 0.03177333),
   ('cost', 0.030540118),
   ('cartridge', 0.027756458),
   ('brand', 0.023580289),
   ('buy', 0.023349397),
   ('work', 0.02322344),
   ('purchase', 0.02301751)]),
 (13,
  [('phone', 0.10067777),
   ('shredder', 0.04033353),
   ('use', 0.028559031),
   ('work', 0.022776622),
   ('feature', 0.022676079),
   ('unit', 0.022664795),
   ('number', 0.021434642),
   ('time', 0.018948015),
   ('wifi', 0.016510718),
   ('service', 0.014403109)]),
 (14,
  [('use', 0.06622348),
   ('work', 0.030752605),
   ('school', 0.029984517),
   ('thing', 0.028314037),
   ('year', 0.02751095),
   ('home', 0.024571154),
   ('day', 0.019894343),
   ('office', 0.019291528),
   ('time', 0.015584958),
   ('month', 0.013385191)]),
 (15,
  [('note', 0.13345195),
   ('post', 0.0915535),
   ('book', 0.06520335),
   ('use', 0.04418773),
   ('stick', 0.023254061),
   ('color', 0.023015738),
   ('pad', 0.022719262),
   ('product', 0.0217553),
   ('mark', 0.019886099),
   ('sticky_note', 0.01667514)]),
 (16,
  [('paper', 0.38546053),
   ('sheet', 0.0729975),
   ('envelope', 0.023417415),
   ('size', 0.013255699),
   ('piece', 0.010087976),
   ('photo', 0.008301831),
   ('quality', 0.008212537),
   ('airprint', 0.0070669306),
   ('letter', 0.0069985334),
   ('stack', 0.0069163973)]),
 (17,
  [('use', 0.053630367),
   ('surface', 0.051243808),
   ('pad', 0.037238825),
   ('mouse', 0.033375982),
   ('work', 0.024146406),
   ('product', 0.021731285),
   ('glue', 0.019381285),
   ('wrist', 0.019195158),
   ('keyboard', 0.017723132),
   ('mouse_pad', 0.017507704)]),
 (18,
  [('label', 0.16575454),
   ('use', 0.05483286),
   ('print', 0.044304166),
   ('product', 0.026850592),
   ('avery', 0.026732037),
   ('template', 0.022258537),
   ('printer', 0.01962065),
   ('sheet', 0.018373849),
   ('work', 0.017547501),
   ('design', 0.015463634)]),
 (19,
  [('box', 0.18429852),
   ('item', 0.03778836),
   ('use', 0.026292147),
   ('storage', 0.025811281),
   ('store', 0.025140334),
   ('size', 0.021184357),
   ('cardboard', 0.017525708),
   ('pack', 0.015550314),
   ('thing', 0.011347521),
   ('handle', 0.009608397)]),
 (20,
  [('pencil', 0.11590351),
   ('use', 0.031844214),
   ('sharpener', 0.025182469),
   ('point', 0.023583025),
   ('lead', 0.02334912),
   ('scale', 0.022006663),
   ('work', 0.01728826),
   ('clip', 0.016115276),
   ('hand', 0.015239838),
   ('grip', 0.0142835155)]),
 (21,
  [('product', 0.04753347),
   ('holder', 0.04566283),
   ('letter', 0.029462313),
   ('look', 0.023575727),
   ('drawer', 0.022021983),
   ('case', 0.01852576),
   ('mail', 0.01749215),
   ('design', 0.017376171),
   ('place', 0.017036516),
   ('item', 0.016385237)]),
 (22,
  [('desk', 0.06657937),
   ('chair', 0.04610483),
   ('monitor', 0.023994518),
   ('sit', 0.020361725),
   ('arm', 0.01725616),
   ('look', 0.0159833),
   ('use', 0.015491127),
   ('work', 0.015354036),
   ('office', 0.015131725),
   ('foot', 0.01365214)])]

23 topics yielded clearer results, so we'll go with this...

Relevancy

Sometimes, words that are ranked as top words for a given topic may be ranked high because they are globally frequent across text in a corpus. Relevancy score helps to prioritize terms that belong more exclusively to a given topic. This can increase interpretability even more. The relevance of term w to topic k is defined as:

$ r(w, k \mid \lambda) = \lambda \log(\phi_{kw}) + (1 - \lambda)\log\left(\frac{\phi_{kw}}{p_{w}}\right) $

where $\phi_{kw}$ is the probability of term w in topic k and $\frac{\phi_{kw}}{p_{w}}$ is the lift of the term's probability within the topic over its marginal probability $p_w$ across the corpus (this helps discard globally frequent terms). A lower lambda value gives more weight to the second term, i.e. to topic exclusivity. We can use Python's pyLDAvis to explore this interactively: as lambda is lowered, terms that are more exclusive to a given topic rise to the top of its ranking.

Paper behind this tool: https://nlp.stanford.edu/events/illvi2014/papers/sievert-illvi2014.pdf

In [157]:
topic_data =  pyLDAvis.gensim.prepare(ldamodel2, doc_term_matrix, dictionary, mds = 'pcoa')
pyLDAvis.display(topic_data)
Out[157]:

The pyLDAvis tool also gives two other important pieces of information. Each circle represents a topic, and the distance between circles visualizes how related the topics are to each other; the plot above shows that our topics are quite distinct. The dimensionality reduction method can be chosen via the mds argument (e.g. PCoA, MMDS, or t-SNE); the example above uses PCoA. Additionally, the size of a circle represents how prevalent that topic is across the corpus of reviews. Topic 1, for example, is the most prevalent, accounting for 17.1% of the tokens.
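For example, switching the projection is just a matter of changing the mds argument (a hypothetical variant of the call above, reusing the same model and dictionary):

# Use t-SNE instead of PCoA for the topic scatter plot
topic_data_tsne = pyLDAvis.gensim.prepare(ldamodel2, doc_term_matrix, dictionary, mds='tsne')
pyLDAvis.display(topic_data_tsne)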

To extract the words for a given lambda:

In [153]:
all_topics = {}
num_terms = 10 # Adjust number of words to represent each topic
lambd = 0.6 # Adjust this accordingly based on tuning above
for i in range(1,24): #Adjust this to reflect number of topics chosen for final LDA model
    topic = topic_data.topic_info[topic_data.topic_info.Category == 'Topic'+str(i)].copy()
    topic['relevance'] = topic['loglift']*(1-lambd)+topic['logprob']*lambd
    all_topics['Topic '+str(i)] = topic.sort_values(by='relevance', ascending=False).Term[:num_terms].values
In [154]:
pd.DataFrame(all_topics).T
Out[154]:
0 1 2 3 4 5 6 7 8 9
Topic 1 printer print photo epson cartridge ink canon color wireless fax
Topic 2 scanner software device image app document tablet program option ipad
Topic 3 price cost ink buy amazon brand purchase pay cartridge quality
Topic 4 school use year home child day thing kid organizer son
Topic 5 stapler laminator staple laminate swingline machine use heat sheet lamination
Topic 6 tape dispenser roll scotch package use wrap strip packing_tape packaging
Topic 7 label avery template print peel address use ticket product sticker
Topic 8 color marker write ink tip sharpie cap erase blue skin
Topic 9 board magnet wall hang fridge frame door refrigerator mount surface
Topic 10 desk chair monitor arm sit foot seat password position leg
Topic 11 binder pocket cover notebook spine plastic ring paper divider hinge
Topic 12 paper sheet envelope airprint security control_panel card_stock pastel ream cardstock
Topic 13 phone shredder wifi handset sound service number unit cell_phone message
Topic 14 note post book sticky_note mark pad stick flag use cloud
Topic 15 box storage item cardboard store banker_box pack handle size bubble_wrap
Topic 16 surface mouse pad wrist glue mouse_pad keyboard bottle rest use
Topic 17 pencil sharpener scale lead pencil_sharpener cup barrel mechanical_pencil point clip
Topic 18 line list section cut scissor speaker cutter activity trimmer blade
Topic 19 tab folder file file_folder divider paperwork filing tax file_cabinet use
Topic 20 card iphone gift fun picture cute love art business_card shirt
Topic 21 display battery button calculator power function key press ounce light
Topic 22 holder letter drawer mail product hanging_folder filing_cabinet slot stamp monitor_arm
Topic 23 label_maker label dymo font labeler symbol comb border brother versatile

We can see here that most topics are quite clear. In this case, they are clustered into the different products being talked about in the reviews. Some clear topics are:

  • Printer products
  • Scanner products
  • Printer Ink, pricing/quality
  • School stationery
  • Lamination
  • Packing tape
  • Mailing labels
  • Markers
  • Magnetic boards
  • Workstation
  • Binder
  • Paper products
  • Phone
  • Post-its
  • Storage/packing materials
  • Mouse/keyboard
  • Pencil
  • Cutting
  • File Organizer
  • Business card
  • Calculator
  • Mailing materials
  • Label maker

Other models:

  • Non-negative matrix factorization (NMF): Less computationally expensive, but assumes fixed probability vectors of multinomials across documents, whereas LDA allows them to vary. If we believe the topic probabilities should remain fixed for each document, or in small-data settings where the extra variability from priors is too much, NMF might be better (a minimal sketch appears after the links below).
  • Latent semantic indexing (LSI): a variant of PCA/SVD. Again much faster, but generally less accurate; similar in spirit to NMF, but with a different loss and constraints.
  • Hierarchical Dirichlet Process (HDP): uses posterior inference to determine the number of topics for you

https://www.cs.cmu.edu/~epxing/Class/10708-15/slides/LDA_SC.pdf
https://en.wikipedia.org/wiki/Hierarchical_Dirichlet_process
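As a point of comparison, here is a minimal NMF sketch using scikit-learn (an assumption about tooling; scikit-learn is not used elsewhere in this notebook) on the same preprocessed reviews:

from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Assumes `final_reviews` from above (one list of tokens per review)
texts = [' '.join(doc) for doc in final_reviews]

vectorizer = TfidfVectorizer(max_features=5000)
X = vectorizer.fit_transform(texts)

nmf = NMF(n_components=23, random_state=0)   # same number of topics as the final LDA model
W = nmf.fit_transform(X)                     # document-topic weights
H = nmf.components_                          # topic-word weights

terms = vectorizer.get_feature_names_out()   # get_feature_names() on older scikit-learn
for k, row in enumerate(H[:5]):              # top words for the first few topics
    top_terms = [terms[i] for i in row.argsort()[::-1][:10]]
    print(k, top_terms)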
