What is TF*IDF?
TF*IDF = Term Frequency — Inverse Document Frequency.
If you’ve been busy creating content (a good thing), rather than researching the latest SEO strategies, you may not be familiar with TF-IDF. It is a way to evaluate and improve your content’s topical relevance.
I first crossed paths with TF-IDF over 10 years ago. In my corporate life, I was the US General Manager of a small software division of Mitsubishi Electric. One of the things we were developing was a search tool based on a Vector Space Model (VSM) to mine large volumes of data for topical relevancy. TF-IDF was used to determine the weights of the vectors that the VSM used.
Mathematically, TF-IDF multiplies how often a keyword appears on a page (TF) by a factor that shrinks the more widely that keyword appears across a larger set of documents (IDF). A term scores high when it is frequent on your page but relatively rare elsewhere. TF-IDF is well known and validated; it is not some new fad that may fade. It is an evolution beyond things like latent semantic indexing (LSI). There is plenty of information on the math and science behind TF-IDF.
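For readers who want a quick peek at the math, here is a minimal sketch in Python of one common textbook formulation: TF is the term's share of the words on a page, and IDF is the log of the total number of documents divided by the number that contain the term. The function names and the three-sentence corpus are made up for illustration, and real tools use many variants (smoothing, different log bases, normalization).

```python
import math

def term_frequency(term, words):
    """Share of the words on a page that are `term`."""
    return words.count(term) / len(words)

def inverse_document_frequency(term, corpus):
    """log(total documents / documents containing `term`)."""
    containing = sum(1 for words in corpus if term in words)
    if containing == 0:
        return 0.0  # assumption: score unseen terms as 0 instead of dividing by zero
    return math.log(len(corpus) / containing)

def tf_idf(term, words, corpus):
    """TF multiplied by IDF for one term on one page."""
    return term_frequency(term, words) * inverse_document_frequency(term, corpus)

# Hypothetical mini-corpus, already lowercased and split into words.
corpus = [
    "the crane lifted the steel girder".split(),
    "the flock of cranes left the shoreline".split(),
    "the ballgame started late".split(),
]
page = corpus[0]

the_score = tf_idf("the", page, corpus)      # "the" is in every document, so IDF = log(1) = 0
crane_score = tf_idf("crane", page, corpus)  # "crane" is in only one document, so it scores higher
print(the_score, crane_score)
```

A word that appears everywhere ("the") ends up with a score of zero no matter how often it is used, while a word that is concentrated on one page ("crane") stands out.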
Best to leave all that to the data scientists though — let’s stick with practical usage.
Does Google use TF*IDF?
It is not likely that Google uses TF-IDF as a ranking signal. They have said that they use it for eliminating stop words. Here is Google’s John Mueller on the topic …
How is TF*IDF Used in SEO?
The idea is to find topically-relevant terms. With TF*IDF you are trying to increase the topical relevance of your pages. Google search also needs to be able to determine the topical relevance of words that are spelled the same but have different meanings (homonyms) …
- There was a lovely flock of cranes at the shoreline.
- A crane was used to lift the 10-ton steel girder into place.
- She was seated behind a pole and had to crane her neck to see the ballgame.
TF-IDF is a way to calculate the importance of a keyword by comparing its frequency on your page to the same keyword's frequency in a larger set of documents.
A typical TF*IDF SEO tool works through steps like these …
- Input the keyword and the page to be researched.
- Fetch Google’s results and get the 10 top-ranking competitors.
- Parse the content of each of the competitors.
- Extract relevant keywords from the pages.
- Calculate the TF-IDF of those terms for each competitive page.
- Calculate the TF-IDF for the same terms on your page.
- Create a table of the TF-IDF values for the page to be optimized and those of the 10 competitors.
- For each extracted term, suggest whether to add more, remove some, or leave as is on your page.
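The scoring and suggestion steps in that list can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular tool works: the fetching and parsing steps are skipped (the page texts are passed in as plain strings), the corpus is just your page plus the competitors, and the `tolerance` threshold and the function names are hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def tf_idf_scores(documents):
    """Return one {term: TF-IDF} dict per document, using the documents as the corpus."""
    tokenized = [tokenize(doc) for doc in documents]
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for words in tokenized for term in set(words))
    n_docs = len(tokenized)
    scores = []
    for words in tokenized:
        counts = Counter(words)
        scores.append({
            term: (count / len(words)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores

def suggest(my_page, competitor_pages, tolerance=0.5):
    """Label each competitor term 'add more', 'remove some' or 'leave as is'.

    Assumption: a term is fine if my score is within +/- `tolerance`
    (as a fraction) of the competitors' average score.
    """
    scores = tf_idf_scores([my_page] + competitor_pages)
    mine, competitors = scores[0], scores[1:]
    suggestions = {}
    for term in sorted(set().union(*competitors)):
        average = sum(c.get(term, 0.0) for c in competitors) / len(competitors)
        my_score = mine.get(term, 0.0)
        if my_score < average * (1 - tolerance):
            suggestions[term] = "add more"
        elif my_score > average * (1 + tolerance):
            suggestions[term] = "remove some"
        else:
            suggestions[term] = "leave as is"
    return suggestions

# Hypothetical page texts standing in for parsed page content.
my_page = "affiliate seo keyword keyword keyword tips"
competitor_pages = [
    "affiliate seo backlinks tips guide",
    "affiliate seo backlinks tips",
    "affiliate seo backlinks keyword",
]
result = suggest(my_page, competitor_pages)
for term, advice in result.items():
    print(f"{term:10} {advice}")
```

With these toy inputs, "backlinks" comes back as "add more" (the competitors all use it and my page does not), "keyword" as "remove some", and "tips" as "leave as is". Terms that appear on every page, like "affiliate" and "seo" here, score zero everywhere and are left alone.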
Tools to Analyze TF-IDF for SEO
There are a number of TF-IDF tools available. Here is a short list of free tools you can check out.
Surfer SEO is a close second for me. They don’t technically use TF-IDF but a similar term frequency method.
I used the term “affiliate SEO” and the page Affiliate SEO: On-Page & Off-Page How to Improve Your Search Position as the keyword and document to research. Before any TF-IDF optimization, this page bounced between pages 2 and 3 for this term (tracked with SerpRobot), and I want to get it higher up in the SERPs. You can see from the Website Auditor graph below that there are a number of recommendations.
Go to our affiliate on-page SEO guide for related SEO tips.
I will make all of the recommended changes and will track this document’s ranking over the next few months. I will update this page with the results. Of course, this is an anecdotal, non-scientific test, but it is a real-life situation many affiliates will encounter.
TF-IDF for New Content
The example above gives you an idea of how you might use TF-IDF for improving the ranking of existing content. But you can also use it to help guide new content development. One of the first things you will notice if you do TF-IDF research on a group of pages is that some of the keywords seem almost unrelated. This is one of the big advantages of using these tools. Your basic keyword research tool is not going to provide these keywords.
Website Auditor wants a keyword and your document to analyze. For research to develop new content I use SEObility, which gives you three free searches per day.
The graph was created by entering the keyword “anchor text”. SEObility found the top 10 results and did the TF-IDF analysis. You are then able to review the top 50 keywords for each of the ten sites. In this case, one thing I noticed is that “penguin” showed up in most of the articles (the Google update, not the Antarctic bird). This would prompt me to be sure to cover the Google Penguin update in my article.
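Spotting a term like “penguin” across most of the ranking pages is easy to automate once you have the keyword lists. Here is a minimal sketch, assuming you have already exported each page's top keywords as a list; the `common_terms` function, the `min_share` cutoff, and the sample lists are all hypothetical.

```python
from collections import Counter

def common_terms(top_keywords_per_page, min_share=0.6):
    """Terms that appear in at least `min_share` of the pages' keyword lists."""
    n_pages = len(top_keywords_per_page)
    # Count each term once per page, however often it is listed there.
    page_counts = Counter(
        term for keywords in top_keywords_per_page for term in set(keywords)
    )
    return sorted(
        term for term, count in page_counts.items()
        if count / n_pages >= min_share
    )

# Hypothetical keyword lists for five ranking pages (a real export would be much longer).
pages = [
    ["anchor", "text", "penguin", "links"],
    ["anchor", "text", "penguin", "exact"],
    ["anchor", "text", "branded"],
    ["anchor", "text", "penguin", "links"],
    ["anchor", "text", "penguin"],
]
shared = common_terms(pages)
print(shared)  # ['anchor', 'penguin', 'text']
```

Anything in that list that is missing from your outline is a candidate subtopic to cover.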
What is TF-IDF?
TF-IDF stands for term frequency-inverse document frequency. It is an information retrieval technique used to evaluate topical relevance.
How is TF-IDF used in SEO environments?
One of the most common uses of TF-IDF is to compare your content to content ranking above you and model your content to be similar to your competitors.
What tools use TF-IDF or a similar methodology?
A number of SEO tools will analyze the keywords in your content against those of your competitors, including CORA, Page Optimizer, Website Auditor, and Surfer SEO.
Does Google use TF*IDF?
Google may be using it for some tasks, like removing stop words, but it is not likely a ranking signal.
TF*IDF is another useful tool for improving your on-page SEO. It will show you the frequency of keywords in a document and how that compares with the same keywords in competing documents. Basically, it is making keyword density recommendations. It will also uncover keywords that you may not have known were topically relevant. Improving your content by including this related information and these keywords should help your pages rank higher.
It is not a magic pill though.
The biggest issue I see is the size of the dataset you are comparing against. Even if Google were also using some TF*IDF-derived calculations, they would be looking at a different set of documents. Also, TF*IDF is designed and validated to work on large sets of documents. The results of comparing your keyword frequency to just 10 competitive documents are questionable.
Even so, there is little downside to trying it: the process of focusing on TF*IDF improves your content and broadens its appeal to a larger audience.
TF-IDF advantages include …
- TF*IDF gets you to focus on your content quality relative to your competitors.
- It shows you whether you are over- or under-using a keyword.
- TF*IDF shows you relevant keywords that you are missing.
- It is useful for both improving existing content and selecting new topics.