What is TF-IDF
TF-IDF stands for Term Frequency-Inverse Document Frequency.
If you’ve been busy creating content (a good thing), rather than researching the latest SEO strategies, you may not be familiar with TF-IDF. It is a way to evaluate and improve your content’s topical relevance.
I first crossed paths with TF-IDF over 10 years ago. In my corporate life, I was the US General Manager of a small software division of Mitsubishi Electric. One of the things we were developing was a search tool based on a Vector Space Model (VSM) to mine large volumes of data for topical relevancy. TF-IDF was used to determine the weights of the vectors that the VSM used.
Mathematically, TF-IDF combines how often a keyword appears on a page (TF) with how rare that keyword is across a larger set of documents (IDF). A term scores highest when it is frequent on your page but uncommon elsewhere. TF-IDF is well known and validated; it is not some new fad that may fade. In SEO practice it is often treated as a step beyond things like latent semantic indexing (LSI), although as a technique it is actually the older of the two. There is plenty of information on the math and science behind TF-IDF.
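For reference, here is one common textbook formulation. Tools differ in the exact weighting and smoothing they apply, so treat this as a sketch rather than what any particular SEO tool computes. The symbols are my own notation: f(t,d) is the number of times term t appears in document d, D is the set of documents being compared, and N is the number of documents in D.

```latex
\mathrm{tf}(t,d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}
\qquad
\mathrm{idf}(t,D) = \log\frac{N}{\left|\{\, d \in D : t \in d \,\}\right|}
\qquad
\text{tf-idf}(t,d,D) = \mathrm{tf}(t,d)\cdot\mathrm{idf}(t,D)
```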
Best to leave all that to the data scientists though — let’s stick with practical usage.
How is TF-IDF Used in SEO
The idea is to find topically relevant terms. With TF-IDF you are trying to increase the topical relevance of your pages. Google search also needs to be able to determine the topical relevance of words that are spelled the same but have different meanings (homonyms) …
- There was a lovely flock of cranes at the shoreline.
- A crane was used to lift the 10-ton steel girder into place.
- She was seated behind a pole and had to crane her neck to see the ballgame.
TF-IDF is a way to calculate the importance of a keyword by comparing its frequency on your page to the same keyword’s frequency in a larger set of documents.
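To make that comparison concrete, here is a minimal Python sketch that uses the three "crane" sentences above as a toy document set. The function names and the plain log(N / document count) weighting are my own assumptions for illustration; real tools use far larger document sets and usually add smoothing.

```python
import math

def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    cleaned = (word.strip(".,;:!?()\"'").lower() for word in text.split())
    return [word for word in cleaned if word]

def term_frequency(term, tokens):
    """Share of the document's words that are `term`."""
    return tokens.count(term) / len(tokens) if tokens else 0.0

def inverse_document_frequency(term, corpus):
    """log(N / number of documents containing `term`); 0.0 if none contain it."""
    containing = sum(1 for tokens in corpus if term in tokens)
    return math.log(len(corpus) / containing) if containing else 0.0

def tf_idf(term, tokens, corpus):
    """Frequency on the page, weighted by rarity across the corpus."""
    return term_frequency(term, tokens) * inverse_document_frequency(term, corpus)

# Toy corpus: the three "crane" sentences from this article.
corpus = [tokenize(sentence) for sentence in (
    "There was a lovely flock of cranes at the shoreline.",
    "A crane was used to lift the 10-ton steel girder into place.",
    "She was seated behind a pole and had to crane her neck to see the ballgame.",
)]
page = corpus[1]  # the construction-crane sentence

for term in ("was", "crane", "girder"):
    print(term, round(tf_idf(term, page, corpus), 3))
```

In this toy set, "was" appears in every sentence and scores zero, "crane" appears in two of the three and gets a modest score, and "girder" appears only on the page being analyzed and scores highest. That is the behavior you want: common words wash out and distinctive words stand out.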
A typical TF-IDF SEO tool would follow steps like these …
- Input the keyword and the page to be researched
- Fetch Google’s results and get the 10 top-ranking competitors
- Parse the content of each competitor’s page
- Extract relevant keywords from the pages
- Calculate the TF-IDF for each competitor’s page
- Calculate the TF-IDF for the same terms on your page
- Create a table of the TF-IDF values for the page to be optimized and those of the 10 competitors
- For each extracted term, suggest whether to add more, remove some, or leave as is on your page
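The scoring and suggestion steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how Website Auditor or any other tool actually works. It skips fetching and parsing and starts from page text you already have, it computes IDF over just the pages being compared, and the `suggest` function with its `tolerance` threshold is hypothetical.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    cleaned = (word.strip(".,;:!?()\"'").lower() for word in text.split())
    return [word for word in cleaned if word]

def tfidf_per_page(pages):
    """Return one {term: tf-idf} dict per page; idf is computed over `pages` only."""
    tokenized = [tokenize(page) for page in pages]
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(len(pages) / doc_freq[term])
            for term, count in counts.items()
        })
    return scores

def suggest(my_page, competitor_pages, tolerance=0.5):
    """Build {term: (my_score, competitor_average, advice)}.

    `tolerance` is a made-up threshold for this sketch: scores within
    +/- 50% of the competitor average are treated as close enough.
    """
    mine, *competitors = tfidf_per_page([my_page] + competitor_pages)
    table = {}
    for term in sorted(set().union(*competitors)):
        average = sum(page.get(term, 0.0) for page in competitors) / len(competitors)
        if average == 0.0:
            continue  # the term is on every page, so its idf is 0 and it carries no signal
        my_score = mine.get(term, 0.0)
        if my_score < average * (1 - tolerance):
            advice = "add more"
        elif my_score > average * (1 + tolerance):
            advice = "use less"
        else:
            advice = "leave as is"
        table[term] = (my_score, average, advice)
    return table

# Tiny stand-ins for real page text.
my_page = "seo tips and more seo tips"
competitor_pages = ["seo tips penguin update", "seo links penguin guide"]

table = suggest(my_page, competitor_pages)
for term, (my_score, average, advice) in table.items():
    print(f"{term:10} mine={my_score:.3f} competitors={average:.3f} -> {advice}")
```

With these stand-in pages, "penguin" is flagged as missing from my page, "tips" as overused, and "seo" is dropped because it appears on every page and so tells you nothing.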
Tools to Analyze TF-IDF for SEO
There are a number of TF-IDF tools available. Here is a short list of free tools you can check out.
My preferred tool is the Website Auditor module of Link Assistant. This is not a SaaS product. It runs on your desktop and is available for Windows, macOS, and Linux. I wanted to do a simple test of TF-IDF for an article.
I used the term “affiliate SEO” as the keyword and the page “Affiliate SEO: On-Page & Off-Page How to Improve Your Search Position” as the document to research. Before any TF-IDF optimization, I danced between pages 2 and 3 for this term (according to SerpRobot), and I want to get this page higher up in the SERPs. You can see from the Website Auditor graph below that there are a number of recommendations.
I will make all of the recommended changes and will track this document’s ranking over the next few months. I will update this page with the results. Of course, this is an anecdotal, non-scientific test, but it is a real-life situation many affiliates will encounter.
TF-IDF for New Content
The example above gives you an idea of how you might use TF-IDF for improving the ranking of existing content. But you can also use it to help guide new content development. One of the first things you will notice if you do TF-IDF research on a group of pages is that some of the keywords seem almost unrelated. This is one of the big advantages of using these tools. Your basic keyword research tool is not going to provide these keywords.
Website Auditor wants both a keyword and your document to analyze. For research to develop new content I use SEObility, which gives you three free searches per day.
The graph was created by entering the keyword “anchor text”. SEObility found the top 10 results and did the TF-IDF analysis. You are then able to review the top 50 keywords for each of the ten sites. In this case, one thing I noticed is that “penguin” showed up in most of the articles (the Google update, not the Antarctic bird). This would prompt me to be sure to cover the Google Penguin update in my article.
TF-IDF is another useful tool for improving your on-page SEO. It will show you the frequency of keywords in your document and how that compares with the same keywords in competing documents. It will uncover keywords that you may not have known were topically relevant. Improving your content by including this related information and these keywords should help your pages rank higher.
It is not a magic pill though.
The biggest issue I see is the size of the dataset you are comparing against. Even if Google were also using some TF-IDF-derived calculations, they’re looking at a different set of documents. Also, TF-IDF is designed and validated to work on large sets of documents. The results of comparing your keyword frequency to just 10 competing documents are questionable.
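A rough way to see why a 10-document comparison is shaky: with so few documents, a single competitor adding or dropping a word moves that word's IDF a lot. The sketch below assumes the plain log(N / document count) form of IDF, and `relative_shift` is a hypothetical helper of my own, not something taken from any tool.

```python
import math

def idf(n_docs, n_containing):
    """Plain inverse document frequency: log(N / documents containing the term)."""
    return math.log(n_docs / n_containing)

def relative_shift(n_docs, n_containing):
    """Relative drop in idf if one more document happens to contain the term."""
    before = idf(n_docs, n_containing)
    after = idf(n_docs, n_containing + 1)
    return (before - after) / before

# Same starting share (20% of documents contain the term), very different stability.
small = relative_shift(10, 2)         # 10 competitor pages
large = relative_shift(10_000, 2_000) # a large document collection

print(f"10 documents:     idf drops by {small:.1%}")
print(f"10,000 documents: idf drops by {large:.3%}")
```

In the 10-page case, one extra page mentioning the term cuts its IDF by roughly a quarter, while in the large collection the same change is a tiny fraction of a percent.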
Still, there are no real downsides, and the process of focusing on TF-IDF improves your content and broadens its appeal to a larger audience.
TF-IDF advantages include …
- TF-IDF gets you to focus on your content quality relative to your competitors
- It shows you if you are over- or under-using a keyword
- TF-IDF shows you missing relevant keywords
- It is useful for both improving existing content and selecting new topics
Google has at least 200 ranking criteria, so you won’t crush it just by adding TF-IDF. But, like Google’s algorithm, your on-page SEO should draw on a rich set of optimizations, and TF-IDF is a useful tool to help improve your topical relevance and page ranking.