We all know that Search Engine Optimisation (SEO) is not new. It is now a well-established and many-faceted field which has undergone many changes. The changing way that search engines assess and rank pages has led to many of those twists and turns in SEO (have a look at this article on the cost of SEO).
Link-building is another thing which rose to significant prominence. That happened once the importance of backlinks became clear. More technical SEO came to the fore when the informational architecture of any given site was recognised as a crucial factor.
The importance of content then became plain. Search engines made it known that sites should have high quality, relevant content. At first, that led SEO professionals to go running for their keyword density and keyword planning tools. In short order, it became apparent that that wasn’t the best way to go. Or at least, that using those tools for shady practices like keyword stuffing wasn’t going to work. It wouldn’t fool search engine algorithms.
Google and other search engines are looking for actual high-quality content. They reward content which is truly relevant to its supposed subject matter and which answers a user’s desired intent. Creating such content was the main piece of advice from Google for recovering from their famous – or infamous – Medic Update.
As a result, it’s clear that Google and other search engines can accurately assess and evaluate the subject and meaning of content. One way in which they do this is by employing tf-idf. Tf-idf is one of the oldest ranking factors used by search engines. At its simplest level, it allows them to understand what pages are about.
This ultimate guide to tf-idf for SEO will give you all the information you could possibly need. It will cover what tf-idf is and how it works, how tf-idf relates to SEO and how and when you can utilise tf-idf analysis.
Tf-idf is a numerical statistic used in information retrieval. It represents how important a word or phrase is to a given document, as compared to other documents in a collection or ‘corpus’. A tf-idf value increases proportionally to the number of times a word or phrase appears in a document.
That is then offset by the number of times that word or phrase appears across all documents in the corpus. This is important as it adjusts for the fact that some words appear more often in general usage.
Take the example of a search term like ‘the best SEO’. ‘The’ is a word which would appear many times in all documents across a corpus. As a result, it matters less to a tf-idf value if ‘the’ appears in the searched document than if the other, less common words do.
Tf-idf is the product of two statistics, meaning that you multiply one by the other. That is how it represents the importance of a word or phrase and offsets the general frequency of that word or phrase. The two statistics are Term Frequency (tf) and Inverse Document Frequency (idf).
Term frequency is the simpler half of tf-idf. It represents how often a term appears in a given document. All that’s needed to work out term frequency is the word length of the document and the number of times the term appears. You then divide the number of times the word appears by the total word count. That means that term frequency will always be a value between zero and one.
At the simplest possible level, term frequency is worked out in the following way:
TF (Term Frequency) = t (Number of times term appears in a document) / d (total words in document)
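If you’d rather let a script do the counting, the formula above can be sketched in a few lines of Python. This is a minimal illustration, not how any search engine does it: the function name `term_frequency` and the very naive way of splitting text into words are assumptions made for the example.

```python
import re


def term_frequency(term: str, document: str) -> float:
    """Return how often `term` appears, divided by the total word count."""
    # Naive tokenisation (an assumption for this sketch): lowercase the text
    # and keep runs of letters, digits, apostrophes and hyphens as words.
    words = re.findall(r"[a-z0-9'-]+", document.lower())
    if not words:
        return 0.0  # an empty document contains no terms at all
    return words.count(term.lower()) / len(words)


# A made-up 100 word document in which 'keyword' appears three times.
sample_document = "keyword " * 3 + "filler " * 97
sample_tf = term_frequency("keyword", sample_document)
```

Because the count can never exceed the total number of words, the result always falls between zero and one.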
By considering the length of the document and the number of times the term appears, you get a fair idea of how relevant the document is to the given term. You can’t know for sure, though, unless you know how often the term appears in documents in general. That’s where Inverse Document Frequency (idf) comes in.
Words that are used very often across many documents are not good for determining which documents are relevant to a specific search term. Inverse Document Frequency is a statistic which lessens the weight placed on those common terms.
It ensures that if you’re searching for ‘the quick brown fox’, ‘the’ appearing many times in a document will not matter as much as whether the other words are present. The Inverse Document Frequency is a measure of how much information a word or term provides.
The formula for working out idf looks quite complicated:
IDF = log (Nd / fi)
If you break it down into its parts, it’s not that complex.
‘Log’ is the logarithm, a standard mathematical function that you don’t need to understand in depth. You can just press the ‘log’ button on a calculator, which works in base 10, if you ever need to. ‘Nd’ is the number of documents in the collection or corpus being searched. ‘fi’ is the number of those documents that contain the search term.
You get your IDF value, then, by dividing the number of documents by the number of documents with the search term and then applying the log function.
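The same steps can be written as a short Python function. Treat this as a sketch under a couple of assumptions: the name `inverse_document_frequency` is made up for the example, and the logarithm is taken in base 10 because that is what the ‘log’ button on most calculators computes. Other bases are also used in practice and simply scale the result.

```python
import math


def inverse_document_frequency(total_docs: int, docs_with_term: int) -> float:
    """IDF = log10(Nd / fi), following the formula above."""
    if docs_with_term <= 0:
        # The formula divides by fi, so the term must appear somewhere.
        raise ValueError("the term must appear in at least one document")
    return math.log10(total_docs / docs_with_term)


# A term found in every document carries no information: log10(1) is 0.
idf_everywhere = inverse_document_frequency(1_000, 1_000)
# A term found in only a few documents gets a much higher value.
idf_rare = inverse_document_frequency(1_000, 10)
```

The rarer the term is across the corpus, the larger the value, which is exactly the weighting described above.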
We can now take what we’ve learnt and use it for a very simple example. Say you have a 100 word document and you search it for the word ‘keyword’. If that word appears three times, you can work out the term frequency as follows:
3 (number of times the term appears) / 100 (total words) = 0.03
Your term frequency is 0.03. Now say there are a total of ten million documents in the corpus you search and ‘keyword’ appears in 1,000 of them. You now have all you need to work out your idf:
Log(10,000,000 / 1,000) = 4
Your inverse document frequency is 4. A tf-idf value is simply term frequency multiplied by idf, so:
0.03 (tf) x 4 (idf) = 0.12
Your tf-idf value is 0.12. That on its own doesn’t tell you much, but can be compared to other values. The higher the tf-idf value the more significant a term is to the given document. The highest tf-idf values result when there is a high term frequency and a low number of documents featuring the term in a corpus. The following table should help demonstrate this:
Term Frequency (TF) | Corpus Size (Nd) | Documents with Term (fi) | Inverse Document Frequency (IDF)
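To see how the value moves as the inputs change, here is a minimal Python sketch. It reproduces the worked example above and then tries two further scenarios. Those extra scenarios are hypothetical figures chosen purely for illustration, and the helper name `tf_idf` is likewise an assumption.

```python
import math


def tf_idf(term_count: int, total_words: int,
           total_docs: int, docs_with_term: int) -> float:
    """Term frequency multiplied by base-10 inverse document frequency."""
    tf = term_count / total_words
    idf = math.log10(total_docs / docs_with_term)
    return tf * idf


# The worked example: 3 appearances in 100 words,
# and 1,000 out of 10,000,000 documents contain the term.
worked_example = tf_idf(3, 100, 10_000_000, 1_000)

# Hypothetical variations (illustrative numbers only).
common_term = tf_idf(3, 100, 10_000_000, 1_000_000)  # many documents contain it
frequent_term = tf_idf(10, 100, 10_000_000, 1_000)   # it appears more often

for label, value in [("worked example", worked_example),
                     ("common term", common_term),
                     ("frequent term", frequent_term)]:
    print(f"{label}: {value:.2f}")
```

The term found in many documents scores lowest, and the term that appears often in the document but rarely in the corpus scores highest.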
Tf-idf is most often used as part of Latent Semantic Indexing (LSI). This is certainly what directly connects tf-idf and SEO. LSI with tf-idf is a technique for processing language. It allows for the ranking of documents based on relevance to an individual search term or a broader topic area.
LSI works by identifying patterns in the relationships between different phrases and concepts in unstructured collections of text. It is based on the idea that words used in the same contexts tend to have related or similar meanings.
By establishing the patterns between terms and phrases, LSI makes it possible to discern the general topic or subject of a body of text. When LSI with tf-idf is applied to a corpus of documents, a query or search term will return more accurate results.
That’s because the results will include documents conceptually similar in meaning to the search. That will be the case even if the documents don’t contain specific words from the search term. The goal of LSI with tf-idf is to make sense of the actual subjects and focuses of a corpus of documents.
In short, tf-idf when used as part of LSI lets machines understand what pages of text are about. It, therefore, is how Google and other search engines can assess the relevance and usefulness of content.
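The ranking idea can be sketched in plain Python. To be clear about its limits: this is simple tf-idf scoring only, not LSI itself, which additionally applies a singular value decomposition to the term-document matrix. The function names and the three tiny ‘documents’ are assumptions made up for the illustration.

```python
import math
import re


def tokenize(text):
    """Naive tokenisation: lowercase words, keeping apostrophes and hyphens."""
    return re.findall(r"[a-z0-9'-]+", text.lower())


def rank_documents(query, documents):
    """Return (index, score) pairs, best first, scored by summed tf-idf."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    scores = []
    for index, words in enumerate(tokenized):
        score = 0.0
        for term in set(tokenize(query)):
            docs_with_term = sum(1 for other in tokenized if term in other)
            if docs_with_term == 0 or not words:
                continue  # the term is absent from the corpus, or the doc is empty
            tf = words.count(term) / len(words)
            idf = math.log10(n_docs / docs_with_term)
            score += tf * idf
        scores.append((index, score))
    # sorted() is stable, so documents with equal scores keep their order.
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


corpus = [
    "the quick brown fox jumps over the lazy dog",
    "the dog sleeps in the sun",
    "the the the the weather is nice",
]
ranking = rank_documents("the quick brown fox", corpus)
```

Here ‘the’ appears in every document, so its idf is zero and repeating it earns nothing. Only the first document contains the rarer words, so it comes out on top.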
The importance of tf-idf to SEO is certainly becoming clearer. It is one of the earliest search engine ranking factors and can even be seen as a key building block of search engines and SERPs. More importantly, tf-idf helps Google assess the actual relevance and utility of pages as related to any search term or query.
That raises the question of how our better understanding of tf-idf can be used for SEO, whether by a SaaS SEO agency or a small business owner looking to boost organic traffic. A J Ghergich had his say in a SEMrush video on the topic:
‘The overall goal of tf-idf is to statistically measure how important a word is in a collection of documents. It’s like a really useful keyword density tool on steroids.’
That’s a neat little analogy, but it might be a little misleading. Tf-idf analysis is not best used to identify keywords to insert into content. It’s better to think of it as a kind of content inspiration tool.
Using tf-idf to compare your own content to similar pages which rank better can give you suggestions as to how to enrich the content. It will point to keywords and phrases for which the higher ranked content scores better tf-idf values than your pages.
That will show which subject areas and topics your content doesn’t cover in as much detail or as well as similar pages. You then have a roadmap for how to improve your content in a way that Google is sure to like. That is by enhancing its relevance and how well it satisfies the intent of would-be readers who are searching for particular keywords or phrases.
Using tf-idf for SEO is not about keyword density. It moves well beyond that.
Performing a tf-idf analysis does reveal terms and phrases which your content doesn’t deal with as well as other pages. Your next step then is not to start inserting those phrases within your existing content to boost the keyword density. What you want to do is to optimise your content so that it’s more relevant to the topics and subjects surrounding those phrases.
You might, for instance, have a page with SEO as its main topic. A tf-idf analysis may reveal that it has less value for the term ‘link-building’ than other pages which rank highly for SEO searches. That tells you that your content doesn’t give enough relevant, useful information about link-building. As simply as that, you have a definite way of improving your content.
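That kind of comparison can be sketched in Python. It is only a rough illustration under stated assumptions: the function names are invented for the example, the three short ‘pages’ are made up, and the corpus is just those pages, whereas a real analysis would compare against actual top-ranking pages.

```python
import math
import re


def tokenize(text):
    """Naive tokenisation: lowercase words, keeping apostrophes and hyphens."""
    return re.findall(r"[a-z0-9'-]+", text.lower())


def tf_idf_scores(pages):
    """Return one {term: tf-idf} dict per page, using the pages as the corpus."""
    tokenized = [tokenize(page) for page in pages]
    doc_freq = {}
    for words in tokenized:
        for term in set(words):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    results = []
    for words in tokenized:
        scores = {}
        for term in set(words):
            tf = words.count(term) / len(words)
            idf = math.log10(len(tokenized) / doc_freq[term])
            scores[term] = tf * idf
        results.append(scores)
    return results


def content_gaps(my_page, competitor_pages):
    """Terms where a competitor's tf-idf beats my page, largest gap first."""
    scores = tf_idf_scores([my_page] + competitor_pages)
    mine, others = scores[0], scores[1:]
    gaps = {}
    for competitor in others:
        for term, value in competitor.items():
            difference = value - mine.get(term, 0.0)
            if difference > 0:
                gaps[term] = max(gaps.get(term, 0.0), difference)
    return sorted(gaps.items(), key=lambda item: item[1], reverse=True)


# Made-up page texts, for illustration only.
my_page = "seo guide covering keywords and content for seo"
competitors = [
    "seo guide covering link-building and content and link-building outreach",
    "seo guide on link-building tactics and technical seo",
]
gaps = dict(content_gaps(my_page, competitors))
```

With these sample texts, ‘link-building’ shows up as a gap because both competitor pages cover it and yours doesn’t. ‘seo’ does not, since every page mentions it and its idf is therefore zero. The output is a list of topics to cover properly, not a list of words to paste in.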
Before you can think about improving your content, you need to know how to perform a tf-idf analysis. Let’s deal with that right now.
It is technically possible to run a tf-idf analysis by hand, performing your own calculations. Whilst possible, it is not advised. As you’ve already seen, calculations can get a little complicated and will always take time.
That’s not even the biggest problem. A tf-idf analysis is only worthwhile if the corpus you compare content against is relevant and useful. You want to be able to compare your content’s tf-idf values against other pages which rate well for your important keywords. That’s where a tf-idf tool, such as that offered by Ryte, comes in.
Ryte’s tool can compare a live URL from your site with the top ten Google search results for a given keyword or search query. It will then provide a list of important related terms and phrases for which the highly ranked content has a high tf-idf value.
On top of that, Ryte’s tool will also rate your chosen URL against those phrases and terms. It will show whether your content has as high, higher or lower tf-idf values for each of them.
That information will show you where and how your content needs to improve. It will give you the topics and subjects which your page doesn’t cover efficiently enough. You will, therefore, be able to tailor the page to better suit the needs and intents of its readers.
You’re probably now wondering when you ought to use tf-idf analysis. There are plenty of other things that also need doing, after all, within the field of SEO and outside.
There is never a bad time to think about improving your site’s content. There are also only so many hours in the day. That means that it’s best to implement tf-idf analysis in the circumstances where it’s most likely to make a difference. There are a handful of examples of just such circumstances:
Tf-idf can be really useful if you have a page that consistently ranks on the second page of Google searches. Having reached so high in the rankings, the page clearly has potential. A tf-idf analysis can help you to work out the exact tweaks and additions you need to make that last leap up onto page one.
A tf-idf analysis is superb as inspiration for content. Performing an analysis on pages ranking well for certain subjects and topics will show you what your own content needs to cover. That can be a great basis for sketching out a plan for a whole host of new content.
If you’ve got a page that used to be a top performer but that’s slipping down the rankings for important keywords, tf-idf can help there too. It can show you for which keywords and topics the pages overtaking yours are achieving better tf-idf values. You can then improve and update your own content accordingly.
There’s so much to consider in the modern world of SEO. Site architecture, links, keyword densities and all those other traditional elements remain crucial. It can be argued, however, that content is now king. Or at the very least that it needs to be given as much attention as any of those other factors.
No longer can sites get away with keyword stuffing or with filling pages with duplicate or hidden spam content. Sites need to contain high quality content which is genuinely useful to readers. Tf-idf is a major way in which Google and other search engines assess content in that regard.
It’s crucial, therefore, to understand how tf-idf works and how it relates to SEO. A proper understanding and implementation of tf-idf for SEO can help you to enrich your content and see the rewards in your organic traffic.
Nick Brown is the founder & CEO of accelerate agency, a SaaS SEO agency. Nick has launched several successful online businesses, writes for Forbes, has published a book and has grown accelerate from a UK agency to a company that now operates across the US, APAC and EMEA and employs 160 people. He was also once charged at by a mountain gorilla.