Archive for the ‘SEOmoz’ Category
Build Your Own Weighted Sort (GA Style)
Posted by Dr. Pete
If you’re a Google Analytics fan, you probably already know that Google released a new and incredibly useful feature called Weighted Sort. If you haven’t seen it, here’s a quick example – let’s say you want to know which of your referring sites have the highest bounce rate. You could pull up your referrers, sort by bounce rate, and get something like this:

Fascinating, right? I now know that I lost 7 visitors due to 5 sites. If I could just get that bounce rate down to 60%, I’d have 3 more visitors. Wow. What did you really want to know, intuitively? Probably something more like this:

That’s better – it’s not the absolute highest bounce rate you wanted to know about, but the most important high bounce rate referrers. In a nutshell, that’s the question weighted sort tries to answer.
How It Works
So, how does weighted sort work, exactly? Avinash Kaushik wrote a fascinating and very transparent post on the method behind Google’s weighted sort algorithm. I encourage you to read his post and I don’t want to copy it, but I’ll try to do a very basic review here.
Google uses something called the "Estimated True Value" (ETV). ETV essentially says this – if the count column of the sort (in this case, Visits) is very low, assume that the column of interest (Bounce Rate) is roughly the average for the data in question. In other words, if a row has 1 visit and the average bounce rate is 75%, then set the ETV of bounce rate for that row to 75%. Since 1 visit isn’t enough, statistically speaking, to draw any real conclusions, we’ll essentially ignore it.
On the other end of the spectrum, if you have a very high visit value, assume the bounce rate is accurate as is. Simple enough, right? What about values in the middle? Well, Google sets the ETV somewhere in between the average and the row’s bounce rate. Exactly how much of each they use is the tricky part.
The Equation
This is where Avinash’s post ends and mine really begins. I should warn you – it’s not going to get Ben-complicated, but there is going to be some math. After a bout of 4am insomnia, I pieced together a simplified weighted sort equation. I’m going to present it first, explain it, and then provide an Excel spreadsheet with some real-life examples.
Let’s assume we’ve got a data set exactly like above – visit counts and bounce rates for a set of referring sites. We’re going to need 4 sets of variables:
- V = Visits for Row X
- B = Bounce Rate for Row X
- MV = Max Visits for the data set
- AB = Average (mean) Bounce Rate for the data set
For any given row, the ETV of Bounce Rate – ETV(B) – can be represented by the following equation:
ETV(B) = (V / MV * B) + ((1 – (V / MV)) * AB)
Crystal clear, right? It’s not really as bad as it looks. Let’s take an example – say we have the following data (same 4 variables as above):
- V = 100
- B = 80%
- MV = 500
- AB = 60%
The ETV(B) will consist of two components:
- V / MV * B = 100 / 500 * 0.80 = 0.20 * 0.80 = 0.16
- (1 – (V / MV)) * AB = (1 – (100 / 500)) * 0.60 = 0.80 * 0.60 = 0.48
- ETV(B) = 0.16 + 0.48 = 0.64
Pay attention to the 0.20 and 0.80 weights – since 100 visits is 20% of the max visits for this data set, this row gets 20% of its bounce rate from the actual value and the rest (80%) from the average value for the data set. So, essentially, how much we use the "real" bounce rate for the row is a function of the proportion of that row’s visit value to the visit value of the top referrer.
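For readers who would rather see the equation as code, here is a minimal Python sketch of the basic ETV formula. The function and parameter names are my own, chosen for illustration; they do not come from the spreadsheet.

```python
def etv(visits, bounce_rate, max_visits, avg_bounce_rate):
    """Basic ETV: blend a row's bounce rate with the data-set average.

    The blend weight is the row's share of the largest visit count (V / MV).
    Rates are fractions, e.g. 0.80 for 80%.
    """
    if max_visits <= 0:
        raise ValueError("max_visits must be positive")
    weight = visits / max_visits
    return weight * bounce_rate + (1 - weight) * avg_bounce_rate


# The worked example from the text: V = 100, B = 80%, MV = 500, AB = 60%
print(round(etv(100, 0.80, 500, 0.60), 2))  # 0.64
```

To sort a report the weighted way, compute this value for every row and sort by it instead of the raw bounce rate.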
Build Your Own
Want to try it yourself? You can download my Excel spreadsheet and see the formula at work across a larger data set of actual referring visits from my own site. Although this replicates a function you already have in Google Analytics, it can be used for all sorts of applications that you don’t have in GA, including PPC metrics (Visits by Quality Score, for example).
There are actually four sheets in the Excel workbook:
- Basic ETV formula
- Google’s ETV sort
- Weighted ETV formula
- Log-based ETV formula
Those last two require a bit of explaining. In my very simple model (1), I calculate the average bounce rate by just taking an average across all the rows (for this data set = 70.6%). The thing is, that’s not how Google calculates the average bounce rate. They actually weight it by the number of visits, which makes perfect sense. So, in Google Analytics, my bounce rate for this data set is 74.6%, which is what (3) shows. If you compare (2) to (3), you’ll see that my weighted formula only differs in the Top 10 by rows #8 and #9 being swapped.
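The difference between the two averages is easy to see in a few lines of Python. This is only a sketch: the helper names and the three-row data set are made up for illustration and are not the numbers from my spreadsheet.

```python
def simple_average(rates):
    """Unweighted mean: every row counts the same, as in sheet (1)."""
    return sum(rates) / len(rates)


def visit_weighted_average(visits, rates):
    """Mean weighted by visits: rows with more traffic count for more, as in sheet (3)."""
    total_visits = sum(visits)
    if total_visits == 0:
        raise ValueError("need at least one visit")
    return sum(v * r for v, r in zip(visits, rates)) / total_visits


# Hypothetical data: two one-visit rows that bounced, one busy row at 50%
visits = [1, 1, 98]
rates = [1.0, 1.0, 0.5]

print(round(simple_average(rates), 3))                  # 0.833
print(round(visit_weighted_average(visits, rates), 3))  # 0.51
```

The two tiny rows drag the simple average up to 83%, while the visit-weighted average stays close to the busy row, which is the behavior you want from the "average" half of the ETV blend.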
My approach is a pretty good approximation for this data set, but it’s still just an approximation. If you have a very large range of visit values (1 to 100,000), you might find that rows with smaller but still interesting counts (1,000+) get unfairly ignored. Sheet (4) is a more complex formula that uses the Log (base 2) of visits instead of the raw visit value. This has the effect of de-emphasizing the visit count in favor of the "real" bounce rate for that row.
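Here is one way the log-based idea could look in Python. Treat it as an assumption rather than a copy of sheet (4): I am simply swapping the raw ratio V / MV for the ratio of base-2 logs, and the function name is hypothetical.

```python
import math


def log_etv(visits, bounce_rate, max_visits, avg_bounce_rate):
    """ETV variant whose blend weight is log2(V) / log2(MV) instead of V / MV.

    Assumed form for illustration only. Requires visits >= 1 and max_visits >= 2
    so that both logarithms are defined and the denominator is not zero.
    """
    if visits < 1 or max_visits < 2:
        raise ValueError("need visits >= 1 and max_visits >= 2")
    weight = min(math.log2(visits) / math.log2(max_visits), 1.0)
    return weight * bounce_rate + (1 - weight) * avg_bounce_rate


# A row with 1,000 visits when the top row has 100,000:
# the raw ratio would give a weight of 0.01, the log ratio gives about 0.6
print(round(log_etv(1000, 0.80, 100000, 0.60), 2))  # 0.72
```

With the raw ratio this row would land at roughly 60% (almost pure average); with the log ratio it keeps most of its own 80% bounce rate.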
If you’re still with me at this point, I hope you’ll play around with the spreadsheet. If you find issues with your own data sets or discover some better/cooler way of doing it, please share it in the comments.
Link Building 101 – The Almost Complete Link Guide
Posted by scott.mclay
This post was originally in YOUmoz, and was promoted to the main blog because it provides great value and interest to our community. The author’s views are entirely his or her own and may not reflect the views of SEOmoz, Inc.
A lot has changed since I got into link building a few years ago – link exchange is dead, ad banners are no longer all about gaining referral traffic, and buying links is more dangerous than ever before. Because of the changes mentioned and a whole load of others the majority of link builders don’t like to give away their secrets to sourcing links, even though it’s all pretty much the same at most agencies.
Most of the advice I will be giving throughout this post is most likely available from a large number of sources including SEOmoz but I felt it would be great to bring everything together under one simple guide.
Creating your link building strategy
Before building any links to a website it is important to ask yourself a few questions like:
- What kind of links do you need?
- Do you need nofollow and branded links?
- Do you have a wide enough range of anchor text and landing pages?
Sadly every strategy is different and people can’t answer these questions for you but hopefully you can use the following information to help answer the questions.
Link Placement
In recent times Google has started placing value in link placement, with in-content links passing the most value and footer links passing the least. A good link profile should still make good use of every link type, though, as going out and sourcing only in-content links would be a very big sign of an unnatural link profile.
Nofollow & Dofollow
A lot of people go out and source only dofollow links, but in doing this they do more harm than good to their link profile. Every website should have a good balance of nofollow links – there have been cases where sites with a very low number of nofollow links have not ranked as highly as others who keep a good balance.
Branded Links
I believe Domain Authority and Domain Trust make up a fairly large chunk of the ranking algorithm. Even though there are loads of factors in measuring these attributes, one good sign of both is having a good number of brand based anchor text pointing to your website. Some people make the mistake of only building branded terms to the homepage, when in fact there is more value in building links using these terms to landing pages throughout your website.
Anchor Text & Landing Page Distribution
When working on a link building campaign, it is important to work on a wide range of landing pages, using a variety of anchor text for each. Working on a small keyword / landing page set can upset the balance of a website and can have a very negative impact long term.
Content Relevancy
Since the Google May Day update this year, relevancy seems to play a larger role in the ranking factors. Even though the days of keyword stuffing are over, there is still a need to reference your keywords within your content, header tags, URI structure and title tags. Content may not be king but it is one of the keys to a successful link building campaign.
Sourcing Links
Once you have your link building strategy done and dusted the next step is to find suitable websites to source links from. There are many techniques that can be used for this job, some of which rely on tools and others that use manual search queries.
Link Building Tools
If you plan to use link building tools then chances are you will be looking at links going to competitors’ websites. This is one of the best ways to start a link building campaign and can lead to positive results. Some of the best tools for this job are:
- Open Site Explorer (Free / SEOmoz Pro Members)
- Competitive Link Research Tool (SEOmoz Pro Members only)
- Majestic SEO (Paid)
- Yahoo Site Explorer + SEO Quake Plugin (Free)
Manual Search Queries
It is said that it’s not the links your competitors have that will give you the edge but the links the competitors don’t have. To find these you will need to find link opportunities using manual search queries – the best way to do this is by using advanced search operators.
Advanced search operators are not as complicated as they sound but if used correctly they can provide a very nice set of search results. An introduction to advanced search operators can be found here and a short introduction can also be found on my personal blog under the post finding the links that matter.
One search string I would recommend when looking for suitable blogs for most niches is:
[search term] -site:Wikipedia.org -site:blogspot.com -site:telegraph.co.uk -site:wordpress.com -site:about.com -site:nationalgeographic.com -site:guardian.co.uk -"directory" -"add link" -"advertising"
Depending on your niche other domains can also be stripped from the results.
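If you build these strings often, a small helper saves some typing. The Python sketch below is just one way to do it; the function name and the example arguments are hypothetical, and the operators are the same -site: and quoted-phrase exclusions used in the string above.

```python
def build_query(search_term, excluded_sites, excluded_phrases):
    """Assemble a search string that strips given domains and phrases from the results."""
    parts = [search_term]
    parts += [f"-site:{site}" for site in excluded_sites]
    parts += [f'-"{phrase}"' for phrase in excluded_phrases]
    return " ".join(parts)


query = build_query(
    "curtain fabric",
    ["wikipedia.org", "blogspot.com", "wordpress.com"],
    ["directory", "add link", "advertising"],
)
print(query)
```

Keeping the excluded domains in a list makes it easy to add niche-specific ones per client.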
Directory Submissions
Directory submission is the most boring and repetitive job, but sadly it is an important task in any link building campaign. These links make up the numbers when it comes to branded links. Submit to the right directories and they will improve your overall domain authority.
Blog and Forum Commenting
Blog and forum commenting for links is seen as spam due to many people taking advantage of unprotected blogs and forums. If blog and forum commenting is part of your outlined strategy then some effort should be put into them.
The best way to act on this kind of link building is by using Google’s blog search to find the most recent articles published within your niche then make a genuine comment based on the content of the article, using the Name field for branded anchor text. This type of link building is best for increasing the number of nofollow links to your website.
Guest Blogging Communities
Guest blogging is a great place to find blogs within your niche, but instead of offering to do a guest post why not offer to write a few pages (I say pages as they are linked to via the top navigation) of content for them? After all, these people want content, and being able to source multiple pages not only saves time but can also lead to Google seeing the links as trustworthy. Just remember to link out to authority sites within your niche as well.
Widgets & Theme Designs
There has been a lot of talk about creating widgets to increase the number of natural user generated links, which does work, but the widget you create does have to be unique and worth having so there isn’t a gap for this in every niche.
Another way to increase the number of user generated links is by creating a WordPress theme. A lot of people have said there is low value in this, but if the theme is good enough it can generate 40k+ links (from previous experience). If you wish to go down this route the best way to market it is via your monthly newsletters: just put in a small section about it and wait for results, but remember to also submit it to theme hubs around the web for additional exposure.
Link to Us Pages
Link to us pages are not only great for increasing the number of user generated links but great for masking other link building activities. I would suggest having a link to us page displaying all the branded terms used within your campaign and have different types of links for each (Banner Ad, Contextual Ad, Text Link).
Competitions
If your client is running competitions, contact bloggers in your niche and ask politely if they would blog about it. Although getting targeted anchor text through this tactic is harder, it can help build the number of generic keywords linking to your domain.
Contacting Webmasters
Making contact with webmasters is one of the most difficult jobs – just about every email sent out needs to be personalised and in some cases contact is needed via social media before an email has been sent.
When sending an email to a webmaster, remember they are a real person just like you, so ask yourself a few simple questions before drafting:
- If you were the webmaster what would persuade you to link out?
- Would you rather a relationship was formed before receiving a link request?
- Should the email be from an SEO’s point of view or would it be better keeping it simple and to the point?
Tracking Progress
Tracking the progress of your link building campaign is something that needs to be done. This can be done in a variety of ways, but the best solution I have found is using Raven Tools for overall tracking of performance and using an Excel document to keep a list of links built containing metrics such as Page Rank, mozRank and Domain Authority.
Having a list of metrics for each link enables you to display a variety of information relating to your link building campaign which helps when generating reports for your clients.
Conclusion
Although link building is a tough task in itself, if you plan your strategy properly, build the correct links and track the progress of your strategy, the job will become easier over time and you will begin to see what works and what doesn’t for your client.
Just remember every link building campaign is different, even if you deal with clients within the same niche, as each website has a different infrastructure and domain history.
If you enjoyed this post then why not visit my Blog or follow me on Twitter.
What is Mobile Search Engine Transcoding?
Posted by Suzzicks
Ok, in the mobile world, it is important to understand that Google sometimes lies (Uhhh! Say it ain’t so!) Actually, all of the major search engines do it with mobile results – It is called ‘transcoding.’ In some cases, the search engines will want to rank a particular page in mobile results, but they know they shouldn’t because they can tell that it will be a bad mobile user experience. (Usually because the file size is too big, or the page has lots of mobile-unfriendly code like Flash or loads of JavaScript).
When this happens, the search engine will show the full search engine listing for the mobile-unfriendly page (like normal), but when you click on it, they will automatically take you to a temporary url that represents a ‘transcoded’ version of the page you requested, (rather than delivering you to the actual page listed in the search results). This temporary transcoded page actually lives on a subdomain hosted by the search engine, and shows a scraped version of the page you requested. The scrape usually just shows the text and small images of the page, but omits anything that might cause problems for a mobile browser; sometimes this can include background images, big images, animations, videos, iFrames, and heavy/complex code.
You Might Want Transcoding, but Probably Not
If you have totally ignored the mobile web, transcoding can be a good thing, because it allows you to rank in mobile results when you otherwise might be omitted. (Ranking with transcoding is better than not ranking at all). Unfortunately, none of the search engines do a stellar job with their transcoding. In Google, pages that are transcoded usually closely resemble the ‘text-only’ version of the page that Google keeps in its cache. In some cases though, the transcoding can really mess up a page, missing core navigation, breaking long pages into multiple pages at odd places, or cutting out important sections.
Remember that the search engine use of transcoded pages differs from phone to phone, so just because pages are not being automatically transcoded from search results on your phone does not mean that they are never being transcoded by the mobile search engines. The less sophisticated a mobile browser is, the more likely the search engine is to transcode a page; based on my experience, this is happening mostly on BlackBerrys and Windows Mobile devices. To see what a page looks like when Google transcodes it, there are two options:
1.) You can perform a search on a mobile phone, then click the ‘options’ button to the right of one of your results in the SERP, and then select ‘Mobile formatted.’ (Illustrated below)
2.) You can also put your url into Google’s tool, here: http://www.google.com/gwt/n? from your computer or your mobile phone.
The image below shows what Realtor.com looks like when it is transcoded by Google, and it is obviously not a great experience. You can see, in this instance, that two header images are missing, including the logo. It also turns the JavaScript navigation into text links that are a bit squished together, and hard to understand (Find a Home Home Finance Home & Garden). Last, since the transcoding software can’t render JavaScript, it has been served an error message, telling it to turn on JavaScript.
Preventing Transcoding
If you are pretty confident in your mobile site rendering, you can include the ‘no-transform’ cache control in the headers of your template, and that will usually prevent your pages from being transcoded by the search engines, but it is not 100% reliable. The good news is that with faster network connections and better mobile browsers, transcoding by the search engines is becoming much less common. The important take-away here is to at least test to see what your pages look like when they are being transcoded (even if you have a no-transform cache control in place). In many cases, minor on-page code tweaks can make the transcoded experience much more user-friendly and palatable, improving your ability to reach the widest range of mobile customers, regardless of the phone they are searching from.
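How you send the header depends entirely on your server or CMS. As a rough illustration only, here is what it might look like in a bare-bones Python WSGI application; the app itself and the page body are hypothetical, and the one part that matters is the Cache-Control: no-transform response header.

```python
def app(environ, start_response):
    """Minimal WSGI app that asks intermediaries not to transform the response."""
    body = b"<html><body>Mobile-friendly page</body></html>"
    headers = [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
        # Standard HTTP directive asking proxies and transcoders to leave the page alone
        ("Cache-Control", "no-transform"),
    ]
    start_response("200 OK", headers)
    return [body]

# To try it locally, serve `app` with any WSGI server, for example
# wsgiref.simple_server.make_server("", 8000, app).serve_forever()
```

If you already send other cache directives, no-transform can sit alongside them in the same header, separated by commas.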
SEO Strategy: Predicting Yearly Site Traffic
Posted by Kate Morris
As a consultant, I work with many In-House SEO teams on strategy and other issues that arise throughout the course of the year. One trend we are seeing is that these In-House teams are having a hard time coming up with accurate traffic-centered goals. Traffic is the basis for many other metrics, so being able to semi-accurately predict that number in the coming year is important for every business.
I can hear you all now, "Well there is the Google Keyword Tool … use that." Typically, that is my answer too, but there have been major questions about the accuracy of Google’s keyword tools and others available to webmasters, marketers, and search engine optimization teams.
(If you will comment with your favorite keyword tool other than those I mention, I’ll happily test and add it here!)
The Google Keyword Tools (yes, plural)
There was a shift recently with the Google Keyword Tool. The Legacy/API version is showing different numbers than the newest Beta interface. David Whitehouse and Richard Baxter both noticed this shift as well and did a few tests on accuracy. The jury is still out as to which version is more accurate, the legacy or the new keyword tool. Like Mr. Whitehouse, I believe the newer tool is the updated one, but that does not make it more accurate.
To be clear, when I speak of the Legacy, API, and Beta tools, I do mean different versions of the Google Keyword Tool. First, from what I can see using the SEOmoz Keyword Difficulty tool, the Google API pulls from the Legacy tool, so they are one and the same. The Legacy tool is the prior interface for the current Beta version of the Keyword Tool. We had previously assumed that these pulled the same numbers, but my research and that of others proves otherwise.
But wait! *infomercial voice* There is more!
There is also the Search-based Keyword Tool that aids AdWords advertisers in choosing relevant keywords based on search behavior and a specified website. This tool is explained by Google here and gives more in-depth information on account organization and cost.
But even this tool is not on par with the other two when it comes to impressions. A random query in the Search-based tool returned a suggestion for the keyword "maragogi." The Search-Based tool says there should be 12,000 monthly searches. The Legacy tool returns 110 Local Exact match searches, 33,100 Global Exact match, and 201,000 Global Broad match. The new tool returns information only for a global setting (all countries, all languages). That returns 74,000 searches broad and phrase match, and 12,100 for exact match. It seems like the Search-based tool is more like the exact global match in this one instance. But what is a business supposed to do with all of these numbers?!?!?
(hint: always use exact match)
Back to Strategy
If these tools are possibly inaccurate, how do our clients go about setting their yearly strategy goals?
Put simply, in search, you never want to rely on one set of results or one ranking report. Data over time and from many sources is best. But with the lack of tools out there and Google bringing in at least 65% of traffic organically for most sites, how do you get the best numbers?
Impressions
First, you need to start out by figuring out how many impressions a keyword or set of keywords can bring in on average for a specific month. If you are in a cyclical industry, this will have to be done per month of the calendar year.
1. Pull from both Google Tools and other Keyword Tools
Below is a look at some information I pulled using the tools mentioned for the key phrase "curtain fabric."

The idea here is that if you take into account all of the numbers out there, you might see a trend that you can use for estimating future traffic. If there is no trend, then a median of the numbers can be used as your metric. A few other tools that you might look into include Word Tracker and Keyword Spy. You can see that the numbers are all over the place, but looking at these figures, I’d guess that the keyword might bring in around 6,500 impressions a month in the UK.
The downside is that WordTracker and KeywordSpy don’t allow you to look at exact match information versus broad match. When performing keyword research, you always want to look at the local (targeted to your country) exact match information. Too many people pull keyword information using broad match and get inflated numbers for all phrases related to that key phrase.
2. Run a PPC campaign if possible.
The absolute best way to get accurate numbers about traffic over time is to run a PPC campaign. I pulled some numbers from a few campaigns (for our client’s sake we have masked a number of the actual key phrases) in an attempt to see if the new keyword tool’s numbers match actual traffic in August. The keywords pulled were all exact match in the campaign and the information pulled from the keyword tool was Local Exact and set to the country that the campaign was targeting.

As you can see, some of these are higher and some lower. What I found is that there really is no definitive answer as to whether the Google Keyword Tool is accurate. Take a look at the results for the example I used before, curtain fabric. The campaign saw 11,389 impressions, much higher than the new keyword tool, and lower than some other keyword tools. This is why a well-run PPC campaign is important if you want to get a more accurate look at impression numbers.
Please note that I didn’t get a chance to ensure that these accounts were all showing at all times during the month, but they were all accurately geo-targeted and all showed on the top of the first page on average.
Finding Traffic Based on Rank
After getting a good idea of the number of impressions, you then need to take into account where you are showing for that keyword on average organically (aka your rank). While we cannot know specific click through numbers for every search done on the web, there have been some studies done on how many of those impressions the top organic result gets, the second and so on. The one I use most often is from Chitika. Using the percent of the traffic below and the impression numbers, you should be able to get a good idea of the visitors you can expect per month organically for a specific key phrase.

So using the "curtain fabric" example, assuming that the site I am working on has maintained an average ranking over the last few months of #3 organically, I could expect about 1300 visits from Google for the keyword in a month (11.42% of 11,389 impressions).
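The arithmetic is simple enough to script if you have a long keyword list. A small Python sketch follows; the function name is hypothetical, and the only figures used are the ones quoted above (11,389 impressions and an 11.42% share for position #3).

```python
def estimated_visits(monthly_impressions, click_share):
    """Expected monthly visits = impressions x share of clicks at your average rank.

    click_share is a fraction, e.g. 0.1142 for 11.42%.
    """
    if not 0 <= click_share <= 1:
        raise ValueError("click_share must be between 0 and 1")
    return monthly_impressions * click_share


# "curtain fabric": 11,389 impressions, average organic rank #3 (11.42%)
print(round(estimated_visits(11389, 0.1142)))  # 1301, i.e. about 1,300
```

Swap in the share for whatever position you actually hold, taken from the study you trust most.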
Past Metrics
Once you get everything figured out, keep in mind that your past metrics are another good way of seeing how close you are to getting the traffic about right. Assuming that no major changes have occurred (like lack of metrics data in the last year), a look back is the most accurate way to understand traffic flow and trending on your site. Pull the unique visitors for every month of the last year and do some analysis on percent increase month over month. This can be done on any level in most analytics programs - overall traffic trends all the way down to the keyword level.
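If you export the monthly numbers, the month-over-month analysis takes only a few lines. The Python sketch below uses made-up visitor counts purely for illustration, and the function name is my own.

```python
def month_over_month(visitors):
    """Percent change between consecutive months; None where the previous month was 0."""
    changes = []
    for previous, current in zip(visitors, visitors[1:]):
        if previous == 0:
            changes.append(None)  # growth from zero is undefined
        else:
            changes.append((current - previous) * 100 / previous)
    return changes


# Hypothetical unique visitors for three consecutive months
print(month_over_month([1000, 1100, 990]))  # [10.0, -10.0]
```

Run it on overall traffic first, then on individual keywords, and compare the pattern with the estimate you built from impressions and rank.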
A look at overall traffic per month in Google Analytics for organic searches from Google:

A look at traffic for a specific keyword over the last year per month from Google organic:
Educated Guesses
In the end though, predictions are just that, educated guesses. Pulling data from all available sources and using your own historical data can assist in making an educated prediction for the next year. Keep in mind though that things never stay the same. Google Instant just proved that with one of the biggest changes we have seen in a while.
Exportable PDF Reports Now in the Web App
Posted by randfish
Since the launch of our beta web app, the #1 feature request from folks writing in to us has been the inclusion of exportable, PDF reports. Today, I heard from our engineering and product teams that these are complete!

You’ll find CSV and PDF exports on most of the reports in the web app from here forward
The new PDF reports include:
- Rankings Overview
- Rankings History
- Crawl Diagnostic Issue Detail (the Overview tab is still in progress)
- On-Page Overview
- On-Page Report Card
Here’s a sample view of a few reports from accounts I’ve created:

On-page report card in PDF form

Weekly rankings report in PDF form
In addition to PDF, since launch, the web app has offered CSV export for nearly all the sections included.
While we’re excited to make this available, we know we still have a lot of work to do on the web app – a few crawls are still giving us trouble, we’re refining some of the errors, warnings, notices and recommendations it issues, and there are plenty of big features on their way. A big thanks to everyone who’s trying out the web app today – if you have requests or issues for us, please do use the feedback tab on the side of every page (these go direct to product & engineering).
If you’re curious about upcoming features, you can see more in this blog post.
p.s. One of the other features that’s been heavily requested is white labeling. That’s probably a few more months away, but we definitely appreciate (and are flattered by) the desire to add your own logos/branding to the reports.
Discussing LDA and SEO – Whiteboard Friday
Posted by Danny Dover
In this week’s Whiteboard Friday Rand Fishkin and Ben Hendrickson discuss LDA (Latent Dirichlet Allocation) and SEO (Search Engine Optimization). There has been a lot of discussion about the relationship between these two topics lately and this video answers many of the questions people in the community have been asking. It is comprehensive (25 minutes) and uses many easy to understand diagrams and examples to discuss what impact LDA may have on the SEO industry. We look forward to reading your comments below.
Video Transcription
Rand: Howdy, SEOmoz fans. Welcome to another edition of Whiteboard Friday. Today, I am joined by Ben Hendrickson. Ben?
Ben: Hello. We’ve met before.
Rand: Have we really?
Ben: I think so.
Rand: So, Ben is our senior scientist here at SEOmoz. He does a lot of our research work and has been working on some interesting projects. Lately, we posted about one of those projects and asked for some feedback and got some great responses. A lot of people are very passionate, very excited. And some people are a little confused. So, we wanted to dive deeper with this LDA stuff. What’s LDA, Latent Dirichlet Allocation. We wanted to talk about topic modeling in general. There was some feedback, right, and I am sure you saw some of it too, that was like, "I’m not quite sure. You’re saying on-page maybe is more important because of this LDA stuff, and I always thought on-page just meant keyword density or stuffing your keywords."
Ben: Yeah. Clearly words used matter. For any given SERP, a huge number of links aren’t going to rank for it because they have nothing to do about it because they never use the word at all. Right? I mean, Google.com ranks a very few things and it has a ton of links. So, of course, words matter that are on the page.
Rand: But we’ve always, as an SEO, even when you’ve done your previous research, it was sort of like, boy, it sure does look like links are a whole lot more important than . . .
Ben: Using the keyword in the title box. Right. Yeah. So this was something that actually was very surprising for us, which is why we showed it. What was that? It seems like using other sort of related words to the query in a very specific way seemed to help a lot. Right?
Rand: And we were kind of weirded out by that.
Ben: Yeah.
Rand: Or we were at least surprised by that. So, that is why we are sharing it. So, let’s go back in time a little bit and talk about this whole . . . for people who are kind of going, "I don’t understand what you mean when you say it’s more sophisticated than keyword density, or it’s more sophisticated than a normal keyword metric or keyword usage." Keyword density is just like the percent of times that the word is used out of all the words in a document.
Ben: Yeah.
Rand: Super simple to game. Kind of useless for IR is my understanding.
Ben: Well, I mean, it gets you a lot of the way. I mean, at least you have
that word in the document you return to people. But, like your blog
post earlier in the week showed, there is a lot of basic situations
where you can’t tell what is the better content just by doing this.Rand: Right. And so, IR folks in the ’60s came up with this TF-IDF thing,
which is essentially like looking at whether the terms that are
being used are more frequent in the corpus as a whole. So, if you
are like a library, they look at all the books in the library. Or if
you are a card catalogue, they’ll look at all that. And now that
there are search engines, they look at all of the documents on the
Web.Ben: Yeah, right. So, the big intuition here is that they are searching
for multiple words. The word that is rarely ever used is the one
that actually matters the most. So, if you are searching for the
SEOmoz building, a document that includes a building and SEOmoz is
probably very relevant. A document that contains "the building" or
"the SEOmoz" is a lot less relevant. So, the basic story there is
that you are biased against caring about words that are very common.
Rand: Right. So I like your Lady Gaga example where you’re like, well,
documents that have Gaga on them are probably way more relevant than
those that just have lady on them, even though lady and Gaga are
both four letter words in the phrase.
Ben: Yeah, exactly.
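The Lady Gaga intuition can be sketched with a textbook form of TF-IDF. This is only an illustration: the five-document corpus is invented, the function names are hypothetical, and the log(N / document frequency) weighting is one common variant, not something the speakers specify.

```python
import math
from collections import Counter


def idf(term, docs):
    """Inverse document frequency: log(N / number of docs containing the term)."""
    containing = sum(1 for doc in docs if term in doc)
    # Assumption for this sketch: a term that never appears gets weight 0.
    return math.log(len(docs) / containing) if containing else 0.0


def tf_idf_score(query, doc, docs):
    """Sum over query terms of (term frequency in doc) * (idf in corpus)."""
    counts = Counter(doc)
    return sum((counts[term] / len(doc)) * idf(term, docs) for term in query)


# Made-up corpus: "lady" is common, "gaga" is rare.
corpus = [
    "lady gaga releases a new album".split(),
    "the lady in the red dress".split(),
    "a lady and her dog".split(),
    "old lady walks home".split(),
    "weather report for seattle".split(),
]

query = ["lady", "gaga"]
print(idf("lady", corpus), idf("gaga", corpus))
for doc in corpus:
    print(round(tf_idf_score(query, doc, corpus), 3), " ".join(doc))
```

Because "gaga" appears in only one document, it carries most of the weight, so the page that mentions it outscores pages that only say "lady".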
Rand: All right, cool. So we evolved to this TF-IDF stuff. And then there
is this like co-occurrence thing, which we talked about on the
SEOmoz blog a long time ago. Co-occurrence is kind of interesting
where we look at, and let me make sure I am getting this right. It
is essentially that, oh well, oftentimes when I see, for example,
Distilled Consulting and building and SEOmoz and building, I find
those frequently together because it turns out that we share offices
with Distilled and we do lots of work together and those kinds of
things. So, maybe a document that has both Distilled and building
and SEOmoz might be more relevant than just the one that just says
SEOmoz.
Ben: Exactly. Right. So, if you are trying to basically figure out if it’s
just an offhand reference to it or if it’s something that is
actually valid a whole lot, right, the fact that it is using a whole
lot of other words that also occur with the keyword would be a good
indication of that.
Rand: But then topic modeling, I think that even I get a little bit
confused when I think about topic modeling versus co-occurrence,
because it seems like topic modeling is maybe very similar to this.
Ben: Well, this is great because you drew a Venn diagram that shows the
difference really well.
Rand: Right. Super smart of me.
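For readers who want to see what plain co-occurrence counting looks like before the topic modeling discussion continues, here is a small sketch. The documents, the helper names, and the choice to count at the document level (rather than within a sliding window) are all assumptions for illustration.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(docs):
    """For each unordered word pair, count the documents that contain both words."""
    pairs = Counter()
    for doc in docs:
        # sorted(set(...)) gives each pair once per document, in a stable order.
        for a, b in combinations(sorted(set(doc)), 2):
            pairs[(a, b)] += 1
    return pairs


def together(pairs, a, b):
    """Look up a pair regardless of the order the words are given in."""
    return pairs[tuple(sorted((a, b)))]


# Invented mini-corpus echoing the SEOmoz / Distilled example.
docs = [
    ["seomoz", "distilled", "building"],
    ["seomoz", "distilled", "office"],
    ["seomoz", "building"],
    ["distilled", "consulting"],
]

pairs = cooccurrence_counts(docs)
print(together(pairs, "seomoz", "distilled"))
print(together(pairs, "seomoz", "consulting"))
```

Counts like these say which words tend to travel together, but not which subject a page is actually about, which is the gap topic modeling tries to fill.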
Ben: It’s like you kind of knew. So you can imagine that you could have a
whole bunch of words that would have a very high co-occurrence with
Star Trek. Right? You could have documents that talk about gravity,
space, planet, and tachyon. But it still might not be about Star
Trek, even though you’ve got four words that co-occur a lot with
Star Trek. It could be about astronomy. Those are all real things that
exist in the real world, or at least people think they might exist
in the real world in the context of tachyons. But if you have
something that is talking about tachyons and gravity and William
Shatner, that’s probably Star Trek. Right?
And so, it’s not just the number of words you have that co-occur.
You are actually trying to figure out are these words being used in
the context where they are talking about Star Trek, or are these
words being used in the context of talking about astronomy. The way
we can do this is because in general fewer topics is better. So,
it’s possible that we have something that is talking about astronomy
and TV and it happened to use gravity and tachyon and William
Shatner in the context of something else he did. But it’s more
likely to just have . . .
Rand: So normally, we might say like, "Okay, I can imagine Google using
this to try and do a couple of things." Right?
Ben: Right.
Rand: For weird queries, where maybe the word Star Trek wasn’t used but
they think it might be about that and they think that’s what the
person wanted, maybe they would do it. But for ordinary rankings, it
seems like using these words when I’m talking about astronomy or
using these words when I’m talking about Star Trek isn’t going to
help me any more than not using them. But then we did this topic
modeling work and we tried to analyze that. Right? So we used a
process called LDA, which maybe we can talk about in a sec. But we
used this process to basically build a model that has all these
different topics.
Ben: Right.
Rand: And essentially, the topics, as I understand them, aren’t actually
keywords. They’re just like a mathematical representation of a
subject matter. Like you were saying there’s probably a cartoon
topic, but it’s not like the word occurred necessarily.
Ben: Yeah, right. So, it has actual words in it. Right?
Rand: Yeah.
Ben: You can look at a given topic and you can see all of the words in it
and see how much each word is in it. But no human went by and said
we should make a topic about this to show what words may be put
together. So, if you look at papers, people pretty much refer to
topics by whatever the most common word in it is, which in the case
of cartoon might be cartoon.
Rand: Like I remember one of the early ones we were looking at was
Transformers.
Ben: Yeah, right.
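One way to picture what Ben describes, a topic as a weighted bag of words that gets nicknamed after its heaviest word, is the sketch below. Every word and weight here is invented for illustration; none of it is output from the SEOmoz model.

```python
# Hypothetical topics: each maps words to made-up weights that sum to 1.
topics = {
    0: {"cartoon": 0.30, "animation": 0.25, "episode": 0.20,
        "character": 0.15, "network": 0.10},
    1: {"optimus": 0.35, "megatron": 0.30, "autobot": 0.20, "movie": 0.15},
}


def topic_label(word_weights):
    """Nickname a topic after its single most heavily weighted word."""
    return max(word_weights, key=word_weights.get)


def top_words(word_weights, n=3):
    """The n most heavily weighted words, heaviest first."""
    return sorted(word_weights, key=word_weights.get, reverse=True)[:n]


for topic_id, weights in topics.items():
    print(topic_id, topic_label(weights), top_words(weights))
```

No human assigned the labels; they fall out of whichever word happens to carry the most weight.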
Rand: It was like, oh, well, Optimus Prime and Megatron and Sydney, the
woman who’s in all of the movies now. She came up a lot. Megan
Fox was in there.
Ben: Is she related to Vanessa Fox?
Rand: I don’t think so.
Ben: Okay.
Rand: In fact, I strongly suspect no.
Ben: Okay.
Rand: I’d guess it’s a screen name. But so, in any case, you get these
topics. You have these words in them. And then when we say, "Well,
how much does this matter? Like how much does it matter if I am
writing a page about Star Trek and I have lots of links pointing to
me, but I’m not ranking as well as I think I should. Could it be
that maybe I have not included keywords that would tell Google that
I am actually about the topic Star Trek or about related topics?"
Yes. And so, we don’t know how important that is. And that’s why we
did something about correlation to try and figure this out.
Ben: Yeah, right. Because, obviously, we don’t work at Google.
Rand: We just have to look at the outcome.
Ben: We have to look at the search results and then decide if this seems
like what they are doing. Yeah. So we try to see.
Rand: All right. So, let’s talk about that correlation process. So Ben,
we’re talking about this correlation thing and a part of me is kind
of going like, as a classic SEO, like non-statistics, math major,
this kind of thing, I kind of go, "Isn’t the best way to test
whether this works is to have like two random documents on the Web,
and I’ll try putting your LDA stuff to work and see if it raises up
one of them or doesn’t raise up the other?" And I can do tests that
way. Like, what’s this correlation? Why do I need that? Is that a
better way to do it?
Ben: I mean, they are just different. We’ve tried doing control tests
where we put the keyword and title tag on one and not the other and
we see which one ranks. But it’s very hard to do enough of those to
reach statistical significance. It’s pretty easy to set ten websites
where one is doing stuff one way and the other is doing stuff the
other way. But you end up doing like four one way and six the other,
or three one way and seven the other.
Frequently, a lot of these effects aren’t that big. Google sees it
as hundreds of things that influence SERPs. So even if you try to
control for as many variables as you can to try and make it the same
between these two, there is just a lot of noise in terms of what
actually ranks higher. So it takes a very large amount of work to
make enough samples to say something with statistical confidence.
Rand: And you never know when you might have some weird factor that is
influencing all of them in some weird way.
Ben: Yeah. There is another problem that you are probably looking at this
really tiny page and little tiny domains because you are not setting
a huge number of large-scale domains to try this out. Right?
Rand: Right.
Ben: So you are going to get an answer. The question is: Is this answer
going to scale up to real pages people care about from my small
pages that have ten links to them? So, it is a very interesting
process, and I actually would be very fascinated that people get
good results from it. But, we have tried it and the results have all
kind of been . . .
Rand: Middling at best.
Ben: Middling, yeah.
Rand: There are no good conclusions from anything. So instead, we use this
correlation process. Right?
Ben: Right.
Rand: If I understand your process right, you basically run across not a
dozen or a hundred, but hundreds or thousands, in some cases, of
different search results looking for elements that will predict that
something ranks higher or lower.
Ben: Yeah.
Rand: And so I saw that Danny Sullivan left some great comments in our blog
post about LDA. He said, for example, "Well, you guys said that
correlation with keywords in the title is very low. I don’t believe
that at all because, when I look at search results, all the search
results I see almost always have the keyword in the title tag. So,
what are you measuring here that I’m not seeing?"
Ben: Right. The difference is measuring what a keyword is in the search
results versus measuring what is correlated with making it appear
higher in the search results.
Rand: So if all of these included the keyword Star Trek in the title
element, then what’s the ranking correlation of the title element
with the keyword?
Ben: It would be zero. Right?
Rand: Because they are all the same. What’s the possibility that something
will be a blue link appearing on Google?
Ben: That’s an interesting thing. We computed some data a while ago using
the correlations where we were comparing Bing and Google. It
actually was interesting to see Google tends to have a lot of stuff
with this element. Bing had fewer things with that element. It
actually tells you how the search engine is different. It’s
interesting just looking at raw prominence when you are trying to
compare two search engines. But it’s not very interesting when you
are trying to compare two features because . . .
Rand: Or when you’re trying to figure out what will help you rank well.
Ben: Exactly.
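The distinction between raw prominence and ranking correlation can be made concrete with a small sketch. The transcript does not say which correlation statistic was used, so Spearman rank correlation is an assumption here, as are the single made-up results page and both feature columns. Strictly speaking, correlation with a constant column is undefined; the code returns 0 in that case to match the way the answer is phrased in the conversation.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # assumption: report 0 when a column never varies
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


def prevalence(feature):
    """Raw prominence: the share of results that have the feature."""
    return sum(feature) / len(feature)


# One invented results page: positions 1..10, scored so that higher is better.
goodness = [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
keyword_in_title = [1] * 10                         # every result has it
keyword_in_domain = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]  # only the top three have it

for name, feature in [("title", keyword_in_title), ("domain", keyword_in_domain)]:
    print(name, prevalence(feature), round(spearman(goodness, feature), 2))
```

The title feature is present everywhere yet tells you nothing about order, while the rarer domain feature lines up with position. A study like the one discussed would repeat this over many results pages and average the per-page values.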
Rand: Okay. So, got you. So what Danny Sullivan is talking about with this
"I see the keyword in the title tag like 70 percent of the time or
more," that’s this raw prominence thing.
Ben: Right.
Rand: That’s like how many times does it appear in there? But correlation
of a specific feature with ranking higher is essentially looking at
all of these and then saying like, hmm, you know, on an aggregated
basis across hundreds or thousands of search results . . . I think
the study you did for the Google/Bing thing was like 11,000
different search results. Right?
Ben: It took a long time making search, writing it down on paper.
Rand: Yeah. I bet it did. You’re totally incredible for having done it
manually. So, you look at all of those and then you would say, "Oh,
well this particular element on average like, having the keyword
exactly match the domain name, the top-level domain like it does
here, boy that sure looks like it is correlated with ranking much
higher." I think having the keyword in the domain name was one of
the highest correlated single features that we saw.
Ben: Yeah, right.
Rand: And the same thing goes for number of linking root domains, like
diversity of different link sources that you got. Like in tons and
tons of different websites, I have a link to Amazon, that seems to
predict or correlates well with it doing pretty darn well.
Ben: Right.
Rand: And if I recall, I think correlations for title tags and keyword-
based stuff, with the exception of the domain name, was in the like
0 to 0.1 range. Maybe 0.15, something like that.
Ben: Yeah. In fact, some of them were actually a little bit negative.
Rand: Why would it be negative?
Ben: Because it is quite plausible that if it’s in the title, someone put
it there because they would like to rank higher than they actually
do and (_________) a lot of other things and it’s just not a very
good page.
Rand: So you’re saying, because of keyword stuffing SEOs, there could be a
negative correlation or other conflicts.
Ben: Yeah. Exactly.
Rand: So this on-page stuff, pretty small correlation. Right? So then, we
looked at things like links. A lot of those were in the 0.2 to 0.3
range, with 1 being a perfect correlation. So there was like a link
to your domains. That was pretty decent, like 0.24 or 0.23 or
something like that. Things like page authority, which is a metric
we calculate, was really quite nicely high. It was like 0.35 almost,
0.34, something like that.
Ben: I can’t confirm or deny these numbers. I don’t remember them off the
top of my head.
Rand: All right. But there are different ranges. Right?
Ben: Yeah.
Rand: So, when we looked at linking stuff, it was almost always better than
on-page stuff.
Ben: Yeah, right. Links seem to be, if you had to develop a Google search
algorithm to sort the things and you had to make a choice of Google
as you could, just looking at links seemed to get you most of the
way in terms of anything that we did.
Rand: So then when we saw this LDA thing at 0.32 something, that seems
whacky. That seems crazy high for an on-page factor, because we
never looked at anything that was about the features of the words or
how you use them, with the exception maybe of the keyword in the
domain name, that was this high in correlation. So that sort of
struck us as being very odd, and this is one of the reasons that we
wrote about it and were excited about. But let me just throw this
out there. Correlation is not causation. Right? It could be that
maybe domain name is really the thing that is being ranked. But
maybe it’s other features. Right? Correlation doesn’t necessarily
mean that that is what is causing it.
Ben: Right. And almost certainly our LDA model is not causing it, because
Google doesn’t use our LDA model. They’re not asking for numbers.
Right? Then almost certainly Google is not going to do LDA like we
have done it. They have not used our corpus. We have a model that is
correlated with Google’s results, and it is certainly not causing
Google’s results. But the thing is that it is a very high
correlation. So, they are doing something that is somehow producing
results that are correlated with a LDA model. It is hard to imagine
really what that would be, unless it was some sort of topic modeling
or something like looking at the words used on the page.
Rand: So, there’s two things that come out of this. One is that, to my
mind, when I see something that high and assuming all the numbers
look right, I think some people gave your numbers a hard time, but
it looks like at the least the criticism they have received so far
has not made us doubt that we have done something wrong.
Ben: Yeah. I spend most of the day running code. But it is quite plausible
that I did something wrong. I’m sure I have. But the specific
complaints people have come up with so far aren’t very credible.
But, you know, in the future, it will certainly happen someday.
Rand: I’m sure we are all excited for that day, Ben. Assuming that these
numbers are quite high, doesn’t it sort of say like maybe we’ve been
wrong about this on-page stuff not mattering all that much? Maybe we
should do more on that front, like more investigation, test out the
results, try putting our keywords on the pages in certain ways.
Ben: Well, Google always says to spend time writing good content. Right?
And that’s a little bit hard to apply, but you can interpret that as
being right content makes it clear what your topic is by using words
that are going to eliminate any topic from being (________) except
for the one that you are trying to rank for. So, I don’t know if
it’s that revolutionary. It seems like people have worried a lot
about their content in the past and a lot of people say to do so.
Rand: But so people in the past, they talked about things like, oh, we
should use like the Google Wonder Wheel. And we should use related
searches and put those words on our pages. We should use things like
synonyms that we get from the service. Well, how is the LDA stuff
different? Or is it? Like if I just do these things, am I going to
do great over here?
Ben: Well, I mean they are not going to be bad. But if you can imagine
that when you put a whole bunch of synonyms for tachyon, it’s not
going to actually help clarify if you’re about astronomy or Star
Trek. Right? So, you don’t actually or that you’re trying to discuss
bark collars and you want to just clarify that you are talking about
dogs as opposed to the stuff that wraps trees. You are not going to
want to put a whole bunch of synonyms for collars or barking. Yeah,
but that’s sort of weird and unnatural. You much more want to put
other related words to make it clear that we are talking about some
sort of bark preventive system.
Rand: So, let’s talk really briefly about the tool today. It doesn’t do
exactly this. Right? Instead, it gives us a score.
Ben: Yeah.
Rand: All right. Let’s look at that.
Ben: Okay.
Rand: Now this LDA score, tool might be an overstatement. It’s a Labs. You
can look and see it. It works. You can put stuff in. But we have a
lot of really beautiful tools here at SEOmoz, and this is not one of
them. So, it’s not the prettiest thing in the world. But it does
leverage the topic modeling work, and you use the specific process,
LDA, which we think is sort of better than some other ones, but not
being as good as the sophisticated stuff Google does.
Ben: Almost certainly.
Rand: I enter a query up here. Something I want to rank for. I put in some
words here, and it will give me a percent telling me how topically
relevant it thinks this content here is to the word here. And it
will do the same thing like if I enter a URL down here, it will
populate this box with the content from that page.
Ben: Right.
Rand: So this gives me sort of a rough sense of I can play around and see
does SEOmoz’s LDA tool work. LDA scores seem to predict anything
that I can rank better. So, I could look at the top ten results and
be like, "Wow, I’m winning on links. I think I’m doing a good job of
keyword usage. But boy, all these other people have much higher LDA
scores than I do. Maybe I should try increasing that." Is that sort
of a suggested application here?
Ben: That would seem very reasonable to me. Like it is kind of new. No one
has a huge amount of experience with it. So far, it seems like
people have said that it chains up a higher score and it has helped
them rank, but that’s very anecdotal. There’s a very plausible
reason why you would think that that would work. But, we’re kind of
on the bleeding edge here.
Rand: We’re not trying to say that like you definitely enter something in
here, you should use this and boost up the rankings of all of your
pages. It will work perfectly or anything like that.
Ben: Yeah, exactly. But it seems very plausible that basically getting a
higher score helps you rank higher. And the tool lets you see
clearly what this kind of topic modeling is going to be able to
figure out. It sort of shows you the kind of connections that Google
certainly will be able to make in figuring out that pizza is related
to food but donkey is not related to food. So you can sort of
explore and see how this stuff works.
Rand: Cool. One weird thing that people have noted and the last point is
that this fluctuates a lot. Oftentimes, when I run it, it will
fluctuate one to five percent change. Like I’ll hit go on the same
URL, the same content, the same keyword, and it will change one
percent to five percent. Sometimes it seems like it can go to maybe
seven, eight, or nine percent. A couple of people have reported –
we haven’t been able to see them — rare instances where it is more
than ten percent fluctuation. So, explain to me what is going on
there. What is the sampling that the tool does?
Ben: Right. So there’s a very large possible number of ways that you could
explain the document with topics. It could be about Star Trek. Or it
could be about astronomy and TV shows. There are lots of different
ways that you could explain the different word usages in there. So
we can’t actually just try all of them and weight them by the
probability because that would take years to answer anybody. So
instead, we sample them based upon their likelihood and then we
average that. So, if you wanted to figure out are most people going
to vote Democrat or Republican this year, you might sample 100
people and you’re going to conclude that 40 percent are going to
vote Democratic this year.
Rand: But then if you sample a different 100 people . . .
Ben: It will be a little bit different. Generally, you can’t come back and
say 70 percent are going to vote Democratic this year. It’s in
theory possible, but it doesn’t happen that frequently.
Rand: Got you. So you can essentially use this number. If I was really
interested, I would have to get more precise. I could run it a bunch
of times, and I would be getting a bunch of different samples and I
would average those out.
Ben: Yeah. In the back end, we’re doing it a bunch of times for you and
averaging them. So averaging it yourself on the front end as you go
isn’t terrible.
Rand: It’s just a big use of our bandwidth.
Ben: Oh, yeah. It really helps our numbers of hits to our website.
Rand: Oh, yeah. I’m sure that’s all correlated with rankings too.
Ben: I know like unique visitors. What’s that?
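The polling analogy Ben uses to explain the score fluctuation can be simulated in a few lines. This is only a sketch of the analogy, not of the tool’s actual LDA inference: the 40 percent "true" share, the sample size of 100, the 25 repeats, and the function names are all assumptions chosen for illustration.

```python
import random
import statistics


def poll(rng, true_share, sample_size=100):
    """Ask `sample_size` simulated voters and return the observed share."""
    return sum(rng.random() < true_share for _ in range(sample_size)) / sample_size


def averaged_poll(rng, true_share, repeats, sample_size=100):
    """Run several independent polls and average their results."""
    return statistics.mean(poll(rng, true_share, sample_size) for _ in range(repeats))


rng = random.Random(0)  # fixed seed so the sketch is repeatable

single = [poll(rng, 0.40) for _ in range(200)]
averaged = [averaged_poll(rng, 0.40, repeats=25) for _ in range(200)]

print("spread of single polls:  ", statistics.pstdev(single))
print("spread of averaged polls:", statistics.pstdev(averaged))
```

Each single poll wobbles by a few points around 40 percent, much like the one to five percent swings described for the tool, and averaging several runs shrinks that wobble.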
Rand: All right. Well, Ben, we’re excited about this tool. We really
appreciate you doing this research work. It’s exciting and
interesting. I think we’ll know more in the future, in the months to
come, whether this is really great and applicable for SEO or that it
turns out that maybe it’s some other things causing this weird
correlation.
Ben: Absolutely.
Rand: Well, thanks very much for obviously building this and joining us.
And thanks to all of you for watching Whiteboard Friday. We’ll see
you again next week.
Ben: This was a long one.
Rand: Very impressed that you watched it. We do appreciate it.
Video transcription by SpeechPad.com
Priceless CRO Advice for $224
Posted by Dr. Pete
The past few years have seen an explosion of usability and Conversion Rate Optimization (CRO) tools hit the market. There have been many good roundup posts about these tools, but I want to focus today on a more in-depth approach to putting just 3 of these tools to work: (1) Five Second Test, (2) Crazy Egg, and (3) UserTesting.com. Total cost to do one round of testing: $224.
(1) Five Second Test ($20)
The premise behind Five Second Test is incredibly simple – show a visitor your site for 5 seconds and see what they remember (or, alternatively, where they click). This is a great starting point for getting some starter observations about your visitors.
How It Works
Setup is easy – just submit a screenshot of your web page or prototype (great for design comparisons) and the replies start coming in. You can view them individually or grouped by concepts. Five Second Test is actually free, but the $20/month package means you’ll get a larger response rate. It’s worth the extra cash, IMO. You can also earn credits ("karma") by taking other people’s tests – it’s kind of fun and can be informative.
What to Test
Think about the kind of things you want your visitors to know about in 5 seconds: The big questions: Who, What, Why. Here are a few uses I recommend:
- Do visitors recognize your brand?
- Do people get what you do?
- Is your tagline descriptive and effective?
- Is your page too visually noisy?
- Is Concept B better than Concept A?
- Can people find your call to action?
If people are remembering things like "blue", "blonde girl", and "ugly site", you know you’ve got some work to do (those aren’t far from real examples of what I’ve seen).
(2) Crazy Egg ($9)
Heat-mapping tools like Crazy Egg take user activity and translate it into visual maps, helping you to easily visualize how people interact with your site. Crazy Egg was founded by SEO wonder kid Neil Patel, and is an amazing bargain at $9/month. If you can’t bother to spend $9 on improving your website, feel free to stop reading this post. I’m serious – go buy a Venti Iced Mocha and a cookie instead of spending money on your business.
How It Works
This one’s a little bit trickier – you’ll have to install a JavaScript snippet similar to Google Analytics and other tools. Then, Crazy Egg starts tracking clicks on your specified page (try to stick to one page, as jumping pages can produce odd results).
What to Test
Crazy Egg not only allows you to create visual heat maps, but also has a "confetti" mode that lets you visualize clicks by segments, such as referring sources and new vs. returning visitors. Here are a few questions a heat-mapping tool can help you answer:
- Are people clicking where you want them to click?
- Is your navigation effective?
- Do you have too many choices?
- Do search visitors behave differently?
- Is your call to action getting clicks?
Although some heat-mapping tools can get bogged down in the visuals, I think that Crazy Egg has a very simple, elegant reporting approach that can give you solid insights quickly. Once you’ve gathered some initial impressions from Five Second Test and Crazy Egg, it’s time to do some real user testing…
(3) UserTesting.com ($195)
It used to be that user testing required a lab, expensive equipment, and a difficult recruiting process. Now, you can use remote testing services like UserTesting.com to get quick, inexpensive user feedback. While I won’t say it compares apples-to-apples to laboratory testing, I often find that the insights from even a handful of remote testing subjects can be incredibly useful.
How It Works
Setup is pretty straightforward, but doing it right can take a little bit of time. Technically, you just need to submit your URL and a few instructions to visitors. You pay $39 per visitor and receive both written feedback and an online video of the user walking through your site (with voice-over). Although this is a topic of some debate in the usability community, 5 users is a good number for uncovering core insights and getting solid bang for your buck.
What to Test
Take some time setting up your questions. Traditional usability tests are task-oriented – you tell someone to try to complete a task in a fairly open-ended fashion and watch them go to work. Be specific about the task and ask follow-up questions, like "Would you trust this site enough to make a purchase?" (I generally ask 3-4 follow-ups). A few questions this kind of qualitative testing can help you answer:
- Can people complete the task?
- How long does task completion take?
- Do users experience common stumbling blocks?
- What are visitors thinking out loud about?
- Does your search/navigation work as expected?
- Are you missing features people might be looking for?
- Do visitors get frustrated using your site?
Qualitative testing can be a great precursor to quantitative (A/B and multivariate) testing. Don’t throw design changes at the wall and see what sticks – put user testing to work to uncover hidden issues on your site. We all need a fresh pair (or 5 pairs) of eyes from time to time.
Here’s to $224 Well Spent
I’m an entrepreneur and a Bohemian – I understand that parting with money isn’t easy. The insights you’ll gain from just over $200, though, will, in my experience, easily yield 10X or even 100X back in online sales improvement. Solid qualitative data collection will also prevent you from making costly mistakes and will better inform how you look at your analytics and quantitative testing. There are plenty of good tools out there – choose a couple of them, and really put the effort into understanding how they work. You’ll be well rewarded.
An Interview on SEOBook
Posted by randfish
Just a short post tonight.
First off, I’m honored to be interviewed by Aaron Wall. We’ve had our differences and maintain some divergent opinions on a few topics, but we both have an insane passion for helping make SEO professionals better at their job and work hard to grow the credibility of SEO as a whole.
Second – we’ve got a lot of reason to be thankful. SEOmoz was recently named the 334th fastest growing company in the US by Inc Magazine. I was named to Seattle’s 40 Under 40 List (I’m guessing it’s a typo) and we’ve recently passed 6,000 PRO subscribers (actually, we’re up over 6,300 as of today).

As amazing as all that is, nearly everyone at SEOmoz is thinking not about these milestones, but about one of our own – Jen Lopez – who noted on her Twitter feed that she’s out battling cancer. We are all with you Jen – every last one of us, with all our hearts. And we agree: #fuckcancer
Latent Dirichlet Allocation (LDA) and Google’s Rankings are Remarkably Well Correlated
Posted by randfish
Last week at our annual mozinar, Ben Hendrickson gave a talk on a unique methodology for improving SEO. The reception was overwhelming – I’ve never previously been part of a professional event where thunderous applause broke out not once but multiple times in the midst of a speaker’s remarks.

Ben Hendrickson speaking last Fall at the Distilled/SEOmoz PRO Training London
(he’ll be returning this year)
I doubt I can recreate the energy and excitement of the 320-person filled room that day, but my goal in this post is to help explain the concepts of topic modeling, vector space models as they relate to information retrieval and the work we’ve done on LDA (Latent Dirichlet Allocation). I’ll also try to explain the relationship and potential applications to the practice of SEO.
A Request: Curiously, prior to the release of this post and our research publicly, there have been a number of negative remarks and criticisms from several folks in the search community suggesting that LDA (or topic modeling in general) is definitively not used by the search engines. We think there’s a lot of evidence to suggest engines do use these, but we’d be excited to see contradicting evidence presented. If you have such work, please do publish!
The Search Rankings Pie Chart
Many of us are likely familiar with the ranking factors survey SEOmoz conducts every two years (we’ll have another one next year and I expect some exciting/interesting differences). Of course, we know that this aggregation of opinion is likely missing out on many factors and may over or under-emphasize the ones it does show.
Here’s an illustration I created for a presentation recently to help illustrate the major categories in the overall results:

This suggests that many SEOs don’t ascribe much weight to on-page optimization
I myself have often felt that from all the metrics, tests and observations of Google’s ranking results, the importance of on-page factors like keyword usage or TF*IDF (explained below) is fairly small. Certainly, I’ve not observed many results, even in low competitive spaces, where one can simply add in a few more repetitions of the keyword, maybe toss in a few synonyms or "related searches" and improve rankings. This experience, which many SEOs I’ve talked to share, has led me to believe that linking signals are an overwhelming majority of how the engines order results.
But, I love to be wrong.
Some of the work we’ve been doing around topic modeling, specifically using a process called LDA (Latent Dirichlet Allocation), has shown some surprisingly strong results. This has made me (and I think a lot of the folks who attended Ben’s talk last Tuesday) question whether it was simply a naive application of the concept of "relevancy" or "keyword usage" that gave us this biased perspective.
Why Search Engines Need Topic Modeling
Some queries are very simple – a search for "wikipedia" is non-ambiguous, straightforward and can be effectively returned by even a very basic web search engine. Other searches aren’t nearly as simple. Let’s look at how engines might order two results – a simple problem most of the time that can be somewhat complex depending on the situation.




For complex queries or when relating large quantities of results with lots of content-related signals, search engines need ways to determine the intent of a particular page. Simply because it mentions the keyword 4 or 5 times in prominent places or even mentions similar phrases/synonyms won’t necessarily mean that it’s truly relevant to the searcher’s query.
Historically, lots of SEOs have put effort into this process, so what we’re doing here isn’t revolutionary, and topic models, LDA included, have been around for a long time. However, no one in the field, to our knowledge, has made a topic modeling system public or compared its output with Google rankings (to help see how potentially influential these signals might be). The work Ben presented, and the really exciting bit (IMO), is in those numbers.
Term Vector Spaces & Topic Modeling
Term vector spaces, topic modeling and cosine similarity sound like tough concepts, and when Ben first mentioned them on stage, a lot of the attendees (myself included) felt a bit lost. However, Ben (along with Will Critchlow, whose Cambridge mathematics degree came in handy) helped explain these to me, and I’ll do my best to replicate that here:

In this imaginary example, every word in the English language is related to either "cat" or "dog," the only topics available. To measure whether a word is more related to "dog," we use a vector space model that creates those relationships mathematically. The illustration above does a reasonable job showing our simplistic world. Words like "bigfoot" are perfectly in the middle with no more closeness to "cat" than to "dog." But words like "canine" and "feline" are clearly closer to one than the other and the degree of the angle in the vector model illustrates this (and gives us a number).
BTW - in an LDA vector space model, topics wouldn’t have exact label associations like "dog" and "cat" but would instead be things like "the vector around the topic of dogs."
Unfortunately, I can’t really visualize beyond this step, as it relies on taking the simple model above and scaling it to thousands or millions of topics, each of which would have its own dimension (and anyone who’s tried knows that drawing more than 3 dimensions in a blog post is pretty hard). Using this construct, the model can compute the similarity between any word or group of words and the topics it’s created. You can learn more about this from Stanford University’s posting of Introduction to Information Retrieval, which has a specific section on Vector Space Models.
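For anyone who prefers code to diagrams, here's a minimal Python sketch of the cat/dog world. It is not the model behind our tool: the coordinates for "feline," "canine" and "bigfoot" are numbers I've invented to mirror the illustration, and the names `cosine_similarity`, `WORDS` and `topic_scores` are purely illustrative. The only real math is the cosine of the angle between two vectors.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction, so treat it as unrelated
    return dot / (norm_a * norm_b)

# Axes are (cat, dog). All coordinates below are made up for illustration.
TOPIC_CAT = (1.0, 0.0)
TOPIC_DOG = (0.0, 1.0)
WORDS = {
    "feline": (0.9, 0.1),
    "canine": (0.1, 0.9),
    "bigfoot": (0.5, 0.5),
}

def topic_scores(word):
    """Similarity of one toy word vector to each of the two topics."""
    vec = WORDS[word]
    return {
        "cat": cosine_similarity(vec, TOPIC_CAT),
        "dog": cosine_similarity(vec, TOPIC_DOG),
    }

for word in WORDS:
    scores = topic_scores(word)
    print(f"{word:8s} cat={scores['cat']:.2f} dog={scores['dog']:.2f}")
```

With these invented numbers, "bigfoot" scores the same against both topics, while "feline" and "canine" each lean heavily one way. Scaling up just means longer vectors, one entry per topic.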
Correlation of our LDA Results w/ Google.com Rankings
Over the last 10 months, Ben (with help from other SEOmoz team members) has put together a topic modeling system based on a relatively simple implementation of LDA. While it’s certainly challenging to do this work, we doubt we’re the first SEO-focused organization to do so, though possibly the first to make it publicly available.
When we first started this research, we didn’t know what kind of an input LDA/topic modeling might have on search engines. Thus, on completion, we were pretty excited (maybe even ecstatic) to see the following results:
Correlation Between Google.com Rankings and Various Single Metrics

(the vertical blue bars indicate standard error in the diagram, which is relatively low thanks to the large sample set)
Using the same process we did for our release of Google vs. Bing correlation/ranking data at SMX Advanced (we posted much more detail on the process here), we’ve shown the Spearman correlations for a set of metrics familiar to most SEOs against some of the LDA results, including:
- TF*IDF – the classic term weighting formula, TF*IDF measures keyword usage in a more accurate way than a more primitive metric like keyword density. In this case, we just took the TF*IDF score of the page content that appeared in Google’s rankings
- Followed IPs – this is our highest correlated single link-based metric, and shows the number of unique IP addresses hosting a website that contains a followed link to the URL. As we’ve shown in the past, with metrics like Page Authority (which uses machine learning to build more complex ranking models) we can do even better, but it’s valuable in this context to just think and compare raw link numbers.
- LDA Cosine – this is the score produced from the new LDA labs tool. It measures the cosine similarity of topics between a given page or content block and the topics produced by the query.
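If TF*IDF is new to you, a rough sketch may help. There are many variants of the formula, and I'm not claiming this is the exact one used in our correlation data; it's just one common textbook form (term frequency within the document, multiplied by the log of the inverse document frequency). The three-document corpus and the function name `tf_idf` are made up for illustration.

```python
import math
from collections import Counter

def tf_idf(term, doc_tokens, corpus):
    """One common TF*IDF variant: (count / doc length) * log(N / docs containing term)."""
    if not doc_tokens:
        return 0.0
    tf = Counter(doc_tokens)[term] / len(doc_tokens)
    docs_with_term = sum(1 for doc in corpus if term in doc)
    if docs_with_term == 0:
        return 0.0  # term never appears in the corpus, so it carries no weight here
    idf = math.log(len(corpus) / docs_with_term)
    return tf * idf

# A tiny invented corpus; each document is a list of lowercase tokens.
corpus = [
    "the rolling stones announce tour dates".split(),
    "the best gemstones include rubies and emeralds".split(),
    "the stones tour with mick jagger and keith richards".split(),
]

for term in ("the", "stones", "rolling"):
    print(term, round(tf_idf(term, corpus[0], corpus), 4))
```

A word like "the" appears in every document, so its weight drops to zero, while a rarer word like "rolling" scores higher than "stones," which shows up in two of the three documents. That's the sense in which TF*IDF is smarter than raw keyword density.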
The correlation of the LDA scores with rankings is uncanny. Certainly, it’s not a perfect correlation, but that shouldn’t be expected given the supposed complexity of Google’s ranking algorithm and the many factors therein. But, seeing LDA scores show this dramatic result made us seriously question whether there was causation at work here (and we hope to do additional research via our ranking models to attempt to show that impact). Perhaps, good links are more likely to point to pages that are more "relevant" via a topic model or some other aspect of Google’s algorithm that we don’t yet understand naturally biases towards these.
However, given that many SEO best practices (e.g. keywords in title tags, static URLs and ) have dramatically lower correlations and the same difficulties proving causation, we suspect a lot of SEO professionals will be deeply interested in trying this approach.
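For readers who want to see what a Spearman correlation actually computes, here's a small self-contained sketch: rank both lists (ties get the average of their positions), then take the ordinary Pearson correlation of the ranks. The sample scores are invented, and the helper names (`average_ranks`, `pearson`, `spearman`) are just for illustration; this is not the pipeline we used for the chart above.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(xs, ys):
    """Plain Pearson correlation; returns 0.0 if either list is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def spearman(xs, ys):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))

# Hypothetical example: five results, with position 1 being the top result.
positions = [1, 2, 3, 4, 5]
goodness = [-p for p in positions]   # flip the sign so "higher = ranks better"
scores = [30, 50, 20, 40, 10]        # invented metric values, one per result

print(spearman(goodness, scores))
```

Flipping the sign of the position is just a convenience so that a metric which rises as results rank better produces a positive number. If the scores fell in perfectly descending order down the page, this would come out at 1.0.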
The LDA Labs Tool Now Available; Some Recommendations for Testing & Use
We’ve just recently made the LDA Labs tool available. You can use this to input a word, phrase, chunk of text or an entire page’s content (via the URL input box) along with a desired query (the keyword term/phrase you want to rank for) and the tool will give back a score that represents the cosine similarity in a percentage form (100% = perfect, 0% = no relationship).
When you use the tool, be aware of a few issues:
- Scores Change Slightly with Each Run
This is because, like a pollster interviewing 100 voters in a city to get a sense of the local electorate, we check a sample of the topics a content+query combo could fit with (checking every possibility would take an exceptionally long time). You can, therefore, expect the percentage output to fluctuate 1-5% each time you check a page/content block against a query.
- Scores are for English Only
Unfortunately, because our topics are built from a corpus of English language documents, we can’t currently provide scores for non-English queries.
- LDA isn’t the Whole Picture
Remember that while the average correlation is in the 0.33 range, we shouldn’t expect scores for any given set of search results to go in precisely descending order (a correlation of 1.0 would suggest that behavior).
- The Tool Currently Runs Against Google.com in the US only
You should be able to see the same results the tool extracts from by using a personalization-agnostic search string like http://www.google.com/xhtml?q=my+search&pws=0
- Using Synonyms, "Related Searches" or Wonder Wheel Suggestions May Not Help
Term vector models are more sophisticated representations of "concepts" and "topics," so while many SEOs have long recommended using synonyms or adding "related searches" as keywords on their pages and others have suggested the importance of "topically relevant content" there haven’t been great ways to measure these or show their correlation with rankings. The scores you see from the tool will be based on a much less naive interpretation of the connections between words than these classic approaches.
- Scores are Relative (20% might not be bad)
Don’t presume that getting a 15% or a 20% is always a terrible result. If the folks ranking in the top 10 all have LDA scores in the 10-20% range, you’re likely doing a reasonable job. Some queries simply won’t produce results that fit remarkably well with given topics (which could be a weakness of our model or a weirdness about the query itself).
- Our Topic Models Don’t Currently Use Phrases
Right now, the topics we construct are around single word concepts. We imagine that the search engines have probably gone above and beyond this into topic modeling that leverages multi-word phrases, too, and we hope to get there someday ourselves.
- Keyword Spamming Might Improve Your LDA Score, But Probably Not Your Rankings
Like anything else in the SEO world, manipulatively applying the process is probably a terrible idea. Even if this tool worked perfectly to measure keyword relevance and topic modeling in Google, it would be unwise to simply stuff 50 words over and over on your page to get the highest LDA score you could. Quality content that real people actually want to find should be the goal of SEO and Google’s almost certainly sophisticated enough to determine the difference between junk content that matches topic models and real content that real users will like (even if the tool’s scoring can’t do that).
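On the first caveat in that list (scores changing slightly with each run), a toy simulation can make the pollster analogy concrete. To be clear, this is not the tool's code and the numbers are not real tool data: `population` is a hypothetical list of per-topic match values, and `sampled_score` simply averages a random sample of them, which is my stand-in for "checking a sample of the topics."

```python
import random
import statistics

def sampled_score(population, sample_size, rng):
    """Estimate the population mean from a random sample, like polling a few voters."""
    return statistics.mean(rng.sample(population, sample_size))

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical per-topic match values between 0 and 1 (invented, not tool output).
population = [rng.random() for _ in range(10_000)]
true_mean = statistics.mean(population)

# Twenty "runs," each looking at only 100 of the 10,000 values.
runs = [sampled_score(population, 100, rng) for _ in range(20)]
averaged = statistics.mean(runs)

print(f"true mean:       {true_mean:.3f}")
print(f"lowest run:      {min(runs):.3f}")
print(f"highest run:     {max(runs):.3f}")
print(f"average of runs: {averaged:.3f}")
```

Each run lands a little above or below the true value, and the average of several runs is never further off than the worst single run. If a score from the tool looks borderline, checking it a few times and looking at the spread is probably a reasonable habit, though that's my assumption rather than something we've tested.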
If you’re trying to do serious SEO analysis and improvement, my suggested methodology is to build a chart something like this:

SERPs analysis of "SEO" in Google.com w/ Linkscape Metrics + LDA (click for larger)
Right now, you can use Keyword Difficulty’s export function and then add in some of these metrics manually (though in the future, we’re working towards building this type of analysis right into the web app beta).
Once you’ve got a chart like this, you can get a better sense of what’s propping up your competitors’ rankings – anchor text, domain authority, or maybe something related to topic modeling relevancy (which the LDA tool could help with).
Undoubtedly, Google’s More Sophisticated than This
While the correlations are high, and the excitement around the tool both inside SEOmoz and from a lot of our members and community is equally high, this is not us "reversing the algorithm." We may have built a great tool for improving the relevancy of your pages and helping to judge whether topic modeling is another component in the rankings, but it remains to be seen if we can simply improve scores on pages and see them rise in the results.
What’s exciting to us isn’t that we’ve found a secret formula (LDA has been written about for years and vector space models have been around for decades), but that we’re making a potentially valuable addition to the parts of SEO we’ve traditionally had little measurement around.
BTW – Thanks to Michael Cottam, who suggested the reference of research work by a number of Googlers on pLDA. There are hundreds of papers from Google and Microsoft (Bing) researchers around LDA-related topics, too, for those interested. Reading through some of these, you can see that major search engines have almost certainly built more advanced models to handle this problem. Our correlation and testing of the tool’s usefulness will show whether a naive implementation can still provide value for optimizing pages.
For those who’d like to investigate more, we’ve made all of our raw data available here (in XLS format, though you’ll need a more sophisticated model to do LDA). If you have interest in digging into this, feel free to email Ben at SEOmoz dot org.
How Do I Explain this to the Boss/Client?
The simplest method I’ve found is to use an analogy like:
If we want to rank well for "the rolling stones" it’s probably a really good idea to use words like "Mick Jagger," "Keith Richards," and "tour dates." It’s also probably not super smart to use words like "rubies," "emeralds," "gemstones," or the phrase "gathers no moss," as these might confuse search engines (and visitors) as to the topic we’re covering.
This tool tries to give a best guess number about how well we’re doing on this front vs. other people on the web (or sample blocks of words or content we might want to try). Hopefully, it can help us figure out when we’ve done something like writing about the Stones but forgetting to mention Keith Richards.
As always, we’re looking forward to your feedback and results. We’ve already had some folks write in to us saying they used the tool to optimize the contents of some pages and seen dramatic rankings boosts. As we know, that might not mean anything about the tool itself or the process, but it certainly has us hoping for great things.
p.s. The next step, obviously, is to produce a tool that can make recommendations on words to add or remove to help improve this score. That’s certainly something we’re looking into.
p.p.s. We’re leaving the Labs LDA tool free for anyone to use for a while, as we’d love to hear what the community thinks of the process and want to get as broad input as possible. Future iterations may be PRO-only.
Two Quick, Simple Social Media Tips
Posted by RobOusbey
Today, I want to share two pieces of advice that are particularly useful to certain types of business – and will be exceptionally quick to implement. I’ve also created a free download that might help some people implement one of these ideas even more quickly.
About two years ago, I made a recommendation to a client in the UK, and I’ve just seen it used by a hotel in the USA. If your business offers public computers with internet access – such as those in hotel lobbies, libraries, etc – this is for you:
Tip 1: Put up a sign, next to your public computers, with a call to action; typically this could be something like ‘Find us on Facebook’ or ‘Follow us on Twitter’.
Here’s such a poster in use, at the Ledgestone Hotel in Yakima. (Click the image to embiggen.)
Sadly, it doesn’t look like the Ledgestone is doing much with their Twitter account; this probably disappoints people who go to their page, and so they don’t end up with as many followers as they could do. Remember – getting people to your Twitter page (or Facebook, or whatever else you’re asking them to do) is only the first stage – there has to be something there for them when they arrive.
The second tip is more for people who offer wi-fi – this could be all manner of hotels, conference venues, airports, aeroplanes, train stations, coffee shops, etc. For places that offer free wi-fi, this can work even better:
Tip 2: You control the first page visitors see after logging on to your wi-fi. Don’t waste this with a dull message; make the page interesting, and put some calls to action on there.
People have probably logged on to do something – but many will welcome a distraction – particularly if you keep the request brief. Create a nicely styled, but simple page, and add a couple of messages on there. Some examples could include:
- Follow us on Twitter / Like us on Facebook: you could incentivize this, for example: if you’re a coffee shop, then offer a free latte to new followers
- Sign up to our email newsletter: this will only take them a second if you make sure the form is right there on the page, and again this can be incentivized
- Don’t forget to check in on foursquare: ideal for almost any location, and this is as good a time as any to remind them to check in
- If you’re enjoying your stay, please review us: particularly useful for hotels, where online reviews can increase visibility; I’ll go into a little more detail about this below.
There can be some issues with sites noticing that a lot of people from the same IP are visiting, particularly when it comes to review services. Local search expert David Mihm advised me that he’s heard Yelp in particular does try to filter out multiple reviews from the same IP, and that TripAdvisor’s fraud rules include clauses that might get you into trouble (for example, offering incentives for people to write reviews is not permitted).
I’d recommend two steps to get around this type of issue:
- Try to appeal for reviews only from people who already have accounts on those sites (e.g.: "If you’re a Yelp member, please review us here…." or "If you have a Google account, please leave a review here…")
- Make this ‘post-wifi-login’ page available on the public internet; review sites should be able to recognize that lots of people are being referred to your page from the same URL – if it’s public then they’ll be able to visit that page, and should figure out what is going on.
I’ve built a quick free template for you to download as a starting point. You can visit the file, or download it, by clicking this link: free wifi login CTA page.
(That was created based on a template from LayoutGala; I’m not going to add any licence to it, other than use it however you want. You should change the images that are in it to be local files at the very least.)
Honestly, it doesn’t take long to print off a couple of small posters (or even to publish a nice wifi login page) so I’ll hope to see social-media CTAs cropping up all over the place soon.



