SEO Reports

This blog offers news, resources, and commentary on search engine optimization, business blog marketing, RSS, and online public relations.

Tuesday, June 25, 2013

Penguin 2.0 is done: what comes next?

In its latest update, Google shook up website rankings across the board and billed the release as a major update, although it was less a tweak than a change to the whole algorithm. So what is Google's next move in updating its algorithms? Work goes on quietly at the back end of the company, and a few months later another update goes live and once again hits the rankings of websites, good and bad alike.


Monday, August 13, 2012

Features of Cloud Computing


Since the beginning of information technology, cloud computing has played an important part in ensuring that companies and businesses run smoothly. It has provided various types of services that are secure, reliable, and always available. To obtain the highest-quality cloud computing, business owners look at its characteristics and features before acquiring the service. Cloud computing has become attractive to end users and customers because of the salient characteristics below.

Features of Cloud Computing
1.    Cloud hosting suits business entities that use resources heavily.
2.    Tight security.
3.    The user can decide on his current and future requirements.
4.    Cloud computing is measurable: a user can obtain a specific number of software licenses as required and can fix the network bandwidth and data space that fit the business's demands.
5.    Cost effective: you don't have to spend money and time on software and server updates; the developers maintain and update the programs on a regular basis.
6.    Lower overheads: sharing infrastructure is convenient and cost effective, there is no need to spend capital on infrastructure development, and users are billed only for the services they have utilized.
7.    Environment friendly: because a large number of people share common resources, it helps cut global waste.

Disadvantages of Cloud Computing:
1.    Technical issues: when there is a technical problem or the system is down, it can be difficult to access your information. Any provider is prone to outages, and you need a fast, reliable internet connection to use the services.
2.    Third-party software: you are handing your company's information over to a third-party cloud computing service provider, which can be a risk at times, so you have to be very careful when choosing your service provider.
3.    Hacker attacks: an attack on a cloud computing service provider can leave your company's information vulnerable; there are always people trying to access your personal data.

Every product has advantages and disadvantages, but in the coming days many more users and organizations will depend on cloud computing services.
To know more about Cloud computing Services in Galway visit


Thursday, March 03, 2011

Why Hiring an SEO Company Can Lead to Better Results for Your Internet Marketing

These days, business owners tend to be more involved with internet marketing, since it gives their business more opportunity to gain worldwide recognition. The approach is not just about building a good website for a company; it is also about optimizing it so that potential customers who have never heard of the product or service can find it online. As a result, demand for website optimization has increased.

Search Engine Optimization, or SEO, is a process that helps a website show up more frequently, and even appear in the top ten results, on a search engine. Getting into the SEO world is not an easy task, however. Without a basic knowledge of what is involved, it is extremely difficult to do effectively. The basics will produce basic optimization and basic results; the top-ten rankings that businesses want come only with a lot of work and more than a fundamental knowledge of search engine optimization. It is also time-consuming to do on your own, so it is often best to hire a professional.

What are the advantages of hiring a professional SEO Company?

For starters, the main benefit of hiring a search engine optimization company is that countless websites appear every day, so your site must stay up to date to keep appearing in the top results. Keeping a site up to date means not only changing or adding pages but also applying the latest strategies to it; done well, it gives potential customers even more opportunities to visit your site. When you hire professionals, you can rely on them to handle your website's requirements, such as marketing, rankings, added content, and optimization. Being highly recognized and remembered is not a one-time job; SEO requires skill and consistent, dedicated time.

Needless to say, this will cost you some money, but it is worthwhile when you are reaping the benefits. Keep in mind that these are just some of the ways you can benefit from hiring an SEO company. Before hiring, it is vital to do your research to ensure that you are choosing the right one. Work with a reputable service that will do everything possible for a client, and at a competitive cost.



Thursday, February 18, 2010

People Search Engine - What You Have to Look For Before Using One

I was wondering about these so-called people search databases and thought I would try them out. Here are some important points to consider before using any service out there:
1. Free versus Paid

I found some of the free services out there to be really good. The only catch with a free service is that you just get an overview of a person's details; if you want further details, you need to pay. If you are satisfied with what you find under the free service, then your work is done. I find these free services handy for getting a quick overview of a person before entering into any relationship. If you need to go into a deeper relationship on a personal level, then consider the paid services, where you get a more detailed report about the person.



Thursday, November 09, 2006

SEO Reports - ten factors for doing good SEO

Hi friends, there are basically ten factors for doing complete SEO, as suggested by the SEO articles from SEO Reports. This online source for SEO reports describes the ten SEO factors below.


These are what I believe to be the top 10 most important things (not necessarily in order) that you need in order to get your website found in the search engines.
There are many other factors as well, but if you follow these guidelines you'll stand a much better chance, and you'll be off to a good start.

1. Title Meta Tag

The title tag is what displays as the headline in the SERPs (Search Engine Results Pages). It's also what displays in the top blue band of Internet Explorer when your site is displayed.

The title tag of your website should be easy to read and designed to bring in traffic. By that, I mean that your main keyword phrase should be used toward the beginning of the tag. True, there are websites being found now that do not use the phrase in the title, but the vast majority still do as of this writing.

Don't make the mistake of putting your company name first, unless you are already a household name, like Nascar or HBO. People are likely searching for what you have to offer, not your name.

Your title tag should be written with a capital letter starting the tag, and followed by all lowercase letters, unless you're using proper nouns. Some people prefer to capitalize every word, too.

2. Description Meta Tag

The description tag is the paragraph that people will see when your page comes up in the search results.

Your description tag should be captivating and designed to attract business. It should be easy to read and compel the reader to act right now and follow your link. Without a description tag, search engines will frequently display the first text on your page. Is yours appropriate as a description of the page?

A proper description tag is what people will see below your title. Make proper use of punctuation and, with readability in mind, include your subject and geographical references.




3. Keywords Meta Tag

The importance of Meta keyword tags fluctuates from month to month among different search engines. There is a debate in the SEO community as to whether or not they help at all on certain search engines. In fact, in the summer of 2004 it appeared as if they were losing importance altogether.

However, you'll never be penalized on any search engine for using relevant targeted keywords in moderation, and they can only help you with most engines, especially Yahoo.

Avoid stuffing your keyword metatags with too many keywords. Just use relevant tags that apply directly to the content of that particular page, and don't overdo it.
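As a rough illustration of the three tags above, here is a minimal Python sketch, using only the standard library's html.parser, that pulls the title, description and keywords out of a page so you can eyeball them. The sample page and the class name are my own, not anything prescribed by the article.

# Minimal sketch: extract the title, description and keywords tags from a page.
from html.parser import HTMLParser

class MetaTagAudit(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.description = ""
        self.keywords = ""

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self.in_title = True
        elif tag == "meta":
            name = (attrs.get("name") or "").lower()
            if name == "description":
                self.description = attrs.get("content", "")
            elif name == "keywords":
                self.keywords = attrs.get("content", "")

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

page_html = """<html><head>
<title>Oregon Widgets - handmade widgets from Portland</title>
<meta name="description" content="Handmade widgets, shipped anywhere in Oregon.">
<meta name="keywords" content="widgets, oregon widgets, handmade widgets">
</head><body>...</body></html>"""

audit = MetaTagAudit()
audit.feed(page_html)
print("title      :", audit.title)
print("description:", audit.description)
print("keywords   :", audit.keywords)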

4. ALT Tags

The small yellow box that comes up when your mouse cursor is placed over an image is called the ALT tag. Every relevant image should have an alt tag with your keywords or phrases mentioned in the tag.

A proper ALT tag goes after the file name, and before the Align indicator.

* - The ALT tag is no longer being considered for ranking purposes by some search engines. That said, it still cannot HURT you, and will still help you with some engines. My recommendation is to continue to use them, but be sure to avoid keyword stuffing. Besides, who knows when the pendulum will swing back the other way?
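In the same spirit, and again only as a sketch of my own rather than anything from the article, a few lines of Python can flag images that are missing alt text entirely, which is usually the first thing to fix before worrying about keywords in the ALT tags.

# Minimal sketch: list <img> tags that have no alt attribute (or an empty one).
from html.parser import HTMLParser

class AltChecker(HTMLParser):
    def __init__(self):
        super().__init__()
        self.missing = []

    # html.parser also routes self-closing <img ... /> tags through this handler.
    def handle_starttag(self, tag, attrs):
        if tag == "img":
            attrs = dict(attrs)
            if not attrs.get("alt"):                 # absent or empty alt text
                self.missing.append(attrs.get("src", "(no src)"))

page = '<p><img src="widget.jpg"><img src="logo.png" alt="Oregon Widgets logo"></p>'
checker = AltChecker()
checker.feed(page)
print("images without alt text:", checker.missing)   # ['widget.jpg']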

5. Header Tags

The text of each page is given more weight by the search engines if you make use of header tags and then use descriptive body text below those headers. Bullet points work well too. It is not enough to merely BOLD or enlarge your text headlines.

6. Link Text

Search engine spiders cannot follow image links. In addition to having image links or buttons on your web pages, you should have text links at the bottom or elsewhere. The text that the user sees when looking at the link is called the link text. A link that reads "products" does not carry as much weight with the search engines as a link that reads "oregon widgets". Link text is very important and is actually one of the most frequently overlooked aspects of web design that I've seen.

7. Site Map

Using a site map not only makes it easy for your users to see the entire structure of your website, but it also makes it easier for the search engines to spider your site. When the search engine spiders come to visit, they will follow all of the text links from your main index page. If one of those links is to a site map, then the spiders will go right to the sitemap, and consequently visit every page you have text linked to from that site map. On the site map page, try to have a sentence or two describing each page, and not just a page of links.

8. Relevant Inbound Links

By relevant, I mean similar industry or subject related sites. Right now, no single strategy can get your site ranked higher faster than being linked to by dozens of other relevant websites. It used to be that the quantity of incoming links mattered most, but today it's much better to have three highly relevant links to you from other popular related websites than 30 links from unrelated, low-ranked sites. If there are other businesses in your industry that you can trade links with, it will help your site enormously. Link to others, and have them link to you. It's proven, and it works. To see who's linking to you, type the following into Google: link:yourdomain.com

9. Your Content

Not to be forgotten, of course, is the actual content of your webpage. It must be relevant, helpful information that people want to read. These days, each webpage should be laser focused on one specific product or subject in order to rank highly for that search phrase. The days of writing one webpage to appeal to dozens of search terms are long gone. Ideally, each page should have between 400 and 650 words on it. Too few, and the search engines won't consider it to be relevant enough. Too many words, and the search engine spiders may have a hard time determining the actual subject or focus of the page.

Use your keywords or phrases often, and use them at the beginning of your paragraphs wherever possible. Don't overuse them and make the page sound phony, but don't write a page about a certain subject, and not mention that subject repeatedly either. Reading it out loud to yourself is a great way to judge how natural your text sounds.

Concentrate on writing quality pages that actually appeal to the human reader. Write pages that provide the reader with exactly what they are looking for; that is, information about the exact search phrase they've entered.
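For a quick, rough check of the word-count and keyword-use advice above, a few lines of Python are enough. The 400-650 word range comes from the text; the sample copy and the helper name are only illustrative.

# Minimal sketch: count the words on a page and the mentions of a target phrase.
import re

def page_stats(text, phrase):
    words = re.findall(r"[a-z0-9']+", text.lower())
    mentions = text.lower().count(phrase.lower())
    return len(words), mentions

body = """Oregon widgets are handmade in Portland. Our oregon widgets ship
anywhere in the state, and every widget is tested before it leaves the shop."""

count, mentions = page_stats(body, "oregon widgets")
print(count, "words,", mentions, "mentions of the phrase")
if not 400 <= count <= 650:
    print("outside the suggested 400-650 word range")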

10. Avoid Cheating

With all of these tidbits of information, it's tempting to think that you can stuff 100 keywords into your title, or create a page with the phrase "oregon widget company" used 100 times in headers, text links, ALT tags, bullet points, etc., but that will not help you. In fact, it can penalize you and get your website banned from certain search engines.

Thursday, October 19, 2006

Click Fraud – Google and Yahoo

Some call it click fraud and some call it invalid click activity but nobody questions that questionable clicks happen. How they happen and to what degree might be up for debate but the existence of industrial level skullduggery is a widely known but poorly recognized secret of the search marketing industry.
This dirty little secret is made up of hundreds of millions of dollars, which are unevenly split between tens of thousands of participants. Some astute observers might suggest this constitutes a conspiracy to commit fraud and is therefore illegal behavior. In that, they would be half right. Click fraud isn't actually illegal, at least not in a criminal sense. Not yet anyway.

Google knows about issues associated with click fraud though it doesn't like using the term "fraud". The popularity of AdWords and ease of access to Google's lucrative AdSense distribution system gave online scammers the two essential elements, motive and opportunity. Google has been playing catch-up ever since. Yahoo knows a lot about click fraud as well. Extremely credible allegations have been made about Yahoo's commercial connections with click fraud artists. Many in the shadier side of the search marketing industry are in on the click-fraud secret as well, so much so that they publish how-to guides and self-congratulatory blogs. Frighteningly, spyware and malware makers are also in on the secret, often benefiting by becoming unwanted content-delivery partners.

Everyone's making a lot of money and, like the great insurance scams of yesteryear, the costs to the ultimate victims are so widely distributed they often seem negligible to individuals affected. Negligible is, of course, a relative term. Is $10 negligible to you? How about $100 or $1000 or more?

Another group of people who've learned a great deal about click fraud are the victims, which tend to be small online businesses. Unfortunately, it is harder to track the impact on smaller businesses because they often lack the resources to properly police their own web and account logs. SEO-News.com editor Kim Roach wrote a strong piece titled "Going Broke on Google AdWords" in SiteProNews last week.

Victims of click fraud come in all sizes. While Kim's article addressed damage done to smaller businesses, a July 2006 report by online industry research firm Outsell estimated click fraud accounted for $800 million of a total of $5.5 billion spent on search engine advertising in 2005. Outsell reached these figures by studying 407 online advertisers that collectively controlled about $1 billion in ad spending.

Google disputes these figures along with other figures reported by San Antonio based PPC-auditing firm Click Forensics in their monthly Click Fraud Index, questioning the methodology used in determining incidents of billed click fraud.

In an August 8, 2006 report titled "How Fictitious Clicks Occur in Third-Party Click Fraud Audit Reports", Google's Click Quality Team found that two serious issues served to consistently inflate incidents of "fictitious clicks" recorded in third-party analysis.

According to their findings, "The major root causes for fictitious clicks falls into the two following categories: ... detection of page reloads as ad-clicks", and, "... conflation across advertisers and ad networks, or the counting of one advertiser's traffic in another advertiser's reports."

The first cause of "fictitious clicks" noted by Google's team suggests that sometimes users reload pages, use their back buttons, or open landing pages in new windows. Each action could be interpreted as a second, third or fourth click where only one should have been recorded.

The second cause of "fictitious clicks" is a bit more confusing, stemming from the massive AdSense system Google uses to distribute AdWords advertising. Google has thousands (perhaps millions) of AdSense distribution partners. Some of these partners are extremely large corporations though most are independent webmasters. The largest of these partners often have complicated advertising systems that mix and match different types of ads (banners, AdWords, text-links, etc...) from different advertising platforms. They have their own internal system for measuring the effectiveness of these ads and sometimes those internal systems cause subtle incongruities in tracking ad-impressions. One such weirdness is the way some hits from the AOL network are grouped together under a single block of IP addresses.

Google's business product manager for trust and safety, Shuman Ghosemajumder, wrote a long post "About Invalid Clicks" to the Inside AdWords blog in early August. In it, Ghosemajumder suggests that surveys and reports from many click-analysts, including Click Forensics, do not give a full view of factors as seen by Google, or in some cases, by the analysts' own clients.

Ghosemajumder says Google's internal filters catch and cancel charges for the vast majority of invalid clicks, even though evidence of those clicks might be reflected in advertisers' log files.

"If an advertiser is monitoring clíck activity, these automatically filtered clicks may show up in an advertiser's logs, but not in their bills. When invalid clicks are detected after an advertiser is charged, we reimburse for them. Because of our detection efforts, losses to advertisers from invalid clicks are very small," Ghosemajumder's comments.

Moreover", he continues, "the study does not indicate whether the advertiser was actually charged for any of the clicks, only that the traffíc analysis suggested that the clicks may have been invalid."

It isn't just the number of billed clicks that has advertisers and analysts concerned. Another report produced by Click Forensics showed that of 170 advertisers in the financial services sector, over 6% reported recorded visits from sites clearly identified as parked domains.

Dozens of third-party firms engage in ad-stuffing documents on parked domains. If the general consensus is that an invalid click is one that is reasonably unlikely to lead to a potential conversion, then counted clicks from parked domains are assumed by many analysts to be invalid.

Parked domains are housed at URLs that, while owned, are not really operating as most Internet users might expect. A page at a parked domain is generally one containing very limited information and a number of PPC ads. The page is basically a placeholder being used as a billboard. Believe it or not, parked domains are said to drive a significant amount of traffic, especially under highly searched terms relating to health, travel and financial services. Given the number of click-throughs found in a report covering 170 financial services companies, 6% represents a huge number of clicks, each costing advertisers ten cents or more.

Jessie Stricchiola is considered one of the leading experts in click fraud detection and analysis. On the website of her firm, Alchemist Media, Stricchiola writes about an automated form of third-party click fraud that was publicly associated with Yahoo by click fraud researcher Ben Edelman.

"Another method of fraudulent clicking is initiated through automated clíck generation methods, using "hitbots" - software applications specifically designed to clíck on paid listings. This kind of activity is also initiated by both competitors and by search engine partners and/or affiliates, the latter often instituting extensive technology arrangements to enable their fraudulent clíck traffíc to slip past the internal filtering methods used by the CPC engines. For CPC affiliates, there is a vested interest in generating as much traffíc as possible to increase their portion of the shared revenue generated by paid listings. This is an often overlooked source of fraudulent clíck activity."

The examples cited in this article are only some of the ways used by the unscrupulous to scam unsuspecting advertisers in an environment that is extremely difficult to police properly. Indeed, in the absence of any external oversight, the only real watchdogs are the search engines themselves. Click auditors working with advertisers are, apparently, not privy to information used by Google and Yahoo to examine the issue and, therefore, are unable to effectively create reports recognized by the two largest search advertising providers.

While Google and Yahoo both claim to be working on finding solutions to the problems posed by invalid clicks, both admit they are months or even years from finding foolproof solutions. That both make a great deal of money from invalid clicks, clicks directed from parked domains, and other third-party shenanigans has led some to speculate that the major search engines might not be approaching the problem with missionary zeal. There is a lot of murky ground. The only obvious things about the issue are that there is an issue and that a solution, though not immediately forthcoming, is desperately needed.

Thursday, September 14, 2006

A search engine based on expert documents

In response to a query a search engine returns a ranked list of documents. If the query is broad (i.e., it matches many documents) then the returned list is usually too long to view fully. Studies show that users usually look at only the top 10 to 20 results. In this paper, we propose a novel ranking scheme for broad queries that places the most authoritative pages on the query topic at the top of the ranking. Our algorithm operates on a special index of "expert documents." These are a subset of the pages on the WWW identified as directories of links to non-affiliated sources on specific topics. Results are ranked based on the match between the query and relevant descriptive text for hyperlinks on expert pages pointing to a given result page. We present a prototype search engine that implements our ranking scheme and discuss its performance. With a relatively small (2.5 million page) expert index, our algorithm was able to perform comparably on broad queries with the best of the mainstream search engines.
1 Introduction
When searching the WWW broad queries tend to produce a large result set. This set is hard to rank based on content alone, since the quality and "authoritativeness" of a page (namely, a measure of how authoritative the page is on the subject) cannot be assessed solely by analyzing its content. In traditional information retrieval we make the assumption that the articles in the corpus originate from a reputable source and all words found in an article were intended for the reader. These assumptions do not hold on the WWW since content is authored by sources of varying quality and words are often added indiscriminately to boost the page's ranking. For example, some pages are created to purposefully mislead search engines, and are known popularly as "spam" pages. The most virulent of spam techniques involves deliberately returning someone else's popular page to search engine robots instead of the actual page, to steal their traffic. Even when there is no intention to mislead search engines, the WWW tends to be crowded with information on topics popular with users. Consequently, for broad queries keyword matching seems inadequate.
Prior approaches that have used content analysis to rank broad queries on the WWW cannot distinguish between authoritative and non-authoritative pages (e.g., they fail to detect spam pages). Hence the ranking tends to be poor and search services have turned to other sources of information besides content to rank results. We next describe some of these ranking strategies, followed by our new approach to authoritative ranking - which we call Hilltop.

1.1 Related Work
Three approaches to improve the authoritativeness of ranked results have been taken in the past:
Ranking Based on Human Classification: Human editors have been used by companies such as Yahoo! and Mining Company to manually associate a set of categories and keywords with a subset of documents on the web. These are then matched against the user's query to return valid matches. The trouble with this approach is that: (a) it is slow and can only be applied to a small number of pages, and (b) often the keywords and classifications assigned by the human judges are inadequate or incomplete. Given the rate at which the WWW is growing and the wide variation in queries this is not a comprehensive solution.

Ranking Based on Usage Information: Some services such as DirectHit collect information on: (a) the queries individual users submit to search services and (b) the pages they look at subsequently and the time spent on each page. This information is used to return pages that most users visit after deploying the given query. For this technique to succeed a large amount of data needs to be collected for each query. Thus, the potential set of queries on which this technique applies is small. Also, this technique is open to spamming.

Ranking Based on Connectivity: This approach involves analyzing the hyperlinks between pages on the web on the assumption that: (a) pages on the topic link to each other, and (b) authoritative pages tend to point to other authoritative pages.

PageRank [Page et al 98] is an algorithm to rank pages based on assumption b. It computes a query-independent authority score for every page on the Web and uses this score to rank the result set. Since PageRank is query-independent it cannot by itself distinguish between pages that are authoritative in general and pages that are authoritative on the query topic. In particular a web-site that is authoritative in general may contain a page that matches a certain query but is not an authority on the topic of the query. In particular, such a page may not be considered valuable within the community of users who author pages on the topic of the query.

An alternative to PageRank is Topic Distillation [Kleinberg 97, Chakrabarti et al 98, Bharat et al 98, Chakrabarti et al 99]. Topic distillation first computes a query specific subgraph of the WWW. This is done by including pages on the query topic in the graph and ignoring pages not on the topic. Then the algorithm computes a score for every page in the subgraph based on hyperlink connectivity: every page is given an authority score. This score is computed by summing the weights of all incoming links to the page. For each such reference, its weight is computed by evaluating how good a source of links the referring page is. Unlike PageRank, Topic Distillation is only applicable to broad queries, since it requires the presence of a community of pages on the topic.

A problem with Topic Distillation is that computing the subgraph of the WWW which is on the query topic is hard to do in real-time. In the ideal case every page on the WWW that deals with the query topic would need to be considered. In practice an approximation is used. A preliminary ranking for the query is done with content analysis. The top ranked result pages for the query are selected. This creates a selected set. Then, some of the pages within one or two links from the selected set are also added to the selected set if they are on the query topic. This approach can fail because it is dependent on the comprehensiveness of the selected set for success. A highly relevant and authoritative page may be omitted from the ranking by this scheme if it either did not appear in the initial selected set, or some of the pages pointing to it were not added to the selected set. A "focused crawling" procedure to crawl the entire web to find the complete subgraph on the query's topic has been proposed [Chakrabarti et al 99] but this is too slow for online searching. Also, the overhead in computing the full subgraph for the query is not warranted since users only care about the top ranked results.

1.2 Hilltop Algorithm Overview
Our approach is based on the same assumptions as the other connectivity algorithms, namely that the number and quality of the sources referring to a page are a good measure of the page's quality. The key difference consists in the fact that we are only considering "expert" sources - pages that have been created with the specific purpose of directing people towards resources. In response to a query, we first compute a list of the most relevant experts on the query topic. Then, we identify relevant links within the selected set of experts, and follow them to identify target web pages. The targets are then ranked according to the number and relevance of non-affiliated experts that point to them. Thus, the score of a target page reflects the collective opinion of the best independent experts on the query topic. When such a pool of experts is not available, Hilltop provides no results. Thus, Hilltop is tuned for result accuracy and not query coverage.
Our algorithm consists of two broad phases:

(i) Expert Lookup

We define an expert page as a page that is about a certain topic and has links to many non-affiliated pages on that topic. Two pages are non-affiliated conceptually if they are authored by authors from non-affiliated organizations. In a pre-processing step, a subset of the pages crawled by a search engine are identified as experts. In our experiment we classified 2.5 million of the 140 million or so pages in AltaVista's index to be experts. The pages in this subset are indexed in a special inverted index.

Given an input query, a lookup is done on the expert-index to find and rank matching expert pages. This phase computes the best expert pages on the query topic as well as associated match information.

(ii) Target Ranking

We believe a page is an authority on the query topic if and only if some of the best experts on the query topic point to it. Of course in practice some expert pages may be experts on a broader or related topic. If so, only a subset of the hyperlinks on the expert page may be relevant. In such cases the links being considered have to be carefully chosen to ensure that their qualifying text matches the query. By combining relevant out-links from many experts on the query topic we can find the pages that are most highly regarded by the community of pages related to the query topic. This is the basis of the high relevance that our algorithm delivers.

Given the top ranked matching expert-pages and associated match information, we select a subset of the hyperlinks within the expert-pages. Specifically, we select links that we know to have all the query terms associated with them. This implies that the link matches the query. With further connectivity analysis on the selected links we identify a subset of their targets as the top-ranked pages on the query topic. The targets we identify are those that are linked to by at least two non-affiliated expert pages on the topic. The targets are ranked by a ranking score which is computed by combining the scores of the experts pointing to the target.

1.3 Roadmap
The rest of the paper is organized as follows: Section 2 describes the selection and indexing of expert documents; Section 3 provides a detailed description of the ranking scheme used in query processing; Section 4 presents a user-based evaluation of our prototype implementation; and Section 5 concludes the paper.
2 Expert Documents
Broad subjects are well represented on the Web and as such are also likely to have numerous human-generated lists of resources. There is value for the individual or organization that creates resource lists on specific topics since this boosts their popularity and influence within the community interested in the topic. The authors of these lists thus have an incentive to make their lists as comprehensive and up to date as possible. We regard these links as recommendations, and the pages that contain them, as experts. The problem is, how can we distinguish an expert from other types of pages? In other words what makes a page an expert? We felt that an expert page needs to be objective and diverse: that is, its recommendations should be unbiased and point to numerous non-affiliated pages on the subject. Therefore, in order to find the experts, we needed to detect when two sites belong to the same or related organizations.
2.1 Detecting Host Affiliation
We define two hosts as affiliated if one or both of the following is true:
They share the same first 3 octets of the IP address.
The rightmost non-generic token in the hostname is the same.
We consider tokens to be substrings of the hostname delimited by "." (period). A suffix of the hostname is considered generic if it is a sequence of tokens that occur in a large number of distinct hosts. E.g., ".com" and ".co.uk" are domain names that occur in a large number of hosts and are hence generic suffixes. Given two hosts, if the generic suffix in each case is removed and the subsequent right-most token is the same, we consider them to be affiliated.

E.g., in comparing "www.ibm.com" and "ibm.co.mx" we ignore the generic suffixes ".com" and ".co.mx" respectively. The resulting rightmost token is "ibm", which is the same in both cases. Hence they are considered to be affiliated. Optionally, we could require the generic suffix to be the same in both cases.

The affiliation relation is transitive: if A and B are affiliated and B and C are affiliated then we take A and C to be affiliated even if there is no direct evidence of the fact. In practice some non-affiliated hosts may be classified as affiliated, but that is acceptable since this relation is intended to be conservative.

In a preprocessing step we construct a host-affiliation lookup. Using a union-find algorithm we group hosts, that either share the same rightmost non-generic suffix or have an IP address in common, into sets. Every set is given a unique identifier (e.g., the host with the lexicographically lowest hostname). The host-affiliation lookup maps every host to its set identifier or to itself (when there is no set). This is used to compare hosts. If the lookup maps two hosts to the same value then they are affiliated; otherwise they are non-affiliated.
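A sketch of how such a host-affiliation lookup might be built is given below; this is my own illustration, not the authors' code. The GENERIC suffix list is a small hard-coded stand-in (the paper derives generic suffixes from how many distinct hosts share them), and a union-find structure groups hosts that share either a rightmost non-generic hostname token or the first three octets of an IP address.

# Sketch of a host-affiliation lookup: union-find over hostname tokens and IP prefixes.
GENERIC = {"com", "co.uk", "co.mx", "org", "net"}   # stand-in for learned generic suffixes

parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]               # path halving
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

def key_token(host):
    """Rightmost hostname token after stripping a generic suffix."""
    tokens = host.lower().split(".")
    for cut in (2, 1):                              # try a two-token, then a one-token suffix
        if ".".join(tokens[-cut:]) in GENERIC and len(tokens) > cut:
            return tokens[-cut - 1]
    return tokens[-1]

def add_host(host, ip):
    union(host, "token:" + key_token(host))
    union(host, "ip:" + ".".join(ip.split(".")[:3]))

def affiliated(h1, h2):
    return find(h1) == find(h2)

add_host("www.ibm.com", "129.42.38.10")             # sample hosts and addresses, made up
add_host("ibm.co.mx", "200.14.7.3")
print(affiliated("www.ibm.com", "ibm.co.mx"))       # True: both reduce to the token "ibm"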

2.2 Selecting the Experts
In this step we process a search engine's database of pages (we used AltaVista's crawl from April 1999) and select a subset of pages which we consider to be good sources of links on specific topics, albeit unknown. This is done as follows:
Considering all pages with out-degree greater than a threshold, k (e.g., k=5) we test to see if these URLs point to k distinct non-affiliated hosts. Every such page is considered an expert page.

If a broad classification (such as Arts, Science, Sports etc.) is known for every page in the search engine database then we can additionally require that most of the k non-affiliated URLs discovered in the previous step point to pages that share the same broad classification. This allows us to distinguish between random collections of links and resource directories. Other properties of the page such as regularity in formatting can be used as well.
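A rough sketch of this selection test follows, under the simplifying assumption of a naive affiliation check; the real system would use the host-affiliation lookup described above, and the broad-classification filter is omitted.

# Sketch: a page is an expert if its out-links reach at least k distinct,
# mutually non-affiliated hosts (k = 5 in the text).
def affiliated(h1, h2):
    # Naive stand-in: compare the token just before the last one ("ibm" in "www.ibm.com").
    second_last = lambda h: h.lower().split(".")[-2]
    return second_last(h1) == second_last(h2)

def is_expert(outlink_hosts, k=5):
    reps = []                                       # one host kept per affiliation group seen
    for host in outlink_hosts:
        if all(not affiliated(host, r) for r in reps):
            reps.append(host)
        if len(reps) >= k:
            return True
    return False

outlinks = ["www.ibm.com", "shop.ibm.com", "widgets.org", "oregon-widgets.net",
            "portlandwidgets.com", "widgetnews.com"]
print(is_expert(outlinks))                          # True: six links, five non-affiliated hosts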

2.3 Indexing the Experts
To locate expert pages that match user queries we create an inverted index to map keywords to experts on which they occur. In doing so we only index text contained within "key phrases" of the expert. A key phrase is a piece of text that qualifies one or more URLs in the page. Every key phrase has a scope within the document text. URLs located within the scope of a phrase are said to be "qualified" by it. For example, the title, headings (e.g., text within a pair of heading tags such as <H1>) and anchor text within the expert page are considered key phrases. The title has a scope that qualifies all URLs in the document. A heading's scope qualifies all URLs until the next heading of the same or greater importance. An anchor's scope only extends over the URL it is associated with.
The inverted index is organized as a list of match positions within experts. Each match position corresponds to an occurrence of a certain keyword within a key phrase of a certain expert page. All match positions for a given expert occur in sequence for a given keyword. At every match position we also store:

An identifier to identify the phrase uniquely within the document
A code to denote the kind of phrase it is (title, heading or anchor)
The offset of the word within the phrase.
In addition, for every expert we maintain the list of URLs within it (as indexes into a global list of URLs) and for each URL we maintain the identifiers of the key phrases that qualify it.
To avoid giving long key phrases an advantage, the number of keywords within any key phrase is limited (e.g., to 32).
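The index layout described in this subsection might be sketched with data structures along the following lines; the field and class names are mine, not the paper's.

# Sketch of the expert index: each keyword maps to match positions, and every
# match position records the key phrase it sits in, the phrase kind, and the
# word offset inside the phrase (phrases capped at 32 keywords).
from dataclasses import dataclass, field
from collections import defaultdict

TITLE, HEADING, ANCHOR = 0, 1, 2                    # phrase-kind codes

@dataclass
class MatchPosition:
    expert_id: int                                  # which expert page
    phrase_id: int                                  # identifies the key phrase within that page
    phrase_kind: int                                # TITLE, HEADING or ANCHOR
    word_offset: int                                # position of the keyword inside the phrase

@dataclass
class ExpertPage:
    urls: list = field(default_factory=list)        # indexes into a global URL list
    phrases_for_url: dict = field(default_factory=dict)   # URL index -> list of phrase ids

inverted_index = defaultdict(list)                  # keyword -> [MatchPosition, ...]
experts = {}                                        # expert id -> ExpertPage

# Index one toy expert page whose title is "oregon widget directory".
experts[7] = ExpertPage(urls=[0, 1], phrases_for_url={0: [0], 1: [0]})
for offset, word in enumerate("oregon widget directory".split()[:32]):
    inverted_index[word].append(
        MatchPosition(expert_id=7, phrase_id=0, phrase_kind=TITLE, word_offset=offset))

print(inverted_index["widget"])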

3 Query Processing
In response to a user query, we first determine a list of N experts that are the most relevant for that query. E.g. N = 200 in our experiment. Then, we rank results by selectively following the relevant links from these experts and assigning an authority score to each such page. In this section we describe how the expert and authority scores are computed.
3.1 Computing the Expert Score
For an expert to be useful in response to a query, the minimum requirement is that there is at least one URL which contains all the query keywords in the key phrases that qualify it. A fast approximation is to require all query keywords to occur in the document. Furthermore, we assign to each candidate expert a score reflecting the number and importance of the key phrases that contain the query keywords, as well as the degree to which these phrases match the query.
Thus, we compute the score of an expert as a 3-tuple of the form (S0, S1, S2). Let k be the number of terms in the input query, q. The component Si of the score is computed by considering only key phrases that contain precisely k - i of the query terms. E.g., S0 is the score computed from phrases containing all the query terms.

Si = SUM{key phrases p with k - i query terms} LevelScore(p) * FullnessFactor(p, q)

LevelScore(p) is a score assigned to the phrase by virtue of the type of phrase it is. For example, in our implementation we use a LevelScore of 16 for title phrases, 6 for headings and 1 for anchor text. This is based on the assumption that the title text is more useful than the heading text, which is more useful than an anchor text match in determining what the expert page is about.

FullnessFactor(p, q) is a measure of the number of terms in p covered by the terms in q. Let plen be the length of p. Let m be the number of terms in p which are not in q (i.e., surplus terms in the phrase). Then, FullnessFactor(p, q) is computed as follows:

If m <= 2, FullnessFactor(p, q) = 1
If m > 2, FullnessFactor(p, q) = 1 - (m - 2) / plen
Our goal is to prefer experts that match all of the query keywords over experts that match all but one of the keywords, and so on. Hence we rank experts first by S0. We break ties by S1 and further ties by S2. The score of each expert is converted to a scalar by the weighted summation of the three components:
Expert_Score = 2^32 * S0 + 2^16 * S1 + S2.
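Putting this subsection together, a simplified expert-scoring routine could look like the sketch below. The LevelScore values (16, 6, 1), the FullnessFactor rule and the 2^32/2^16 weighting follow the text; the (kind, text) phrase representation is my own simplification.

# Sketch of the expert score: sum LevelScore * FullnessFactor over key phrases,
# bucketed by how many query terms each phrase misses, then fold into one scalar.
LEVEL_SCORE = {"title": 16, "heading": 6, "anchor": 1}

def fullness_factor(phrase_words, query_words):
    plen = len(phrase_words)
    m = len([w for w in phrase_words if w not in query_words])   # surplus terms in the phrase
    return 1.0 if m <= 2 else 1.0 - (m - 2) / plen

def expert_score(key_phrases, query):
    q = set(query.lower().split())
    k = len(q)
    s = [0.0, 0.0, 0.0]                             # S0, S1, S2
    for kind, text in key_phrases:
        words = text.lower().split()
        i = k - len(q & set(words))                 # number of query terms the phrase is missing
        if 0 <= i <= 2:
            s[i] += LEVEL_SCORE[kind] * fullness_factor(words, q)
    return (2 ** 32) * s[0] + (2 ** 16) * s[1] + s[2]

phrases = [("title", "oregon widget resources"),
           ("heading", "widget makers"),
           ("anchor", "cheap widgets in oregon")]
print(expert_score(phrases, "oregon widget"))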




3.2 Computing the Target Score
We consider the top N experts by the ranking from the previous step (e.g., the top 200) and examine the pages they point to. These are called targets. It is from this set of targets that we select top ranked documents. For a target to be considered it must be pointed to by at least 2 experts on hosts that are mutually non-affiliated and are not affiliated to the target. For all targets that qualify we compute a target score reflecting both the number and relevance of the experts pointing to it and the relevance of the phrases qualifying the links.
The target score T is computed in three steps:

For every expert E that points to target T we draw a directed edge (E,T). Consider the following "qualification" relationship between key phrases and edges:
The title phrase qualifies all edges coming out of the expert
A heading qualifies all edges whose corresponding hyperlinks occur in the document after the given heading and before the next heading of equal or greater importance.
A hyperlink's anchor text qualifies the edge corresponding to the hyperlink.

For each query keyword w, let occ(w, T) be the number of distinct key phrases in E that contain w and qualify the edge (E,T). We define an "edge score" for the edge (E,T) represented by Edge_Score(E,T), which is computed thus:
If occ(w, T) is 0 for any query keyword then the Edge_Score(E,T) = 0.
Otherwise, Edge_Score(E,T) = Expert_Score(E) * Sum{query keywords w} occ(w, T)


We next check for affiliations between expert pages that point to the same target. If two affiliated experts have edges to the same target T, we then discard one of the two edges. Specifically, we discard the edge which has the lower Edge_Score of the two.
To compute the Target_Score of a target we sum the Edge_Scores of all edges incident on it.
The list of targets is ranked by Target_Score. Optionally, this list can be filtered by testing if the query keywords are present in the targets. Optionally, we can match the query keywords against each target to compute a Match_Score using content analysis, and combine the Target_Score with the Match_Score before ranking the targets.
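A compact sketch of this target-ranking step is given below. The edge-score formula, the one-edge-per-affiliated-group rule and the requirement of at least two non-affiliated supporting experts follow the text; the data layout, the group_of() helper and the sample numbers are mine, and the check that experts are not affiliated with the target itself is omitted for brevity.

# Sketch of target ranking: score expert->target edges, keep one edge per
# affiliated group of experts, and sum edge scores per target.
def group_of(host):
    # Stand-in for the host-affiliation lookup: group by the second-to-last token.
    return host.lower().split(".")[-2]

def edge_score(expert_score, phrases, query_words):
    # occ(w, T): number of distinct qualifying key phrases containing query word w.
    occ = {w: sum(1 for p in phrases if w in p.lower().split()) for w in query_words}
    return 0.0 if 0 in occ.values() else expert_score * sum(occ.values())

def rank_targets(edges, query):
    """edges: list of (expert_host, target_url, expert_score, qualifying_phrases)."""
    qw = query.lower().split()
    best = {}                                       # (affiliation group, target) -> best edge score
    for host, target, score, phrases in edges:
        key = (group_of(host), target)
        best[key] = max(best.get(key, 0.0), edge_score(score, phrases, qw))
    totals, support = {}, {}
    for (group, target), e in best.items():
        if e > 0:
            totals[target] = totals.get(target, 0.0) + e
            support.setdefault(target, set()).add(group)
    ranked = {t: s for t, s in totals.items() if len(support[t]) >= 2}
    return sorted(ranked.items(), key=lambda kv: -kv[1])

edges = [
    ("dir.widgetlinks.com", "http://oregonwidgets.com", 100.0, ["oregon widget makers"]),
    ("lists.craftsites.org", "http://oregonwidgets.com", 80.0, ["widget shops in oregon"]),
    ("dir.widgetlinks.com", "http://widgetworld.net", 100.0, ["widget news"]),
]
print(rank_targets(edges, "oregon widget"))         # only oregonwidgets.com has two supporters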

4 Evaluation
In order to evaluate our prototype search engine, we conducted two user studies aiming to estimate the recall and precision. Both experiments also involved three other search engines, namely AltaVista, DirectHit and Google, for comparison and were done in August 1999. Note that the current rankings by these engines may differ.
4.1 Locating Specific Popular Targets
For the first experiment we asked seven volunteers to suggest the home pages of ten organizations of their choice (companies, universities, stores, etc.). Some of the queries are reproduced in the table below:


Alpha Phi Omega, Best Buy, Digital, Disneyland,
Dollar Bank, Grouplens, INRIA, Keebler,
Mountain View Public Library, Macy's, Minneapolis City Pages, Moscow Aviation Institute,
MENSA, OCDE, ONU, Pittsburg Steelers,
Pizza Hut, Rice University, SONY, Safeway,
Stanford Shopping Center, Trek Bicycle, USTA, Vanguard Investments

The same query was sent to all four search engines. We assume that there is exactly one home page in each case. Every time the home page was found within the first ten results, its rank was recorded. Figure 2 summarizes the average recall for the ranks 1 to 10 for each of the four engines: our engine Hilltop (HT), Google (GG), AltaVista (AV), and DirectHit (DH). Average recall at rank k for this experiment is the probability of finding the desired home page within the first k results.



Figure 2. Average Recall vs. Rank

Our engine performed well on these queries. Thus, for about 87% of the queries, Hilltop returned the desired page as the first result, comparable with Google at 80% of the queries, while DirectHit and AltaVista succeeded at rank 1 only in 43% and 20% of the cases, respectively. As we look at more results, the average recall increases to 100% for Google, 97% for Hilltop, 83% for DirectHit, and 30% for AltaVista.

4.2 Gathering Relevant Pages
In order to estimate Hilltop's ability to generate a good first page of results for broad queries, we asked our volunteers to think of broad topics (i.e., topics for which it is likely that many good pages exist) and formulate queries. We collected 25 such queries, listed below:


Aerosmith, Amsterdam, backgrounds, chess, dictionary,
fashion, freeware, FTP search, Godzilla, Grand Theft Auto,
greeting cards, Jennifer Love Hewitt, Las Vegas, Louvre, Madonna,
MEDLINE, MIDI, newspapers, Paris, people search,
real audio, software, Starr report, tennis, UFO

We then used a script to submit each query to all four search engines and collect the top 10 results from each engine, recording for each result the URL, the rank, and the engine that found it. We needed to determine which of the results were relevant in an unbiased manner. For each query we generated the list of unique URLs in the union of the results from all engines. This list was then presented to a judge in a random order, without any information about the ranks of page or their originating engine. The judge rated each page for relevance to the given query on a binary scale (1 = "good page on the topic", 0 = "not relevant or not found"). Then, another script combined these ratings with the information about provenance and rank and computed the average precision at rank k (for k = 1, 5, and 10). The results are summarized in Figure 3.



Figure 3. Average Precision at Rank k
These results indicate that for broad subjects our engine returns a large percentage of highly relevant pages among the ten best ranked pages, comparable with Google and DirectHit, and better than AltaVista. At rank 1 both Hilltop and DirectHit have an average precision of 0.92. Average precision at 10 for Hilltop was 0.77, roughly equal to the best search engine, namely Google, with a precision of 0.79 at rank 10.

5 Conclusions
We described a new ranking algorithm for broad queries called Hilltop and the implementation of a search engine based on it. Given a broad query Hilltop generates a list of target pages which are likely to be very authoritative pages on the topic of the query. This is by virtue of the fact that they are highly valued by pages on the WWW which address the topic of the query. In computing the usefulness of a target page from the hyperlinks pointing to it, we only consider links originating from pages that seem to be experts. Experts in our definition are directories of links pointing to many non-affiliated sites. This is an indication that these pages were created for the purpose of directing users to resources, and hence we regard their opinion as valuable. Additionally, in computing the level of relevance, we require a match between the query and the text on the expert page which qualifies the hyperlink being considered. This ensures that hyperlinks being considered are on the query topic. For further accuracy, we require that at least 2 non-affiliated experts point to the returned page with relevant qualifying text describing their linkage. The result of the steps described above is to generate a listing of pages that are highly relevant to the user's query and of high quality.
Hilltop most resembles the connectivity techniques, PageRank and Topic Distillation. Unlike PageRank our technique is a dynamic one and considers connectivity in a graph specifically about the query topic. Hence, it can evaluate relevance of content from the point of view of the community of authors interested in the query topic. Unlike Topic Distillation we enumerate and consider all good experts on the subject and correspondingly all good target pages on the subject. In order to find the most relevant experts we use a custom keyword-based approach, focusing only on the text that best captures the domain of expertise (the document title, section headings and hyperlink anchor-text). Then, in following links, we boost the score of those targets whose qualifying text best matches the query. Thus, by combining content and connectivity analysis, we are both more comprehensive and more precise. An important property is that unlike Topic Distillation approaches, we can prove that if a page does not appear in our output it lacks the connectivity support to justify its inclusion. Thus we are less prone to omit good pages on the topic, which is a problem with Topic Distillation systems. Also, since we use an index optimized to finding experts, our implementation uses less data than Topic Distillation and is therefore faster.

The indexing of anchor-text was first suggested in WWW Worm [McBryan 94]. In some Topic Distillation systems such as Clever [Chakrabarti et al 1998] and in the Google search engine [Page et al 98] anchor-text is considered in evaluating a link's relevance. We generalize this to other forms of text that are seen to "qualify" a hyperlink at its source, and include headings and title-text as well. Also, unlike Topic Distillation systems, we evaluate experts on their content match to the user's query, rather than on their linkage to good target pages. This prevents the scores of "niche experts" (i.e., experts that point to new or relatively poorly connected pages) from being driven to zero, as is often the case in Topic Distillation algorithms.

In a blind evaluation we found that Hilltop delivers a high level of relevance given broad queries, and performs comparably to the best of the commercial search engines tested.