Study of Search Engine Performance
Purpose
Search engines are tools designed to help people find information on the Internet. More than half of the 100 million individuals who look for health information on the Internet use search engines to do so (Taylor, 2001). Many Internet users search with just a few engines, although more than 3,600 different ones exist (Search Engine Guide, 2001). Most search engines use a combination of methods to identify potentially relevant material: a crawler component (a computer-generated index of Web sites) as well as a directory component (a manually edited index of Web sites) (Search Engine Watch, 2001). After identifying material, search engines employ different methods to rank the sites they have found. Examples of ranking methods include:
In our assessment of search engine performance, we focused on two basic issues. Specifically, we wanted to know:
Methods
Because this study is one of the first to attempt to characterize the way in which search engines operate to help consumers find online health information, we designed a structured evaluation of each medical condition rather than observing the experiences of typical users. This section describes how we selected the 14 search engines included in the study, the methods that were used to evaluate the performance of those search engines, and the analytic techniques used to assess the results. Figure 2.1 illustrates the flow of the search engine study.
Selecting Search Engines. Working with three search engine experts[1], we selected 10 English-language and 4 Spanish-language search engines for this part of the study. We chose search engines based either on their popularity (number of unique visitors per month as reported by Media Metrix, Inc. in June 2000) or on the method by which the search engine ranked Web sites. Three of the English-language and two of the Spanish-language search engines were chosen based on popularity. The remaining 9 search engines were selected because they featured unique methods of ranking Web sites.
The ten English-language search engines selected were: Altavista, Ask Jeeves, Direct Hit, Excite, Google, Goto, Lycos, Metacrawler, Northern Light and Yahoo. The four Spanish-language search engines selected were: Quepasa, TeRespondo, Yahoo Espanol, and Yupi. Tables 2.1 and 2.2 describe some of the characteristics of these search engines.
Training Searchers and Coders
The first phase of our study involved searching the selected search engines for links addressing the selected medical conditions and then categorizing the pages found according to the relevance of their content. To do so, five experienced computer users with college degrees but no medical training served as searchers for the English-language search engines. All had previous experience using the Internet. Two fluent Spanish-speakers from UCLA and RAND served as the searchers for the Spanish-language search engines.
During a full-day training session using Catch the Web® software (described in detail below), searchers learned how to define and identify relevant links and save this information into a database. An inter-rater reliability test was performed for each independent task that the searchers were required to perform. The study did not begin until the inter-rater reliability of the group was acceptable (defined as a kappa statistic of greater than 0.8 for each task).
Four coders were then trained to systematically categorize the relevance of content on the Web pages that had been saved by the searchers. The coders for the English materials were health-services researchers with graduate training in public health. The coders for the Spanish materials were fluent Spanish-speakers with graduate degrees. A full-day training session was conducted, including multiple standardized exercises designed to train coders to recognize content related to the four conditions of interest. An inter-rater reliability test was performed for each independent task that the coders were required to perform. Classification of the content pages did not begin until the inter-rater reliability of the group was adequate (defined as a kappa statistic of greater than 0.8 for each task).
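The kappa threshold described above can be illustrated with a short calculation. The report does not say which kappa variant was used for the group of raters, so the following is only a minimal sketch of Cohen's kappa for a single pair of raters; the function name, the labels ("rel", "not") and the ratings themselves are hypothetical and are not study data.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters who labelled the same items."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(ratings_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a = Counter(ratings_a)
    counts_b = Counter(ratings_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical training exercise: two searchers label the same ten links.
searcher_1 = ["rel", "rel", "not", "rel", "not", "not", "rel", "not", "rel", "not"]
searcher_2 = ["rel", "rel", "not", "rel", "not", "rel", "rel", "not", "rel", "not"]
kappa = cohen_kappa(searcher_1, searcher_2)
print(f"kappa = {kappa:.2f}")
```

Kappa corrects raw agreement for the agreement expected by chance, which is why it is a stricter criterion than simple percent agreement.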
Evaluating Characteristics of Top Web Sites Listed by Search Engines
To answer our first research question about search engines, we compared the extent to which search engines identify the same or different Web sites (measured as degree of overlap), when searchers entered 4 simple search terms into the query boxes of each of the 14 search engines studied. The search terms were:
Because prior research suggests that individuals searching online tend to look at the Web sites listed first following a search (Cyber Dialogue, 2001), we compared the degree of overlap only among the first ten Web sites identified by each of the search engines.
Comparing the Efficiency of Search Engines
To answer our second research question about search engines, we examined the efficiency of search engines in two ways. First, we compared the proportion of relevant links on the first page of results on each search engine. Second, we compared the proportion of relevant content pages obtained by following 10 relevant links on the first page of results on each search engine. These comparisons were based on the search engine results obtained by entering the same search terms described in the preceding paragraph.
We defined a link as relevant if (1) the search term of interest (i.e., breast cancer, childhood asthma, depression or obesity) was present in the link itself or in the surrounding text or (2) one of 30-40 related key terms (e.g., tamoxifen, inhaler, gastric bypass surgery, St. John's Wort) was present in the link itself or in the surrounding text. The list of related terms created for the English-language search engines was modified for cultural appropriateness and translated for use on the Spanish-language search engines. The lists are provided in Appendix A.
After all relevant links on the first page of each search engine's list were identified, searchers followed a sample of ten relevant links to determine whether they led to relevant content. The ten relevant links were chosen as follows:
Searchers clicked on the selected relevant links until they reached a Web page with content (defined as when 50% of the space occupied contained text that was not primarily an index of the Web site). If the first relevant link led to a content page, they saved the page for further analysis. If the first relevant link led to another list of links, the searcher identified the first 15 relevant links on that page, and one of those 15 links was selected at random for the searcher to follow. This process was repeated up to ten times. If after 10 of these cycles searchers had not reached a content page, the last page was saved for analysis.
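The relevance rule above can be sketched in code. In the study the judgment was made by trained human searchers, so the following is only an illustrative approximation that assumes simple case-insensitive substring matching; the function name and the short term list are hypothetical (the full lists are in Appendix A and are not reproduced here).

```python
def is_relevant_link(link_text, surrounding_text, search_term, related_terms):
    """Return True if the search term or any related key term appears
    in the link itself or in the text surrounding it."""
    haystack = f"{link_text} {surrounding_text}".lower()
    terms = [search_term, *related_terms]
    return any(term.lower() in haystack for term in terms)


# Hypothetical related terms for one condition (not the Appendix A list).
related = ["tamoxifen", "mammogram"]

examples = [
    ("Tamoxifen side effects", "Patient information sheet"),
    ("Support groups", "For women living with breast cancer"),
    ("Cheap flights", "Book your holiday now"),
]
flags = [is_relevant_link(link, context, "breast cancer", related)
         for link, context in examples]
proportion_relevant = sum(flags) / len(flags)
print(flags, proportion_relevant)
```

The proportion computed on the last lines corresponds to the first efficiency measure, the share of first-page links judged relevant.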
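The link-following procedure above can be sketched as a small simulation. This is a hedged illustration, not the study's tooling: the page dictionary, its keys (`is_content`, `relevant_links`), the function name and the handling of a page with no relevant links are all assumptions made for the example.

```python
import random


def follow_to_content(start_url, pages, rng, max_cycles=10, max_candidates=15):
    """Follow relevant links until a content page is reached.

    Returns (saved_url, reached_content, clicks). `pages` maps each URL to a
    dict with `is_content` (bool) and `relevant_links` (list of URLs in page
    order); this structure is hypothetical.
    """
    if max_cycles < 1:
        raise ValueError("max_cycles must be at least 1")
    url = start_url
    for clicks in range(1, max_cycles + 1):
        page = pages[url]
        if page["is_content"]:
            return url, True, clicks
        # Only the first 15 relevant links on a list page are candidates.
        candidates = page["relevant_links"][:max_candidates]
        # After the final cycle the last page is saved. Stopping at a page
        # with no relevant links is an assumption; the report does not
        # describe that case.
        if clicks == max_cycles or not candidates:
            return url, False, clicks
        url = rng.choice(candidates)


# Hypothetical two-level site: a list page that links to a content page.
pages = {
    "list": {"is_content": False, "relevant_links": ["article"]},
    "article": {"is_content": True, "relevant_links": []},
    "loop": {"is_content": False, "relevant_links": ["loop"]},
}
print(follow_to_content("list", pages, random.Random(0)))
print(follow_to_content("loop", pages, random.Random(0)))
```

Passing the random number generator in explicitly keeps the random selection reproducible, which matters when the same sample of links has to be revisited by coders.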
Characterizing the Content on Web Pages. Trained coders used a rating form to characterize the content on the Web pages identified by the search study described in the preceding paragraph. The content on each Web page was classified according to: (1) relevance of the identified content; (2) type of medical content (e.g., alternative, allopathic); and (3) number of advertisements. Web pages were considered to be relevant if they had any material related to the search terms (breast cancer, childhood asthma, depression and obesity). Sponsorship was classified by type of site providing the relevant content (e.g., advocacy, medical organization, e-health). Web pages that sold information, services or goods were classified as having promotional material (defined as material designed to encourage site visitors to purchase products or services or participate in research programs sponsored by the site). Advertisements were classified separately from promotional material and defined as advertising material only if physically located in one of two specific positions (banner or sidebar) on the Web page.
Data Collection and Management
The study was conducted between September 25 and September 28, 2000. Working in a RAND computer laboratory, searchers used Dell Pentium II Processor Computers equipped with Sony Multiscan 100sf monitors and S3 ViRGE-DX/GX Video Cards. Each system was configured with 64 megabytes of RAM and connected to the RAND Internet by an Ethernet Connection. All computers were operated using Windows 98 and Internet Explorer 5.0 with cookie files (used by Web sites to identify when Internet searchers visit their sites) disabled.
One of the methodological challenges in studying health information on the Internet is capturing Web pages as they appear on the user's screen. For this study, we used an Internet software application called Catch the Web®, produced by Math Strategies in Greensboro, North Carolina. This software enabled the project researchers to save the Web pages accurately, as they appeared on the screens, for use at a later date. This ensured that coders classified the pages as they would have appeared to a user on the screen.
Analytic Methods
All analyses were conducted separately for English and Spanish sites and search engines. All statistical tests were two-sided, and were assessed for significance at the 0.05 level.
The unit of analysis was the link (specific URL [uniform resource locator]). For analyses of Web site overlap and the proportion of links leading to relevant content, the universe was all selected links. For analyses of the proportion of links that were relevant, the universe was all links on the first results page.
For each relevant link, the number of engines listing that link in their top ten relevant links was noted. The uniqueness of a search engine's links was measured by computing the average number of other engines finding each of that engine's top relevant links, then dividing by the number of other engines. This yielded the probability that a single randomly selected search engine would list a randomly selected link from the first engine's top ten links in its own top ten list of links.
In evaluating the proportion of links leading to a given outcome, an omnibus 10 x 2 (9 degrees of freedom) chi-squared test of homogeneity was used to assess whether proportions differed among the search engines. If the omnibus test was significant, a series of ten 2 x 2 (1 degree of freedom) chi-squared tests contrasting the proportion in each search engine to the mean proportion of the remaining search engines was conducted. Search engines yielding statistical significance in this series of tests were considered significantly different from the overall mean. A similar two-stage procedure was used to assess the existence of differences in proportions by condition.
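The overlap measure described above can be written out as a short calculation. The sketch below follows the wording of the paragraph (average number of other engines listing each link, divided by the number of other engines); the function name, engine names and URLs are hypothetical placeholders rather than study data.

```python
def overlap_probability(engine, top_links):
    """Probability that a randomly chosen other engine lists a randomly
    chosen link from `engine`'s top links in its own top list.

    `top_links` maps each engine name to the set of URLs in its top list.
    """
    own = top_links[engine]
    others = [name for name in top_links if name != engine]
    if not own or not others:
        raise ValueError("need at least one link and at least one other engine")
    # For each of this engine's links, count how many other engines list it.
    listings = sum(
        sum(link in top_links[other] for other in others) for link in own
    )
    mean_other_engines = listings / len(own)
    return mean_other_engines / len(others)


# Hypothetical top lists for three engines (two links each for brevity).
top_links = {
    "EngineA": {"site1.example", "site2.example"},
    "EngineB": {"site1.example", "site3.example"},
    "EngineC": {"site4.example", "site5.example"},
}
for name in top_links:
    print(name, overlap_probability(name, top_links))
```

A value near 0 means an engine's top links are almost entirely unique to it; a value near 1 means the other engines return nearly the same list.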
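The two-stage testing procedure can be sketched as follows. This is an illustration under stated assumptions, not the authors' analysis code: the counts are invented, the follow-up contrast pools the remaining engines into a single row (one possible reading of "the mean proportion of the remaining search engines"), and significance is judged against the standard 0.05 critical values of the chi-squared distribution for 1 and 9 degrees of freedom.

```python
# Upper 0.05 critical values of the chi-squared distribution.
CRITICAL_05 = {1: 3.841, 9: 16.919}


def chi_squared_homogeneity(table):
    """Pearson chi-squared statistic and degrees of freedom for a table of
    counts (one row per group). Assumes no row or column total is zero."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for r, row in enumerate(table):
        for c, observed in enumerate(row):
            expected = row_totals[r] * col_totals[c] / grand_total
            statistic += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df


def two_stage(counts):
    """counts maps engine -> (links with outcome, links without outcome).
    Returns the engines that differ from the pooled remaining engines,
    or an empty list if the omnibus test is not significant."""
    statistic, df = chi_squared_homogeneity([list(v) for v in counts.values()])
    if statistic <= CRITICAL_05[df]:
        return []
    flagged = []
    for name, (hits, misses) in counts.items():
        rest_hits = sum(h for n, (h, _) in counts.items() if n != name)
        rest_misses = sum(m for n, (_, m) in counts.items() if n != name)
        stat, _ = chi_squared_homogeneity(
            [[hits, misses], [rest_hits, rest_misses]]
        )
        if stat > CRITICAL_05[1]:
            flagged.append(name)
    return flagged


# Hypothetical counts for ten engines: nine alike, one clearly different.
counts = {f"engine{i}": (50, 50) for i in range(1, 10)}
counts["engine10"] = (90, 10)
print(two_stage(counts))
```

Running the omnibus test first limits the follow-up contrasts to cases where there is already evidence that the engines differ at all.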
Results
Degree of Overlap. Figure 2.2 shows the degree of overlap among English-language search engines in the top 10 Web sites identified in response to a structured search. On average, 11% of Web sites found by one English search engine appeared on the top ten list of another search engine (range 1-24%). For Spanish-language search engines, the average was 25% (range 11-33%) (Figure 2.3).
Efficiency of Searches and Relevance of Content by Search Engine. The first page of search results from all 10 English-language search engines listed 3735 links, 1265 (34%) of which were relevant. A typical search produced a list of 93 links on the first page of search results, about one third of which were relevant to the search (had the search term or a related word in the title). Among the English-language search engines studied, AltaVista, Direct Hit and Metacrawler produced higher than average proportions of relevant links, while Excite and Northern Light produced lower proportions. On average, 3 in 5 of these relevant links reached information related to the search. Relevant links found using Northern Light and Google were significantly more likely to reach information related to the search; relevant links using Direct Hit, Goto and AltaVista were significantly less likely to do so. Relevant links led to relevant content within ten clicks 59% of the time (Table 2.3, column 3). Overall, consumers using English search engines have a 1 in 5 chance of finding information that is relevant to their search (about 34% of first-page links were relevant, and about 59% of relevant links led to relevant content).
To address the concern that relevant links sampled from the lower portions of the first results page might have been less likely to lead to relevant content, we compared the first five relevant links on the first results pages to other sampled relevant links. Sixty-one percent of the former links and 58% of the latter links led to relevant content, a difference that was not statistically significant. Overall, relevant content was one click away from a relevant link about one-third of the time with considerable variation among search engines (range, 10-52%) (Table 2.3, column 5).
The first page of search results from all Spanish-language search engines listed 1685 links, 296 (18%) of which were relevant. A typical search produced a list of 105 links, fewer than one-fifth of which were relevant. Yahoo Espanol produced higher than average proportions of relevant links, while Yupi produced lower proportions. On average, about 3 in 5 of these relevant links reached information related to the search. Relevant links found using TeRespondo were significantly more likely to reach information related to the search; relevant links using Quepasa were significantly less likely to do so. When links led to information related to the search, it required more than one click to find this information 62% of the time.
As with the English-language search engines, the first five relevant links on the first results page were no more likely to lead to relevant content than other sampled relevant links (58% and 69% respectively reached relevant content, a difference that was not statistically significant).
Efficiency of Search Engines by Condition. The proportion of links identified as relevant on English search engines varied by condition, with obesity having the lowest performance (23% of links were relevant) and breast cancer the highest (46% of the links were relevant), as shown in Table 2.4. Childhood asthma had the highest proportion of relevant content pages reached from relevant links in one click (48%). Obesity had the lowest proportion of relevant content pages reached from relevant links in one click (22%).
On the Spanish-language search engines, the proportion of relevant links was low, ranging from 14% to 21% for the four medical conditions. The proportion of pages with relevant content reached from relevant links ranged from 79% for childhood asthma to 58% for both breast cancer and depression. The proportion of relevant content pages reached in one click from a relevant link was also highest for childhood asthma (62%) and lowest for obesity (38%).
Type of Information Found. Allopathic medical content (e.g., information on surgical options for breast cancer) was found on two-thirds of English-language relevant content pages. Alternative medical content (e.g., information on herbal therapies for breast cancer) was found on 11% of English-language relevant content pages. Results in Spanish were similar to English.
Commercialization. Both explicit advertisements (banner and sidebar ads) and promotional material (products or information promoted outside of the banner or sidebar position) were common on relevant content pages. More than half of relevant content pages contained explicit advertisements and 44% contained promotional material that was not presented as an explicit advertisement. Pages without alternative medical content were less likely to contain promotional material than pages with such content (32% vs. 64%, p<0.05). The presence of advertisements and promotional materials on relevant Spanish-language content pages was 36% and 21%, respectively.
Discussion
The tremendous amount of information available to consumers is clearly one of the major attractions of the Internet as a means for obtaining health information, but consumers must sift through a large amount of material during their searches. This study found that search engines are only moderately efficient in locating information on a particular health topic and that the efficiency with which relevant information can be found varies significantly across search engines and conditions. Overall, just one in five links identified by 10 English-language search engines and one in eight links from 4 Spanish-language search engines led to a Web page with content relevant to the search. More than half of consumers who use the Internet report that they use search engines to find health information, and they spend about a half hour on such searches (Carolyn Gratzer, Cyber Dialogue, Oral Communication, October 13, 2000), so efficiency and the relevance of information retrieved are important aspects of performance.
Although advertising and other non-explicit promotional material were common, it was beyond the scope of this study to evaluate whether or not consumers have difficulty recognizing commercial or promotional health information.
In addition, when we reviewed the top ten Web sites listed on each of the search engines, the degree of overlap was quite small (only 11% overall). This level of variability across search engines and conditions suggests that the likelihood of finding the information one needs varies considerably depending on which search engine is used. No search engine is clearly better than another, but where users start matters.
This section of the study has some important limitations. First, the Internet changes constantly, and we were only able to study it at one point in time. However, without concerted attention to the issue, it seems unlikely that the variability in performance will change dramatically. Second, we looked at a small sample of search engines and conditions, and hence we cannot draw more general conclusions about the performance of all search engines and information on all conditions. However, because we included the most popular search engines, the results should reflect what most people experience. Third, we studied the performance of search engines using very simple search terms describing the medical condition. Our findings regarding the efficiency of search engines in yielding relevant content might have been quite different if more sophisticated search strategies had been employed. Fourth, the research conducted here was not a naturalistic experiment (e.g., using actual consumers to search for information and testing their knowledge after such a search), so we cannot draw conclusions about what consumers actually encounter when they search for information, or about how well they are able to judge the quality of the information they find. However, the systematic nature of our methods provides a backdrop for future studies of actual consumer behavior--we can compare what consumers are able to find with what is actually out there to find.
[1] Danny Sullivan (editor of Search Engine Watch.com), Tamas Dosckocs (National Library of Medicine) and Dan Durazo (Durazo Communications).