Evaluation of English and Spanish Health Information on the Internet

Chapter 3

Quality of Health Information on the Internet

Purpose

Today, the millions of individuals who go online to find health information can choose from tens of thousands of health-related Web sites, each offering pages and pages of materials on health. Unlike some of the traditional approaches to obtaining health information, these Web sites are accessible to anyone with a computer at any hour, and most are free. When individuals go online, surveys indicate that many are looking for information on specific medical conditions (Cyber Dialogue, 2000). Obesity and cancer were among the most heavily researched medical conditions (Cyber Dialogue, 2000). Studies also show that Internet users pay attention to what they find online. More than 70% say online health information has influenced a decision about their treatment (Fox, 2001).

To date, little research has been conducted to try to describe the types and quality of information an individual might find on a variety of English- and Spanish-language health-related Web sites for multiple health conditions. In this chapter, we explore three questions about health information on the Internet:

Methods

To answer the questions posed in this chapter we selected four target medical conditions, chose 26 English- and Spanish-language Web sites (both general health and condition-specific), developed specific questions on which Web sites were evaluated for coverage and accuracy, selected material relevant to those questions on each Web site, conducted a systematic review of the material, and analyzed the results. Each of these steps is described below in greater detail.

Selecting Medical Conditions

We selected four target medical conditions: breast cancer, childhood asthma, depression, and obesity. These conditions were chosen because they are prevalent, affect diverse populations, and represent conditions for which many consumers might seek online information (Allison et al, 1999; Mokdad et al, 1999; Mannino et al, 1998; Kessler, 1994). The conditions cause significant morbidity and mortality, so health information may have the potential to greatly improve patients' education and their participation in the management of their health problems. While the findings for these four conditions cannot be generalized to all health information on Web sites, they provide a broad overview of what consumers are likely to find.

Selection of Web Sites

Eighteen unique English-language and 7 unique Spanish-language health Web sites were selected for this study; each provided information on some or all of the four medical conditions. Web sites were chosen either because they were popular or because they were devoted to one of the conditions we were studying. We selected 6 English-language general health Web sites that were ranked highly in 2 widely used Internet industry reports, Cyber Dialogue and PC Data Online, for September 2000. Content provided by one of the most popular search engines was also included. Sites providing information on more than one condition are referred to as general consumer health Web sites; sites providing information on only one condition are referred to as condition-specific Web sites. Condition-specific English-language Web sites and all Spanish-language Web sites were selected by project staff to represent prominent examples of condition-specific Web sites from commercial, government, and nonprofit educational organizations. The Spanish-language Web sites included four condition-specific sites and three general consumer sites. The English- and Spanish-language Web sites we studied and their URLs are listed in Table 3.1.

Background Information on English- and Spanish-Language Web Sites

There has been considerable attention in the press about the financial performance of a variety of companies on the Internet. In Tables 3.2 and 3.3 we provide some descriptive information about the English- and Spanish-language Web sites we studied, including revenue sources, nature of the advisory board (if present), and who is responsible for writing the content offered on the Web site.

English-Language Web sites. Of the English-language Web sites evaluated, three are publicly held corporations, five are subsidiaries of publicly held corporations, four are private corporations or are sponsored by private entities, three are funded by grants, donations or membership fees, and four are government sponsored through the National Institutes of Health. One Web site, Obesity-online.com, listed no information as to the sponsoring entity.

All but two of the English-language Web sites (Allhealth.com and Yahoo.com) listed a medical advisory board, editorial board, or committee that has input into and/or oversight of the health content. Some sites list only board member names and credentials; however, many of the Web sites also provide detailed biographical information. Allhealth.com does not appear to provide any information regarding a medical or editorial advisory board. However, it appears that all of the health content on Allhealth.com is written by outside medical and health experts whose credentials, background, and experience are well documented. Yahoo.com also does not indicate that there is a medical advisory board to oversee the site's health content. However, according to Yahoo, all of the health information on the site is provided by credible companies in the health care industry.

With regard to medical content, all of the publicly held Web sites provided information on who writes their health content. The four government sites did not offer specific information about content writers, but indicated that content is written by medical experts within the institutes. Three sites had no information regarding the source of their health content. Of the five remaining sites, three provided information on their health content and two made reference to medical editors or writing teams but gave no specific information.

Spanish-Language Web sites. Of the seven unique Spanish-language Web sites evaluated, two are private corporations or are sponsored by private entities, one is funded by grants, donations, and membership fees, two are government sponsored through the National Institutes of Health, and two provide no information as to the nature of their sponsoring entity.

All but two of the Web sites listed a medical advisory board, editorial board, or committee that has input into and/or oversight of the health content. Some sites list only board member names and credentials; however, many also provide detailed biographical information. Saludlatina.com and Centropeso.com do not appear to have any information regarding a medical or editorial advisory board. However, it appears that all of the health content on Saludlatina.com is written by outside medical institutes, health organizations or journals, which are documented on the site. Centropeso.com does not indicate the source of any of its health content.

With regard to medical content, two sites (Noah-health.org and Salud.com) out of seven provide information on who writes their health content. The two government sites do not offer specific information about content writers, but indicate that content is written by medical experts within the institutes. The remaining three sites (Saludlatina.com, GraciasDoctor.com and Centropeso.com) have no specific information regarding authorship of their health content. As mentioned, Saludlatina.com cites the majority of its health content from outside sources. GraciasDoctor.com occasionally cites an author but gives no more information than a name. Centropeso.com gives no information regarding the authors of its health content.

Selecting Expert Panelists

To assess the quality of health-related information on the Web sites, we identified a series of condition-related topics and corresponding consumer-oriented questions deemed essential for consumers to know. To generate these topic areas and questions for each of the four medical conditions, small panels consisting of clinical experts and representatives from patient advocacy organizations for each of the four conditions were assembled. Each panel consisted of 3-4 experts recruited for their national reputation in the condition of interest, clinical or scientific experience, and familiarity with national guidelines and current research. Appendix B lists the expert panelists for each medical condition.

Developing Condition-Related Topics and Consumer Questions

We developed condition-related topics and corresponding consumer questions through a structured, multi-step process involving both RAND staff and the expert panels. RAND staff developed an initial list of topic areas and consumer questions based on a review of national guidelines and scientific literature as well as informal discussions with clinical experts in each of the medical fields (American College of Radiology, 1998; Williams and Wilkins, 1996; National Task Force on the Prevention and Treatment of Obesity, 2000; National Heart, Lung and Blood Institute, 1998; Mulrow et al, 1999; Linde et al, 1998; Cole et al, 1999; Mynors et al, 1995; Wilson et al, 1998; American Psychiatric Association, 1994).

Concurrent with the RAND staff process, each expert panel member was asked to submit five to ten consumer-oriented questions that would reflect the concerns of patients, their families, or lay persons seeking information on the study conditions. Panelists were asked to consider questions from each of three categories: (1) condition-related topics about which there is broad expert consensus and for which clear guidelines exist; (2) clinical topics about which there is uncertainty; (3) recent important developments in screening, diagnosis or treatment of the condition. An example of the first category of question is when to start breast cancer screening using mammography in women over 50. An example of the second category relates to the issue of breast cancer screening using mammography in women aged 40-49 years. An example of the third category of question is the recall of fenfluramine/phentermine for weight loss.

The lists developed by RAND staff and the expert panelists were combined to create the master preliminary list. Panelists were asked to rate each topic area and corresponding question independently on acceptability of the topic or question and level of consensus. Acceptability was rated as: (0) not acceptable to include; (1) acceptable to include; (2) preferable to include; (3) essential to include. Level of consensus was rated as: (0) no consensus; (1) minimal level of consensus; (2) moderate level of consensus; (3) high level of consensus. RAND staff collected the ratings from each panelist and developed a score on acceptability and level of consensus for each topic area and consumer question.

Panelists then met on a series of conference calls to narrow the list of topics and questions through discussion. The questions agreed to on the conference call as the top five to ten were rated a second time by panelists independently, using the acceptability and level of consensus ratings described above. RAND staff compiled the results of the second round of ratings and sent the results to the panelists. Each panel reassembled on a conference call to review the ratings for each of the topics and questions. The final five to seven questions used in this study to assess quality were agreed to by consensus of the panelists during the conference call. Panelists were asked to consider whether the topic or question was relevant to consumers, important for them to have an answer to, and necessary to find on any Web site offering information about the condition. The final set consisted of 26 condition-related topics and 36 consumer-oriented questions across the four medical conditions. They can be found in Appendix C.

Development of Condition-Related Clinical Elements

Based on extensive literature reviews, the panels then developed a series of standardized clinical elements (concepts that should be addressed) for the topics and questions. These concepts were, in effect, the sort of clinical information that consumers should expect to find for a given condition on a Web site. The expert panelists were also involved in developing these elements. RAND staff began by drafting a set of proposed condition-related clinical elements based on a review of national guidelines and the scientific literature. These condition-related clinical elements were then circulated to the expert panelists for comment. RAND staff compiled comments from all of the experts and sent the final set of clinical elements out to the panelists for final approval. For example, for the topic of breast cancer screening, 4 clinical elements were developed. These included: women older than 50 years should have mammograms every 1 to 2 years; early detection of breast cancer improves outcomes; most breast cancers occur in women without a family history; and a lack of consensus exists about the need for or appropriate interval of mammography in women from age 40 to 49 years. A final set of 100 clinical elements was developed. Appendix D lists the condition-related topics and the clinical elements.

Retrieving Health Information from English- and Spanish-Language Web Sites

During two periods (October 18-30, 2000, and November 6-13, 2000), 4 abstractors (2 monolingual in English, 2 bilingual in English and Spanish) each spent 90 minutes independently reviewing each Web site using efficient DSL connections. The time limit was based on pilot data and studies showing that consumers spend about 30 minutes looking for health information during a given search session (Cyber Dialogue, 2000). Each abstractor was asked to retrieve any information related to the consumer questions developed by the expert panelists. Abstractors did not receive any of the condition-related clinical elements prior to conducting each search. All searches began at a common starting place (e.g., condition-specific page or home page if a condition-specific page did not exist). On average, about 65% of pages selected from the Web sites were common between abstractors. The search results were saved using a software application called CatchTheWeb® (Math Strategies, Greensboro, NC). This software application allowed project researchers to accurately save, abstract, and manage Web pages for use at a later date.

Overall, 2,662 Web pages (defined by the programmer's end-of-page mark) containing 21,711 printed pages of material were retrieved from Web sites across the four conditions; 19,529 printed pages were retrieved from the English-language sites and 2,182 printed pages were retrieved from the Spanish-language sites.

All materials from each search were then assembled into separate notebooks with features identifying the source site removed before the review process. Each notebook contained the materials retrieved from a single search and an accompanying standardized rating form. The 78 unique English-language notebooks averaged 250 pages and ranged from 21 to 547 printed pages. The 32 unique Spanish-language notebooks averaged 68 pages and ranged from 8 to 366 printed pages.

Evaluating the Information Retrieved from the Web Sites

A total of 34 physicians (30 monolingual in English, 4 bilingual in English and Spanish) from around the United States were recruited to review the abstractor-retrieved materials. All physician-reviewers were board-eligible or board-certified in family medicine, general surgery, internal medicine (including allergy and immunology, hematology and oncology, infectious diseases, pulmonary and critical care), or pediatrics. We gave the physician-reviewers the Web site materials in the form of a notebook (described above); each notebook contained the content found by one abstractor on one Web site and an accompanying rating form. Each reviewer rated materials for one to four conditions. Each physician-reviewer rated at least one notebook; no reviewer rated more than five notebooks of material for any condition. No physician-reviewer rated materials from the same site twice. Forty English-language (51%) and 14 Spanish-language (44%) notebooks were selected at random for a second review to evaluate inter-rater reliability.

Rating Form Development. To obtain ratings from the reviewers, RAND staff developed a rating form for each condition. The form listed the topic area, the corresponding consumer question, and the clinical elements. Ratings were obtained at the level of the clinical element on both coverage and accuracy. The rating form also provided space for reviewers to write in notes about any conflicting information identified during their review of the materials in the notebook.

Rating Coverage. Reviewers were first asked to rate the coverage for each clinical element on a three-point scale (0 = not addressed; 1 = minimally addressed; and 2 = more than minimally addressed). Not addressed meant there was no reference to the issue on any page of the notebook. For example, under screening for breast cancer, if no mention was made of the use of mammography for screening, that would be rated as not addressed. Under therapeutic modalities for childhood asthma, if inhaled corticosteroids or inhaled beta2-agonists were not mentioned, that would be rated as not addressed. Under treatments for depression, if there was no mention of antidepressant medications, that would be rated as not addressed. And for obesity, if there was no mention of body mass index, the clinical element related to a definition of obesity would be rated as not addressed.

Minimally addressed meant the concept was mentioned at least briefly. For example, under screening for breast cancer, if mammography was mentioned as a way to identify early breast cancers, but no mention was made of who should have mammograms, how often they should be done, or their utility in reducing breast cancer mortality, this would be considered minimal coverage. Similarly, under triggers that contribute to an exacerbation of childhood asthma, if indoor triggers were mentioned but specific examples such as cockroach antigens, tobacco smoke, and dust mites were not, this would be minimal coverage. Under treatments for depression, if antidepressants were mentioned but no mention of their side effects was found, this would be minimal coverage. For obesity, if body mass index was mentioned but the ranges for overweight (25-29.9) and obesity (30+) were not provided, this would be classified as minimal coverage.

More than minimally addressed meant most or all of the elements listed in the topic areas were at least mentioned and the level of explanation was more than cursory. For example, reviewers would rate coverage as more than minimal if a Web site mentioned that screening mammography was the best way for breast cancer to be detected early in women older than 50 years, or that breast cancer may be detected earlier by mammography than physical examination, or if a detailed discussion of the pros and cons of mammography and the appropriate ages were provided. For childhood asthma, an example of a more than minimally covered item would be, under inhaled medications that reduce inflammation of the airways, a listing of the two available types of this kind of medication (inhaled steroids and cromolyn) together with their indicated uses (patients with persistent asthma who are having uncontrolled symptoms). An example of more than minimal coverage under treatments for depression would be a mention of various types of treatments, including drug and non-drug therapy, as well as their side effects.

Rating Accuracy. For each clinical element that was at least minimally addressed, reviewers also rated the accuracy of content on a three-point scale (0 = content was mostly incorrect; 1 = content was mostly correct; and 2 = content was completely correct). Reviewers were always instructed to give the higher score if they were uncertain about the rating.

Report on Conflicting Information. After completing ratings of coverage and accuracy, reviewers of English-language sites were asked to list instances of conflicting information found during their review. These conflicts did not necessarily involve the set of clinical elements for which coverage and accuracy were evaluated. All examples of conflicting information were collected from the reviewed Web site materials. Six categories of conflicting information were identified. The categories included: 1) treatments; 2) diagnosis; 3) definitions; 4) adverse effects; 5) etiology and risk factors; and 6) incidence and prevalence. Two project staff physicians then independently rated whether the examples of conflicting information were minor, significant or potentially dangerous. Examples that were identified as significant or potentially dangerous by both raters were included in the final analysis. Disagreements were settled by discussion between raters with a tendency to rate the conflict as less significant.

Authorship, Dating and Currency. Web sites were rated by RAND staff according to the HON (Health on The Net Foundation) code's criteria of authorship (whether authors and their affiliations and credentials were clearly identified) and dating (whether the date the Web site's materials were created or updated was specified). Where dates were specified, the currency, defined as the most recent date of modification, was coded.

Units of Analysis and the Derivation of Outcome Variables. The unit of analysis was the rating form for analyses of coverage, accuracy, and conflicting information. For overall (cross-topic) analyses of coverage and accuracy, clinical element-level scores were aggregated at the reviewer (rating form) level into a single observation of each outcome variable per rating form. Summary scores at the condition or Web site level were derived as averages of these reviewer-level observations. For topic-level summaries of coverage and accuracy, clinical element-level scores were aggregated at the topic level within reviewers, then averaged across reviewers. The unit of analysis was the Web page for the study of authorship, dating and currency, excluding Web site home pages.
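
The two aggregation paths described above can be illustrated with a short sketch. It is not the study's code: the record layout, the reviewer and topic labels, the ratings themselves, and the use of "share of elements rated 2" as the outcome are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (reviewer, topic, coverage rating of one clinical element).
# Coverage scale: 0 = not addressed, 1 = minimally, 2 = more than minimally addressed.
ratings = [
    ("reviewer_a", "screening", 2), ("reviewer_a", "screening", 2),
    ("reviewer_a", "screening", 1), ("reviewer_a", "screening", 0),
    ("reviewer_a", "treatment", 2), ("reviewer_a", "treatment", 1),
    ("reviewer_b", "screening", 2), ("reviewer_b", "screening", 2),
    ("reviewer_b", "screening", 2), ("reviewer_b", "screening", 0),
    ("reviewer_b", "treatment", 1), ("reviewer_b", "treatment", 1),
]


def share_top(values):
    """Proportion of ratings in the top category (2)."""
    return sum(1 for v in values if v == 2) / len(values)


def form_level_scores(records):
    """Cross-topic path: one observation per reviewer (rating form)."""
    by_reviewer = defaultdict(list)
    for reviewer, _topic, rating in records:
        by_reviewer[reviewer].append(rating)
    return {r: share_top(v) for r, v in by_reviewer.items()}


def topic_level_scores(records):
    """Topic path: aggregate within reviewer and topic, then average across reviewers."""
    by_pair = defaultdict(list)
    for reviewer, topic, rating in records:
        by_pair[(reviewer, topic)].append(rating)
    by_topic = defaultdict(list)
    for (_reviewer, topic), values in by_pair.items():
        by_topic[topic].append(share_top(values))
    return {t: mean(v) for t, v in by_topic.items()}


form_scores = form_level_scores(ratings)
overall = mean(list(form_scores.values()))  # site- or condition-level summary
topic_scores = topic_level_scores(ratings)
```

In this toy data each reviewer contributes exactly one observation to the overall summary, regardless of how many elements the form contains, which is the point of aggregating at the rating-form level first.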

General Analytic Approach. All analyses were conducted separately for the English- and Spanish-language Web sites. All statistical tests were two-sided, and were assessed for significance at the 0.05 level. Measures were tested for variation by condition and by site within condition. A two-stage test procedure was used to examine variation in each outcome by these independent categorical variables. First, an omnibus or overall test of the association was performed. If the omnibus test confirmed that variation in the outcome of interest was statistically significant for a given categorical variable (condition or site), a series of two-sample follow-up tests was performed. These follow-up tests compared the outcome at each level of the categorical variable to the overall level of the outcome.

The omnibus tests employed were one-way analysis of variance (ANOVA), the Kruskal-Wallis rank-sum test, and the chi-squared test of homogeneity for measures that were normally distributed, ordinal, and nominal, respectively. Two-sample t-tests, Wilcoxon rank-sum tests, and chi-squared tests of homogeneity were the corresponding follow-up tests.

Analyses of Coverage. The extent to which Web sites provided information relevant to each of the questions/topics and the condition was calculated as the average proportion of condition-related clinical elements that the reviewers rated as: (1) not covered; (2) minimally covered; (3) more than minimally covered. Quality rating forms contained multiple ratings (corresponding to clinical elements) of coverage and accuracy using the three-point ordinal scales. For purposes of analysis, global measures were computed across clinical elements within a given rating form.

Three global or summary measures of coverage were computed across all ratings of coverage within a rating form: (1) the proportion of condition-related clinical elements that were rated as not covered; (2) the proportion of condition-related clinical elements that were rated as minimally covered; (3) the proportion of condition-related clinical elements that were rated as more than minimally covered.
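
As a minimal sketch of these three summary measures, the function below computes the proportion of clinical elements at each coverage level for one rating form. The function name and the example ratings are hypothetical; only the 0/1/2 coverage scale comes from the rating form described earlier.

```python
from collections import Counter

# Coverage scale from the rating form:
# 0 = not addressed, 1 = minimally addressed, 2 = more than minimally addressed.
COVERAGE_LEVELS = (0, 1, 2)


def coverage_summary(form_ratings):
    """Return the proportion of clinical elements at each coverage level."""
    if not form_ratings:
        raise ValueError("a rating form needs at least one rated element")
    counts = Counter(form_ratings)
    n = len(form_ratings)
    return {level: counts.get(level, 0) / n for level in COVERAGE_LEVELS}


# Hypothetical rating form with eight clinical elements.
form = [2, 2, 1, 0, 2, 1, 2, 0]
summary = coverage_summary(form)
```

The three proportions always sum to 1 for a given form, so any two of them determine the third.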

Analyses of Accuracy. Accuracy was assessed only for items on which coverage was rated as minimally or more than minimally covered. For these items, accuracy was calculated as the average proportion of condition-related clinical elements rated by reviewers as: (1) mostly incorrect; (2) mostly correct; or (3) completely correct. One global measure of accuracy was computed across such items: the proportion of covered items that were rated as completely correct.

Combined Score for Coverage and Accuracy. We computed the proportion of clinical elements for each topic rated by reviewers as both more than minimally covered and completely accurate. We also computed a global measure of combined coverage and accuracy at the clinical element level: the proportion of all condition-related clinical elements that received scores in the top category in both coverage and accuracy. Note that these dependent measures are not binary at the rating-form level but are quasi-continuous variables constrained between 0 and 1.
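
The accuracy and combined measures differ mainly in their denominators, which a small sketch can make concrete. The pair layout and the example form are assumptions for illustration; the sketch treats an element as "covered" when it was at least minimally addressed, as in the description of accuracy ratings earlier in this chapter.

```python
# Each element is a hypothetical (coverage, accuracy) pair for one rating form.
# Coverage: 0 = not addressed, 1 = minimally, 2 = more than minimally addressed.
# Accuracy: 0 = mostly incorrect, 1 = mostly correct, 2 = completely correct,
# or None when the element was not addressed and so could not be rated.


def accuracy_score(elements):
    """Proportion of covered elements (coverage >= 1) rated completely correct."""
    covered = [acc for cov, acc in elements if cov >= 1]
    if not covered:
        return None  # nothing was covered, so accuracy is undefined
    return sum(1 for acc in covered if acc == 2) / len(covered)


def combined_score(elements):
    """Proportion of ALL elements in the top category for coverage and accuracy."""
    if not elements:
        raise ValueError("a rating form needs at least one element")
    top = sum(1 for cov, acc in elements if cov == 2 and acc == 2)
    return top / len(elements)


form = [(2, 2), (2, 1), (1, 2), (0, None), (2, 2)]
accuracy = accuracy_score(form)   # denominator: the 4 covered elements
combined = combined_score(form)   # denominator: all 5 elements
```

Because the combined score divides by all elements, a site can score well on accuracy while scoring poorly on the combined measure if much of the expected content is missing or only briefly mentioned.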

Simulations of Consumer Search for Sites with Extensive Topic Coverage

In order to simulate the experience of a consumer trying to identify a health Web site with extensive coverage of a particular health topic, we created a model based on the coverage results of the studied Web sites and applied it to a subset of eight condition-related topics (two per condition) that were thought to be of particular consumer interest. This model assumed that the consumer searched from a large universe of health Web sites until finding a site with more than minimal coverage for 75% of the five to seven clinical elements corresponding to the topic of interest. For the purpose of this simulation, we assumed that the levels of coverage observed among the studied sites were representative of the levels of coverage in the larger universe of health sites that addressed the condition to which the topic of interest applied. We further assumed that the log-odds of the levels of more-than-minimal coverage for the studied sites came from a normally-distributed universe of log-odds coverage proportions in the larger universe of sites.[2] Taken together, these assumptions allowed us to estimate the proportion of sites in the universe of sites addressing that condition that would provide such a level of coverage for a selected topic. This in turn allowed us to estimate the average (expected) number of sites that would have to be visited and examined (never returning to the same site) before finding a site with the desired level of topic coverage. When the expected average number of sites needed exceeded the total number of sites studied for a language and condition (nine or ten for English, four for Spanish), we concluded that there was not clear evidence of the existence of any sites within that language that provide extensive coverage of the topic within a single site.
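
One way to read the model above is sketched below. This is an interpretation, not the study's implementation: it assumes the inputs are each studied site's proportion of more-than-minimally covered elements for the topic, clips proportions of exactly 0 or 1 with an arbitrary `eps` so the log-odds stay finite, and treats the universe of sites as large enough that the expected number of visits is approximately the reciprocal of the share of qualifying sites. All names and the example proportions are hypothetical.

```python
import math
from statistics import mean, stdev


def logit(p, eps=0.01):
    """Log-odds of p; eps is an assumed clipping value for p equal to 0 or 1."""
    p = min(max(p, eps), 1 - eps)
    return math.log(p / (1 - p))


def normal_upper_tail(x, mu, sigma):
    """P(X > x) for X distributed Normal(mu, sigma)."""
    return 0.5 * math.erfc((x - mu) / (sigma * math.sqrt(2)))


def expected_sites_to_visit(site_coverage, threshold=0.75):
    """Estimate the share of sites meeting the threshold and the expected visits.

    site_coverage: per-site proportions of more-than-minimal coverage (assumed input).
    Returns (share_of_qualifying_sites, expected_number_of_sites_visited).
    """
    logits = [logit(p) for p in site_coverage]
    mu, sigma = mean(logits), stdev(logits)
    if sigma == 0:
        raise ValueError("sites must differ in coverage to fit the distribution")
    share = normal_upper_tail(logit(threshold), mu, sigma)
    # Large-universe approximation: visits follow a geometric distribution.
    expected = math.inf if share == 0 else 1 / share
    return share, expected


# Two hypothetical sites whose mean log-odds equals the log-odds of 0.75.
share, expected = expected_sites_to_visit([0.5, 0.9])
```

Under these assumptions, a topic for which half of the universe of sites meets the threshold would require about two visits on average, and the expected number grows quickly as the qualifying share shrinks.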

Analyses of Conflicting Information. For each of the six conflict categories a binary variable was created, indicating the presence or absence of any significant or potentially dangerous conflict of a given category on each rating form. The two-stage testing procedure described above was used to assess variation in the prevalence of conflicting information by condition among English-language Web sites.

Analyses of Authorship and Dating. Two binary indicators of whether a content-containing Web page listed an author or a date of creation or modification were combined to construct a three-level ordinal scale for Web pages: (1) neither author nor date; (2) either author or date; (3) both author and date. The two-stage testing procedure described above was employed to assess variation in the proportion of Web pages for a given site or condition that had (1) neither author nor date or (2) both an author and a date.
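
A brief sketch of how the two indicators combine into the three-level scale, and how the two reported proportions follow from it, is given here. The function names and the example pages are hypothetical.

```python
def author_date_level(has_author, has_date):
    """Map two binary indicators to the scale: 1 = neither, 2 = either, 3 = both."""
    return 1 + int(bool(has_author)) + int(bool(has_date))


def page_shares(pages):
    """Proportion of pages with neither indicator and with both indicators."""
    if not pages:
        raise ValueError("at least one Web page is required")
    levels = [author_date_level(author, date) for author, date in pages]
    n = len(levels)
    return {"neither": levels.count(1) / n, "both": levels.count(3) / n}


# Hypothetical (has_author, has_date) indicators for five pages on one site.
pages = [(True, True), (True, False), (False, False), (False, True), (True, True)]
shares = page_shares(pages)
```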

Inter-Rater Reliability. Two measures of physician-reviewer inter-rater reliability of Web site quality ratings were computed. A standard measure of reliability, the correlation in ratings between physician-reviewers examining identical material retrieved from the same Web site, was calculated. Because we wanted to assess the sensitivity of physician-reviewer ratings to variation in the retrieved material (e.g., the material retrieved by abstractor 1 versus abstractor 2 on the same Web site), a second, more stringent, measure was also computed. This was the correlation between ratings of different physician-reviewers examining different notebooks of material from the same Web site and represents the inter-rater reliability of the entire process we used to evaluate sites. We computed 16 inter-rater reliabilities by the standard rule and 16 by the stringent rule for each language: 1 for every combination of the 4 conditions and the 4 assessments (any coverage, more than minimal coverage, completely correct, and the combination of more than minimal coverage and complete correctness). Thirty reviews were included in each calculation of inter-rater reliability on Spanish-language Web sites. The standard inter-rater reliability was high: 0.90 or greater in all instances, averaging 0.96 for both English- and Spanish-language sites. The more stringent inter-rater reliability was also high for English Web sites (average reliability of 0.77) and fair for Spanish Web sites (average reliability of 0.60).
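
The text does not state which correlation coefficient was used. As an illustration only, the sketch below assumes a Pearson correlation between paired form-level scores; the scores shown are invented and do not come from the study.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length sequences of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one sequence is constant")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical proportions of more-than-minimal coverage for four Web sites.
# Standard rule: both reviewers rated the same notebook from each site.
# Stringent rule: the second reviewer would instead rate a notebook assembled
# by a different abstractor from the same site; the calculation is unchanged.
first_review = [0.40, 0.55, 0.70, 0.85]
second_review = [0.45, 0.50, 0.75, 0.80]
reliability = pearson(first_review, second_review)
```

The only difference between the standard and stringent measures in this sketch is which notebooks supply the second set of scores, which is why the stringent measure also reflects variation in what the abstractors retrieved.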

Results

We found variation in the quality of information available among conditions, topic areas and Web sites.

Coverage and Accuracy of Selected Health Topics on English-Language Web Sites

Across English-language Web sites, the clinical elements (identified by expert panelists as important for a Web site to include) that were more than minimally covered varied significantly by condition: 67% for breast cancer, 43% for childhood asthma, 53% for depression, and 40% for obesity (Table 3.4, p<0.05). There was also statistically significant variation in coverage among Web sites within conditions. For example, within breast cancer, rates of more than minimal coverage ranged from 31 to 90% (Table 3.4, p<0.05). Six of ten breast cancer sites, two of ten depression sites, and no childhood asthma sites or obesity sites provided more than minimal coverage of two-thirds of the clinical elements.

Accuracy of health information was generally high across English-language Web sites. Among the clinical elements that were at least minimally covered, the average percentage that were completely correct was 91% for breast cancer, 84% for childhood asthma, 75% for depression and 86% for obesity (data not shown).

We found significant variation among English-language Web sites in the proportion of clinical elements that were both more than minimally covered and completely accurate. For example, the percentage of condition-related clinical elements covered more than minimally and completely correctly for depression ranged from 13 to 73% among English-language Web sites (Table 3.4).

Coverage of Selected Health Topics on English-Language Web Sites by Condition

Breast Cancer. Among English-language sites, materials related to three topics (screening, risk assessment, and treatment) were more than minimally covered and completely correct 70% of the time (Table 3.5). The topics most often not covered included alternatives to standard medical and surgical treatments (28%) and evaluation of a palpable breast mass (18%). If the selected topics were addressed, they tended to be accurate most of the time, with ratings ranging from 86% (breast cancer screening) to 96% (alternative therapies). There was statistically significant variation among English-language Web sites (Table 3.4). One Web site (Oncolink.com) performed statistically better than average (Table 3.4).

A consumer randomly searching a large universe of Web sites addressing breast cancer (according to the simulated consumer search for extensive topic coverage described earlier in this chapter) might be expected to find extensive coverage[3] of the treatment of breast cancer within two Web sites, but would be expected to have to visit four Web sites before finding extensive coverage of evaluation of a palpable breast mass.
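The simulated consumer search is described earlier in this chapter and is not reproduced here. As a rough illustration only, suppose that each site a consumer visits independently offers extensive coverage of a topic with some probability `p`. That independence assumption, the function names, and the value of `p` used below are all hypothetical. Under that assumption, the number of sites visited until the first success follows a geometric distribution.

```python
def expected_sites_until_found(p):
    """Expected number of sites visited until one offers extensive coverage.

    Assumes each visited site independently offers extensive coverage with
    probability p (a simplifying assumption, not the study's simulation).
    """
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    return 1.0 / p


def prob_found_within(p, k):
    """Probability that at least one of the first k visited sites succeeds."""
    if not 0 <= p <= 1 or k < 0:
        raise ValueError("p must be in [0, 1] and k must be non-negative")
    return 1.0 - (1.0 - p) ** k
```

For example, if one site in four offered extensive coverage of a topic (`p = 0.25`), this simple model would give an expected search length of four sites.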

Childhood Asthma. There was more variation in coverage among the seven selected topics for childhood asthma (Table 3.5). Overall coverage for all condition-related topics was generally lower (Table 3.5). The topics receiving the best coverage related to therapeutic modalities and side effects (65%) and the etiology of asthma (46%). Topics that were not covered most often included symptoms suggestive of poorly controlled asthma (48%), initial management of severe asthma (33%) and common symptoms (33%). Topics that were addressed tended to be completely accurate with ratings ranging from 72% (signs of poorly controlled asthma) to 98% (etiology). Three percent of materials related to the selected topics contained specific factual inaccuracies. As an example, one childhood asthma site describes cockroaches as the leading cause of asthma among children.

No topic area received a combined score (more than minimal coverage and completely accurate) of greater than 50%. No Web site performed statistically better than the condition average for childhood asthma (Table 3.4).

A consumer randomly searching Web sites addressing childhood asthma would be expected to search four sites before finding extensive coverage of information related to symptoms suggestive of poorly controlled asthma.

Depression. Coverage of the seven condition-related topics ranged widely. Topics related to the etiology of depression were more than minimally covered 97% of the time, whereas topics related to the type of provider to see for depression were more than minimally covered 13% of the time (Table 3.6). Topics related to antidepressant medications were more than minimally covered two-thirds of the time. Accuracy of coverage ranged from 68% (treatment) to 90% (etiology). Three percent of materials related to the selected topics contained specific factual inaccuracies. As an example, one depression site stated that omega-3 fatty acid deficiencies cause major depressive disorders.

The proportion of topics that were both more than minimally covered and completely correct ranged from eight percent (who should evaluate depression) to 87% (etiology). One Web site (nimh.nih.gov) performed statistically better than average (Table 3.4).

A consumer randomly searching Web sites addressing depression would be expected to find extensive coverage of information related to antidepressant medications within two searched sites.

Obesity. Among English-language Web sites, topics that were covered most often related to health risks (59%) and indications for weight loss (48%). Topics that were not covered most often related to safety and effectiveness of dietary supplements (61%) and risks and benefits of popular diets (49%). Accuracy was generally good, ranging from 78% (indications for weight loss) to 96% (safety and effectiveness of certain dietary supplements). One topic (health risks of obesity) was more than minimally covered and accurate more than half of the time (Table 3.7). No Web site performed statistically better than average (Table 3.4).

A consumer randomly searching Web sites addressing obesity would be expected to search seven sites before finding extensive coverage of materials related to weight-reduction surgery.

Coverage and Accuracy of Selected Health Topics on Spanish-Language Web Sites

On Spanish-language sites, over half of the clinical elements identified by expert panelists as important for a Web site to include were not covered (Table 3.9). The average percentage of clinical elements that were not covered varied significantly by condition: 49% for breast cancer, 33% for childhood asthma, 61% for depression, and 69% for obesity (Table 3.9). Levels of coverage varied significantly by condition (p<0.05), but not within condition. More than minimal coverage of any of the clinical elements was rare among Spanish-language sites (39% for breast cancer, 27% for childhood asthma, 15% for depression, and 16% for obesity).

Coverage and Accuracy of Selected Health Topics on Spanish-Language Web Sites by Condition

Breast Cancer. Among the five condition-related topics selected for evaluation for breast cancer, only two (screening and evaluation of a palpable breast mass) were covered more than minimally more than 50% of the time (Table 3.10). Topics that were not covered most often related to the treatment options for Stage I and Stage II breast cancer (61%) and alternatives to standard surgical and medical treatments (90%). No Web site performed statistically better than average.

Childhood Asthma. Overall coverage of the seven condition-related topics was low (Table 3.11). One condition-related topic achieved more than minimal coverage 40% of the time (symptoms). Accuracy was also more variable; between 38 and 61% of minimally covered topics were scored as completely correct. The topic with the highest level of coverage related to the symptoms of childhood asthma (44%). No Web site performed statistically better than average.

Depression. Lack of coverage on all condition-related topic areas was particularly striking (Table 3.12). Four of the condition-related topics were more than minimally covered less than 10% of the time (antidepressant medications, role of counseling, suicidal ideation, and evaluation). Web sites provided more than minimal and completely correct coverage on what to do if an individual was experiencing suicidal ideations five percent of the time (Table 3.12). No Web site performed statistically better than average.

Obesity. More than minimal coverage on all obesity topic areas ranged from zero to 31% (Table 3.13). Topics that were covered most often included materials related to definitions and indications for weight loss (31%) and physical activity and prevention (29%). Topics that were not covered most often included risks and benefits of popular diets (100%) and safety and effectiveness of dietary supplements (100%). Accuracy was variable, ranging from 50% (availability of drugs approved for weight loss) to 81% (definitions and indications for weight loss). One topic (definitions and indications for weight loss) was more than minimally covered and accurate 30% of the time. No Web site performed statistically better than average (Table 3.9).

Presence of Conflicting Health Information on English-Language Web Sites

In the course of reading the Web site material, many reviewers noted the presence of conflicting information within a Web site. As mentioned previously, these conflicts were not necessarily related to the set of condition-related topics for which coverage and accuracy were measured. For English sites only, we calculated the proportion of times raters noted at least one significant conflict of condition-related information during their review. For example, one childhood asthma Web site reported in one place that using inhaled steroids does not stunt growth in children, and elsewhere it reported that using inhaled steroids does stunt growth in children.

Overall, just over half of Web site reviews revealed one or more conflicts of a clinically important nature (Table 3.14). Conflicts most often involved treatment (35%) and diagnosis (13%). Materials on depression most commonly had conflicts on treatment, whereas breast cancer materials most commonly contained conflicts on diagnosis (p<0.05). Appendix E lists examples of the types of conflicts noted by the physician-reviewers.

Authorship, Dating and Currency of Content on English- and Spanish-Language Web Sites

Approximately 65% of English-language materials listed both an author (institutional, individual or both) and a date (Table 3.15). Forty-six percent of all English-language materials had been created within the past year, and 45% of those dated materials had been modified within the past 1 to 3 years (Table 3.17). Approximately 9% of the materials retrieved from the English-language Web sites contained no evidence of any author or date of publication or modification.

By contrast, 14% of the Spanish-language materials specified both an author (institutional, individual or both) and a date (Table 3.16). Seventeen percent of all the Spanish-language materials had been created within the past year, and 32% of those dated materials had been modified within the past 1 to 3 years (Table 3.17). Approximately 44% of the materials retrieved from the Spanish-language Web sites contained no evidence of any authorship or date of publication or modification.

Discussion

We examined several dimensions of Web site quality (availability of key information, accuracy, identifiable authorship, and currency) related to four common health problems (breast cancer, childhood asthma, depression, and obesity). Although we found thousands of pages of material related to the key clinical topics and questions, we found gaps in the availability of key information.

What Did We Find?

Most sites provided at least minimal coverage of 75% of the condition-related topics we looked for on the sites (Table 3.8). Some sites, however, provided very little information with up to 70% of condition-related topics completely uncovered. Only four of the English-language Web sites (oncolink.com, cancernet.nih.gov, webmd.com and nimh.nih.gov) and none of the Spanish-language Web sites provided more than minimal coverage for at least 80% of the condition-related topics. Breast cancer topic areas were covered significantly more often than all other conditions; topic areas about childhood asthma and obesity were covered significantly less often. Even fewer Spanish-language Web sites provided more than minimal coverage of topics with information that was completely correct. Although the accuracy of information presented was fairly high, over half of the Web sites reviewed revealed one or more conflicts of a clinically important nature, such as about a treatment choice. About 65% of all English-language materials contained an author and a date, and most of the materials were published within 1-3 years. By contrast, 14% of all Spanish-language materials contained an author and a date, and just half of those materials were published within 1-3 years.

Findings from this study suggest that consumers using the Internet may have a difficult time finding information on a health problem. Some of the gaps were particularly striking. For example, less than half of the Spanish-language materials explained that mastectomy and lumpectomy plus radiation are equivalent treatments for early stage breast cancer. If people rely on the Internet to help guide their health decisions, these deficiencies in information could have consequences.

Can We Believe What We Found?

Critics of this study might ask whether the questions we used are really of interest to consumers, whether the selected Web sites we evaluated are representative of the material available, whether the answers we used to judge comprehensiveness and accuracy were reasonable, whether physician reviewers had access to all of the available information on a site, and whether their assessments were reliable. Here, we discuss each of these points in more detail.

To standardize the assessment of content on Web sites, we used a group of experts to identify the key questions consumers should be able to have answered when seeking information on a particular topic. The experts included both health providers who treat patients with these conditions and consumer advocates who represent the interests of patients with these problems. We have provided the questions so that readers of this report can judge for themselves whether the questions are relevant to consumers. A survey of consumers with the conditions would have been extremely useful but was beyond the scope of this study. Because this study was not a natural experiment (e.g., using consumers to search for information and testing their knowledge after such a search), we cannot draw conclusions about what people actually encounter when they search for information, or how well they are able to interpret the information they find.

For the most part, general health sites were selected because of their widespread popularity among consumers. The condition-specific sites, by contrast, are generally less frequently used. These sites were selected largely because they represented a different type of site than some of the most popular ones. Within each language and condition, we compared the average performance of condition-specific sites to the average performance of general health sites with respect to combined coverage and accuracy (the proportion of clinical elements covered more than minimally and with complete accuracy). For breast cancer, a clear pattern was apparent. Within English-language sites, three of the four highest average scores were obtained by the three cancer-specific sites; in Spanish, the best score was obtained by the one cancer-specific site. The performance of the selected set of both English- and Spanish-language breast cancer-specific Web site(s) was significantly better than the performance of the selected set of general health Web sites within the corresponding language on the topic of breast cancer (p < 0.05 in both cases). No such pattern was apparent or statistically significant among the remaining three conditions. This may be because for the other conditions, even the best performing Web sites offered only modest coverage of topics.

Perhaps we found poor performance because the condition-related clinical elements we used to judge answers to the questions were too demanding. Some might represent a standard that is "too high," and we cannot be sure that a different research group, assisted by different panelists, would not have generated different concepts. However, our clinical panelists were instructed to take the perspective of patients, not physicians, in determining what information ought to be available. Panelists were also instructed to avoid concepts that were arcane, and we also avoided controversial concepts, except when assessing whether uncertainty or controversy was properly communicated.

Another potential criticism is that because we abstracted relevant information from each site and presented it for review in printed, hard-copy form, physician reviewers did not have access to all relevant materials available on the site. We abstracted information from the sites both to make the review task manageable and to make the Web site being reviewed anonymous (so that reviewers were neither positively nor negatively influenced by knowledge of the source). While we used trained searchers to gather material for the reviewers, they collected what could fairly be described as a sample rather than the entire universe of material available on each Web site. However, the searchers were well trained and were given more time than most consumers spend looking for specific information. If our abstractors could not uncover the material in 90 minutes, it is unlikely that the average consumer could do so.

Finally, although we provided the reviewers with the clinical elements, critics might posit that the assessments were largely subjective. To evaluate the level of agreement among reviewers, a random sample of half of the sites was reviewed by a second physician. We found very high levels of inter-rater reliability, suggesting that the assessments were comparable.
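The statistic used to measure inter-rater reliability is not specified in this passage. As an illustration only, one widely used measure of agreement between two raters on categorical ratings is Cohen's kappa, which corrects the observed agreement for the agreement expected by chance. The function name and the sample ratings below are hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters' categorical ratings of the same items.

    Illustrative only: the study's own reliability statistic may differ.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must rate the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items given the same rating by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal share, summed over categories.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical category; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical ratings of four clinical elements by two physician reviewers.
first = ["correct", "correct", "incorrect", "incorrect"]
second = ["correct", "correct", "incorrect", "correct"]
kappa = cohens_kappa(first, second)
```

A kappa of 1 indicates perfect agreement, and a kappa of 0 indicates agreement no better than chance.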

So, Where Does This Leave Us?

Given the substantial variation in coverage of key topics across Web sites, consumers should probably not rely on a single Web site to answer all of their condition-specific questions.

Consumers should not assume that even well-designed and comprehensive-appearing Web sites contain all essential information on a health topic. There may be gaps. If consumers are truly interested in finding comprehensive answers to their questions, they may need to devote more time to the search than is commonly the case, and they must be willing to sort through many different sites, perhaps as many as ten, to find all relevant information.

Conflicts are not uncommon, and consumers are at risk of becoming confused or misinformed. Much of the conflict probably results from the methods by which information is updated: adding new information without systematically reviewing existing text to remove conflicting information. Conflicts can probably never be eliminated; doing so would be harder on the Web than with standard multi-authored textbooks because of the multi-dimensional layering of electronic information. Therefore, the Web should probably not serve as the final arbiter of health care information for consumers, who need access to a professional who can clarify inconsistencies and reconcile conflicts.


[2] Because log-odds for 0% and 100% are infinite, we replaced Web site scores of 0% and 100% with 5% and 95% respectively for purposes of estimation.
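The substitution described in footnote 2 can be sketched as follows. The function names are hypothetical and the estimation model itself is not shown. The sketch only illustrates why scores of exactly 0% and 100% need replacing before a log-odds transform: the log-odds, log(p / (1 - p)), is undefined at both extremes.

```python
import math


def replace_extremes(p, low=0.05, high=0.95):
    """Replace scores of exactly 0% or 100% with 5% or 95%, as in footnote 2.

    All other scores pass through unchanged.
    """
    if p == 0.0:
        return low
    if p == 1.0:
        return high
    return p


def log_odds(p):
    """Log-odds of a proportion after the 0%/100% substitution."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a proportion between 0 and 1")
    p = replace_extremes(p)
    return math.log(p / (1.0 - p))
```

With this substitution, a score of 0% maps to log(0.05 / 0.95) and a score of 100% maps to log(0.95 / 0.05), which are equal in magnitude and opposite in sign.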

[3] More than minimal coverage for at least 75% of the indicators for this topic.

