Images on the Web present a major accessibility issue for the visually impaired, mainly because the majority of them do not have proper captions. This paper addresses the problem of attaching proper explanatory text descriptions to arbitrary images on the Web. To this end, we introduce Phetch, an enjoyable computer game that collects explanatory descriptions of images. People play the game because it is fun, and as a side effect of game play we collect valuable information. Given any image from the World Wide Web, Phetch can output a correct annotation for it. The collected data can be applied towards significantly improving Web accessibility. In addition to improving accessibility, Phetch is an example of a new class of games that provide entertainment in exchange for human processing power. In essence, we solve a typical computer vision problem with HCI tools alone.
The Web is not built for the blind. Only a small fraction of major corporate websites are fully accessible to the disabled, let alone those of smaller organizations or individuals. However, millions of blind people surf the Web every day, and Internet use by those with disabilities grows at twice the rate of the non-disabled.
One of the major accessibility problems is the lack of descriptive captions for images. Visually impaired individuals commonly surf the Web using "screen readers," programs that convert the text of a webpage into synthesized speech. Although screen readers are helpful, they cannot determine the contents of images on the Web that do not have descriptive captions. Unfortunately, the vast majority of images are not accompanied by proper captions and are therefore inaccessible to the blind (as we show below, fewer than 25% of the images on the Web have an HTML ALT caption).
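As an illustration of how the share of images carrying an HTML ALT caption could be measured on a single page, the following is a minimal sketch using Python's standard `html.parser` module. It is not the measurement procedure used in this paper; the names `AltCounter` and `alt_coverage` are hypothetical, and treating an empty or whitespace-only ALT attribute as "no caption" is an assumption made for this example.

```python
from html.parser import HTMLParser


class AltCounter(HTMLParser):
    """Counts <img> tags and how many carry a non-empty alt attribute.

    Hypothetical helper for illustration only.
    """

    def __init__(self):
        super().__init__()
        self.total = 0      # all <img> tags seen
        self.with_alt = 0   # <img> tags whose alt text is non-empty

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        self.total += 1
        # attrs is a list of (name, value) pairs; value may be None.
        alt = dict(attrs).get("alt")
        # Assumption: empty or whitespace-only alt text does not count.
        if alt and alt.strip():
            self.with_alt += 1


def alt_coverage(html):
    """Return (images_with_alt, total_images) for an HTML string."""
    parser = AltCounter()
    parser.feed(html)
    parser.close()
    return parser.with_alt, parser.total


sample = (
    '<img src="a.png" alt="A cat">'
    '<img src="b.png">'
    '<img src="c.png" alt="">'
    '<img src="d.png" alt="dog" />'
)
captioned, total = alt_coverage(sample)
```

Note that such a count says only whether ALT text is present, not whether it is a correct or sufficient description of the image.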
Today, it is the responsibility of Web designers to caption images. We want to take this responsibility off their hands. Our goal is to assign proper descriptions to arbitrary images. We call a description "proper" if it is both correct, meaning that it makes sense with respect to the image, and sufficient, meaning that it gives enough information about the image's contents. Rather than designing a computer vision algorithm that generates natural language descriptions for arbitrary images (a feat still far from attainable), we opt for harnessing humans. It is common knowledge that humans have little difficulty in describing the contents of images, although typically they do not find this task particularly engaging. On the other hand, many people will spend a considerable amount of time on an activity they consider "fun." Thus, like the ESP Game, we achieve our goal by working around the problem and creating a fun game that produces the data we aim to collect.
We therefore introduce Phetch, a game that, as a side effect, generates explanatory sentences for randomly chosen images. As with the ESP Game, we show that if our game is played as much as other popular online games, we can assign captions to all images on the Web in a matter of months. We also outline how the output of the game can be used to build a system that improves the accessibility of the Web.
Design of a Useful Game
A traditional algorithm is a series of steps that may be taken to solve a problem. We consider Phetch a kind of algorithm: like one, it has well-defined input and output, namely an arbitrary image from the Web and its proper description, respectively. Because it is designed as a game, Phetch needs to be proven enjoyable. We do so by showing usage statistics from a one-week trial period. Because it is designed to collect a specific kind of data, Phetch's output needs to be proven both correct and sufficient. We show this through a specifically designed experiment.