We recorded a dataset of more than 280,000 close-up eye images with ground-truth annotations of the gaze location. A total of 17 participants were recorded, covering a wide range of appearances:
- Gender: Five (29%) female and 12 (71%) male
- Nationality: Seven (41%) German, seven (41%) Indian, one (6%) Bangladeshi, one (6%) Iranian, and one (6%) Greek
- Eye Color: 12 (71%) brown, four (24%) blue, and one (6%) green
- Glasses: Four participants (24%) wore regular glasses and one (6%) wore contact lenses
For each participant, two sets of data were recorded: one set of training data and a separate set of test data. In both sets, participants were instructed to look at a series of gaze targets shown on a display. The gaze targets covered a uniform grid and were shown in random order, with the test grid offset so that its points lay in between the training points. Since the NanEye cameras record at about 44 FPS, we gathered approximately 22 frames per camera and gaze target. The training data was recorded using a uniform 24 × 17 grid of points, with an angular spacing in gaze angle of 1.45° horizontally and 1.30° vertically between neighboring points; in total, the training set contained about 8,800 images per camera and participant. The test set used a 23 × 16 grid and contained about 8,000 images per camera and participant. Together, the gaze targets covered a field of view of 35° horizontally and 22° vertically.
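As a concrete illustration, the following Python sketch reconstructs the two target grids and the resulting per-camera image counts from the numbers above. It is a minimal sketch, not the recording code: centering both grids on the same origin (which, with equal spacing, places the test points exactly halfway between the training points) is our assumption, and the helper `make_grid` is hypothetical.

```python
import numpy as np

# Reconstruction of the gaze-target grids from the figures quoted above.
# Centering both grids on the same origin is an assumption; with equal
# spacing, a centered 23 x 16 grid then falls exactly halfway between
# the points of a centered 24 x 17 grid.
H_STEP_DEG = 1.45   # horizontal gaze-angle spacing between points
V_STEP_DEG = 1.30   # vertical spacing
FPS = 44            # approximate NanEye frame rate
DWELL_S = 0.5       # recording window per target (small circle shown)

def make_grid(cols, rows):
    """Uniform grid of gaze angles in degrees, centered on the origin."""
    xs = (np.arange(cols) - (cols - 1) / 2) * H_STEP_DEG
    ys = (np.arange(rows) - (rows - 1) / 2) * V_STEP_DEG
    return np.stack(np.meshgrid(xs, ys), axis=-1).reshape(-1, 2)

train = make_grid(24, 17)   # 408 targets
test = make_grid(23, 16)    # 368 targets

frames_per_target = round(FPS * DWELL_S)  # ~22 frames per target
print(len(train) * frames_per_target)     # 8976, close to the reported
                                          # ~8,800 training images per camera
print(len(test) * frames_per_target)      # 8096, matching the ~8,000 test images
```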
The recording procedure was split into two parts, one for the training and one for the test data. For both parts, participants were instructed to put on the prototype and rest their head on a chin rest positioned exactly 510 mm in front of a display. The display was a 30-inch LED monitor with a pixel pitch of 0.25 mm and viewable image dimensions of 641.3 × 400.8 mm, set to a resolution of 2560 × 1600 pixels. On this display, the grid of gaze targets was shown. Each point appeared as a large circle 300 pixels in diameter that shrank to a circle 8 pixels in diameter over the course of 700 ms. The small circle was then displayed for another 500 ms, after which the next point was shown. Data was recorded only during these latter 500 ms, i.e., while the small circle was shown (see Figure 7a). It is important to note that the chin rest did not fully restrain participants, and we noticed that their heads sometimes moved noticeably, resulting in a certain amount of label noise. The shrinking animation helped participants locate the circle on the screen and gave them time to shift their gaze to it. Similar to [30], we also showed an “L” or an “R” after every 20th point in the sequence. The letter was displayed for 500 ms at the position of the last point, and participants were asked to confirm which letter they had seen by pressing the corresponding left or right arrow key. This was done to ensure that participants focused on the gaze targets and the task at hand throughout the recording.
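To make the geometry explicit, the sketch below shows how a gaze target's angular position maps to a pixel position on this display, using the viewing distance and pixel pitch quoted above. The assumption that gaze angles are measured from the screen center, straight ahead of the chin rest, is ours.

```python
import math

# Mapping from a gaze target's angular position to its pixel position on
# the display described above. That angles are measured from the screen
# center, directly in front of the chin rest, is our assumption.
VIEW_DIST_MM = 510.0    # chin rest to display
PIXEL_PITCH_MM = 0.25   # 2560 x 1600 pixels on a ~640 x 400 mm panel
RES_W, RES_H = 2560, 1600

def angle_to_pixel(h_deg, v_deg):
    """Project a gaze angle (degrees from the screen center) onto the display."""
    x_mm = math.tan(math.radians(h_deg)) * VIEW_DIST_MM
    y_mm = math.tan(math.radians(v_deg)) * VIEW_DIST_MM
    return (RES_W / 2 + x_mm / PIXEL_PITCH_MM,   # pixels, origin top-left
            RES_H / 2 - y_mm / PIXEL_PITCH_MM)   # screen y grows downwards

# The outermost training target (11.5 * 1.45 deg, 8 * 1.30 deg from the
# center) still fits on the 2560 x 1600 display:
print(angle_to_pixel(11.5 * 1.45, 8 * 1.30))     # approx. (1891, 426)
```

At this viewing distance, one pixel subtends about 0.028° of visual angle, so the 8-pixel fixation circle spans roughly 0.22°, well below the angular spacing of the target grid.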