Data Collection Method

We began this effort in July 2000 by manually interrogating some online personals Web sites. Most of the larger online personals sites ask men and women to enter their height and the preferred height of the man or woman they are seeking. However, the sites did not show these preferences; they were used only for the search function. The one exception to this was the Netscape/AOL Personals Web site. This site listed many attributes of the person and their preferences, including minimum and maximum preferred height.

To get a sense for what the data might look like, we manually looked up each woman's entry and entered her statistics into an Excel spreadsheet. At this point we were only interested in determining the importance of male height for women. Based on various news reports and shared personal experiences, we knew that a man's height was a significant factor for a woman. Women generally do not prefer short men. However, it is commonly believed that short women do not have the same difficulty finding a date that short men do. In a future study, we will analyze the Netscape/AOL Personals database to determine men's preferences with respect to height.

Manually retrieving a limited set of personal information was an arduous, time-consuming and error-prone activity. Each record took about 27 seconds to create. We considered a minimal sample size to be 100 entries for each of the 50 states, or 5,000 records in all. At that rate, the manual data collection would have taken 37½ hours to complete, sitting in front of a computer screen. It would not have generated very good data, since more populous states should have provided proportionally more records. Also, only the minimum number of attributes could have been collected: age, height, minimum preferred height and maximum preferred height. Adding other attributes such as Race and Body Type would have increased the data collection time significantly.

To solve this problem, we wrote a Java program that automatically accessed the Netscape/AOL Personals Web site and retrieved the appropriate data. The program used a fairly unsophisticated technique called screen scraping to obtain the information that was required. Screen scraping has been around for many years, since long before the Web. Rather than designing an interface between two systems, screen scraping requires the program to look at ALL information coming back from the other system, in this case the Netscape/AOL Personals Web site, and search for specific strings to locate the information needed. This means the program needs to know the exact layout of the other system. Any changes to the other system will break the program. The program can only be used with the Netscape/AOL Personals Web site and would not be easily extendable to another personals Web site.
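The string-search idea can be illustrated with a minimal sketch. The original program's source is not shown here, so the class name, the helper `extractBetween`, the marker strings and the HTML fragment below are all hypothetical; they only show how a value can be located by searching for fixed text around it, and why a layout change breaks the lookup.

```java
import java.util.Optional;

/** Sketch of marker-based screen scraping; markers and HTML are hypothetical. */
class ScrapeSketch {

    /**
     * Returns the trimmed text between the first occurrence of startMarker
     * and the next occurrence of endMarker, or empty if either is missing.
     */
    static Optional<String> extractBetween(String page, String startMarker, String endMarker) {
        int start = page.indexOf(startMarker);
        if (start < 0) {
            return Optional.empty();
        }
        start += startMarker.length();
        int end = page.indexOf(endMarker, start);
        if (end < 0) {
            return Optional.empty();
        }
        return Optional.of(page.substring(start, end).trim());
    }

    public static void main(String[] args) {
        // Made-up fragment; the real site's markup is not given in the text.
        String page = "<tr><td>Height:</td><td> 5'6\" </td></tr>"
                    + "<tr><td>Min. preferred height:</td><td>5'10\"</td></tr>";

        System.out.println(extractBetween(page, "Height:</td><td>", "</td>"));
        System.out.println(extractBetween(page, "Min. preferred height:</td><td>", "</td>"));
        // If the site renamed a label, the marker would no longer be found.
        System.out.println(extractBetween(page, "Stature:</td><td>", "</td>"));
    }
}
```

Returning an empty `Optional` rather than throwing lets a caller decide whether a missing field should skip the record or stop the run.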

When you go to the Netscape/AOL Personals Web site, you are given several choices, one of which is browse. This is the choice the program uses. Next the Web site presents a list of states and, upon clicking on a state, you are provided a list of cities. When you click on a city, you are given a list of the first 50 people in the right frame and the first person's information in the two left frames. As you click each name in the right frame, a new person comes up in the two left frames. At the end of the list there is a link to the next 50 people. The program duplicates the user's behavior.

For a given state, the program retrieved the list of major cities in that state and the first 50 people for each city. One by one, it then called up the pages of each individual. Using simple pattern matching, the program retrieved the relevant attributes and preferences that the author called for. When it came to the end of the list of 50 people, it automatically went to the next page of 50 people. Upon reaching the end of the last page of 50 people, it went on to the next city. The data collection was complicated somewhat by the use of frames: two separate pages needed to be retrieved for every person.
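The traversal order described above (state, then city, then page of 50, then person) can be sketched as a pair of nested loops. This is an illustration only: the `PersonalsSite` interface, its method names and the in-memory `fakeSite` stand-in are assumptions introduced here so the sketch runs without network access, and they do not reflect the original program's design.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class CrawlSketch {

    /** Hypothetical view of the site; none of these names come from the original program. */
    interface PersonalsSite {
        List<String> cities(String state);
        /** Person ids on the given 0-based page; empty once past the last page. */
        List<String> personIds(String state, String city, int page);
        /** The two frame pages that together hold one person's details. */
        List<String> framePages(String personId);
    }

    /** Walks city -> page -> person for one state, returning one combined text per person. */
    static List<String> crawlState(PersonalsSite site, String state) {
        List<String> records = new ArrayList<>();
        for (String city : site.cities(state)) {
            for (int page = 0; ; page++) {
                List<String> ids = site.personIds(state, city, page);
                if (ids.isEmpty()) {
                    break; // past the last page: go on to the next city
                }
                for (String id : ids) {
                    // Because of the frames, two pages are fetched per person.
                    records.add(String.join("\n", site.framePages(id)));
                }
            }
        }
        return records;
    }

    /** In-memory stand-in for the site, keyed by city name. */
    static PersonalsSite fakeSite(Map<String, List<String>> idsByCity, int pageSize) {
        return new PersonalsSite() {
            public List<String> cities(String state) {
                return new ArrayList<>(idsByCity.keySet());
            }
            public List<String> personIds(String state, String city, int page) {
                List<String> all = idsByCity.getOrDefault(city, List.of());
                int from = page * pageSize;
                if (from >= all.size()) {
                    return List.of();
                }
                return all.subList(from, Math.min(from + pageSize, all.size()));
            }
            public List<String> framePages(String personId) {
                return List.of("attributes of " + personId, "preferences of " + personId);
            }
        };
    }

    public static void main(String[] args) {
        Map<String, List<String>> data = Map.of("Springfield", List.of("a1", "a2", "a3"));
        // A page size of 2 instead of 50 keeps the demonstration small.
        List<String> records = crawlState(fakeSite(data, 2), "XX");
        System.out.println(records.size() + " records");
    }
}
```

Keeping the site behind an interface is a choice made for this sketch: it lets the loop be exercised against fake data before it is pointed at a live server.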

The program was written so that at least 51 runs would be required, one for each state plus the District of Columbia. While the program could have been completely automated, we chose not to because:

  1. This would have added some complexity to the program.

  2. We wanted to monitor progress during data collection.

  3. We were concerned about generating too many sequential hits on the Netscape/AOL Personals Web site. We did not expect that Netscape/AOL would want their database opened up in this manner. So many hits from the same IP address could have raised an alarm and we did not want our access to be denied.

We started with the less populated states. In the beginning, errors cropped up that would cause the program to terminate or produce invalid records. Starting with the less populous states allowed us to fix the problems before attempting to run through the larger states like New York and California. Connectivity problems also caused the program to terminate prematurely. Still, it was not uncommon for the program to run through up to 10,000 records before a problem was detected. On average, the program could complete a record in 0.7 seconds, or about 5,000 an hour, a dramatic improvement over the manual technique.
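The two rates quoted in this section can be checked with a few lines of arithmetic. The class and method names are made up for this sketch; the inputs are the figures stated in the text (27 seconds per manual record, 100 entries for each of 50 states, 0.7 seconds per automated record).

```java
class ThroughputSketch {

    /** Hours needed to process the given number of records at a fixed rate. */
    static double hoursFor(long records, double secondsPerRecord) {
        return records * secondsPerRecord / 3600.0;
    }

    /** Records completed in one hour at a fixed rate. */
    static double recordsPerHour(double secondsPerRecord) {
        return 3600.0 / secondsPerRecord;
    }

    public static void main(String[] args) {
        long manualSample = 100L * 50; // 100 entries for each of 50 states
        System.out.printf("manual: %.1f hours%n", hoursFor(manualSample, 27.0));
        System.out.printf("automated: %.0f records/hour%n", recordsPerHour(0.7));
    }
}
```

The manual figure works out to 37.5 hours, and 0.7 seconds per record corresponds to roughly 5,143 records per hour, consistent with the rounded "about 5,000" above.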

For each record, 10 attributes and 7 preferences were retrieved and stored in flat files, one file for each state:

  Woman's Attributes        Woman's Preferences

    City                    * Minimum Height
  * State                   * Maximum Height
    Unique Identifier         Body Type
  * Age                       Race
  * Height                    Education
  * Body Type                 Religion
  * Race                      Minimum Salary
    Religion
    Education
    Salary Range

The Attributes and Preferences with a * were used in the initial analysis. Others may be included at a later time. It will be interesting to determine if a woman's Education, Religion or Salary plays a role in choosing a date. All attributes used a list of key words, but some let the person also enter a descriptive phrase. For example, a Body Type might be listed as:

Average (Curves in all the right places)

Though interesting, the descriptive text was removed because it is difficult to analyze free-form text.
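One simple way to drop such a phrase is to cut the value at the first opening parenthesis. This is a sketch under the assumption that the descriptive text always appears in parentheses after the key word, as in the example above; the class and method names are hypothetical.

```java
class BodyTypeCleaner {

    /** Keeps the key word and drops any parenthesised phrase that follows it. */
    static String keywordOnly(String raw) {
        int paren = raw.indexOf('(');
        String keyword = (paren >= 0) ? raw.substring(0, paren) : raw;
        return keyword.trim();
    }

    public static void main(String[] args) {
        System.out.println(keywordOnly("Average (Curves in all the right places)"));
        System.out.println(keywordOnly("Average"));
    }
}
```

Both calls print `Average`, so values with and without a description end up in the same category.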

Any record that had a null value in any of the primary attributes was omitted by the program. We also discarded preferences that were unlikely to be meaningful: any height below three feet or above eight feet. Once debugged, the program was run over the course of two weeks, generally in the late evening when Internet response time was favorable. We estimate that it ran for about 34 hours.
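The filtering rules can be sketched as a single predicate. Several details are assumptions made for illustration: the `Entry` class and its field names are hypothetical, heights are taken to be stored in inches, the "primary attributes" are taken to be the starred fields, and a record with an implausible height preference is assumed to be dropped as a whole (the text does not say whether only the preference was discarded).

```java
class RecordFilterSketch {

    static final int MIN_INCHES = 3 * 12; // three feet
    static final int MAX_INCHES = 8 * 12; // eight feet

    /** Hypothetical record of the starred fields; null means "not supplied". */
    static final class Entry {
        final String state;
        final Integer age;
        final Integer heightInches;
        final String bodyType;
        final String race;
        final Integer minPreferredInches;
        final Integer maxPreferredInches;

        Entry(String state, Integer age, Integer heightInches, String bodyType,
              String race, Integer minPreferredInches, Integer maxPreferredInches) {
            this.state = state;
            this.age = age;
            this.heightInches = heightInches;
            this.bodyType = bodyType;
            this.race = race;
            this.minPreferredInches = minPreferredInches;
            this.maxPreferredInches = maxPreferredInches;
        }
    }

    /** True when none of the primary attributes is null. */
    static boolean hasAllPrimary(Entry e) {
        return e.state != null && e.age != null && e.heightInches != null
                && e.bodyType != null && e.race != null;
    }

    /** True when the height is present and between three and eight feet inclusive. */
    static boolean plausibleHeight(Integer inches) {
        return inches != null && inches >= MIN_INCHES && inches <= MAX_INCHES;
    }

    /** Assumed rule: keep a record only if it is complete and both preferences are plausible. */
    static boolean keep(Entry e) {
        return hasAllPrimary(e)
                && plausibleHeight(e.minPreferredInches)
                && plausibleHeight(e.maxPreferredInches);
    }

    public static void main(String[] args) {
        Entry ok = new Entry("NY", 30, 65, "Average", "White", 68, 76);
        Entry tooShort = new Entry("NY", 30, 65, "Average", "White", 30, 76);
        System.out.println(keep(ok) + " " + keep(tooShort));
    }
}
```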

Once all of the records were retrieved, a MySQL database was created on the same server that runs this Web site. To minimize disk space usage and improve run-time performance, all attribute values were reduced to a single letter.
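Reducing key words to single letters amounts to a fixed lookup table per attribute. The sketch below shows the idea for Body Type; the key words and the letters assigned to them are invented for illustration, since the actual lists and codes are not given.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class AttributeCodes {

    // Hypothetical key words and letters; the site's actual lists are not given.
    private static final Map<String, Character> BODY_TYPE = new LinkedHashMap<>();
    static {
        BODY_TYPE.put("Slim", 'S');
        BODY_TYPE.put("Average", 'A');
        BODY_TYPE.put("Athletic", 'T');
    }

    /** Returns the one-letter code for a body type key word. */
    static char encodeBodyType(String keyword) {
        Character code = BODY_TYPE.get(keyword);
        if (code == null) {
            throw new IllegalArgumentException("Unknown body type: " + keyword);
        }
        return code;
    }

    public static void main(String[] args) {
        System.out.println(encodeBodyType("Average"));
    }
}
```

Rejecting unknown key words, rather than silently assigning a default letter, makes unexpected values on the site visible during loading.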

Another program, called the Personals Analyzer, was then written, also in Java. This program runs as an applet in the user's browser. Each time the user selects an attribute value, the applet executes a CGI program on the Web server. The CGI program queries the database for the appropriate information and returns it to the applet. The applet then displays the data as a bar chart.