Identifying and extracting structured data from web pages
This paper proposes a method that automatically identifies and extracts information matching a predetermined criterion from one or more web pages at one or more web sites, and automatically produces one or more data-field names from the extracted information. The extracted information includes at least one data-field value associated with one of the extracted data-field names. The extracted names are compared against a previously constructed database of data fields, each of which associates a data-field name with a data-field value. If an extracted data-field name matches an existing data-field name in the database, the method updates the data-field value associated with that name in the database. If an extracted data-field name does not match any existing data-field name, the method adds the extracted data-field name to the database.
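The extract-then-reconcile flow described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the "predetermined criterion" is assumed here to be "table rows with exactly two cells", names are matched by exact string equality, the database is modeled as a plain dictionary, and a newly added name is stored together with its extracted value. The names `TwoCellRowExtractor`, `extract_fields`, and `merge_into_database` are hypothetical.

```python
from html.parser import HTMLParser


class TwoCellRowExtractor(HTMLParser):
    """Collect (name, value) pairs from table rows that have exactly two cells.

    Assumption: the two-cell rule stands in for the predetermined criterion,
    which the source does not specify.
    """

    def __init__(self):
        super().__init__()
        self.fields = {}       # extracted data-field name -> extracted value
        self._cells = []       # text of each cell in the current row
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._cells = []
        elif tag in ("td", "th"):
            self._in_cell = True
            self._cells.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr":
            if len(self._cells) == 2:
                name, value = (cell.strip() for cell in self._cells)
                if name:
                    self.fields[name] = value
            self._cells = []

    def handle_data(self, data):
        if self._in_cell and self._cells:
            self._cells[-1] += data


def extract_fields(pages):
    """Run the extractor over several HTML strings and combine the results."""
    fields = {}
    for html in pages:
        parser = TwoCellRowExtractor()
        parser.feed(html)
        parser.close()
        fields.update(parser.fields)
    return fields


def merge_into_database(database, extracted):
    """Update values for matching names; add names that are not yet present.

    Returns (updated_names, added_names) so the caller can see what changed.
    """
    updated, added = [], []
    for name, value in extracted.items():
        if name in database:
            database[name] = value   # existing name: update its value
            updated.append(name)
        else:
            # Assumption: the new name is stored together with its value.
            database[name] = value
            added.append(name)
    return updated, added


# Hypothetical sample data for illustration only.
pages = [
    "<table><tr><th>Price</th><td>19.99</td></tr>"
    "<tr><td>Color</td><td>red</td></tr></table>"
]
database = {"Price": "24.99", "Weight": "2 kg"}
updated, added = merge_into_database(database, extract_fields(pages))
```

In this sample, `Price` already exists in the database, so its value is updated, while `Color` is new and is added; `Weight` is left untouched because no page mentions it. A real system would likely need a looser name-matching rule than exact equality, which is outside what the abstract specifies.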