
Project #1
CpSc 829: Advanced Compiler Topics
Computer Science Department
Clemson University
Strict HTML Validation with Flex & Bison
Brian Malloy, PhD
January 29, 2007



In order to receive credit for this assignment, your solution must be submitted, using the handin command, by 8AM, Monday, February 12th, 2007. I will zip your files and move them to my directory at that time. You may submit your solution before the deadline as many times as you like; only your final submission will be considered.

The purpose of this assignment is to help you become familiar with (1) scanning and parsing using flex and bison, (2) writing context-free grammars, and (3) HTML.

Your assignment is to construct a strict validator for an HTML web page. The most important aspect of any HTML web page is the content, rather than the appearance. An HTML web page is composed of text and tags, where a tag is a pre-defined language directive that provides instructions to the browser about how to display or how to handle the subsequent content of the web page. There are two kinds of tags: empty tags (singleton) and non-empty tags (or balanced tags). An empty tag does not require an end tag, whereas a non-empty (balanced) tag does require an end tag.

The goal of this assignment is to construct a validator that is stricter than the typical validator found on the internet. Ideally, our validator would resemble an XML validator, where all tags are required to be balanced. However, almost all browsers permit empty tags (singletons) and most web pages use them, so we will limit the set of permitted empty tags rather than forbid them completely. The empty tags that we will permit are:

  <area>
  <base>
  <br>
  <col>
  <hr>
  <img>
  <input>
  <link>
  <meta>
  <!-- -->
  <!DOCTYPE>
Thus, all other tags, such as p, for paragraph, are required to be balanced.
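
To make the distinction between empty and balanced tags concrete, here is a minimal Python sketch of the balance check. It is only an illustration, not the flex and bison solution that the assignment requires. The names EMPTY_TAGS and is_balanced are hypothetical, and the sketch assumes that a scanner has already reduced the page to (kind, name) pairs and has discarded text, comments, and the DOCTYPE.

```python
# Hypothetical illustration only; the assignment itself requires flex and bison.
# Assumption: comments and <!DOCTYPE> were already discarded by the scanner.
EMPTY_TAGS = {"area", "base", "br", "col", "hr", "img", "input", "link", "meta"}


def is_balanced(tokens):
    """Return True if every non-empty tag in `tokens` is properly closed.

    `tokens` is a list of (kind, name) pairs, where kind is "open" or "close".
    """
    stack = []
    for kind, name in tokens:
        if kind == "open":
            # Empty tags never need an end tag, so they are not pushed.
            if name not in EMPTY_TAGS:
                stack.append(name)
        elif kind == "close":
            # An end tag must match the most recent unclosed start tag.
            if not stack or stack.pop() != name:
                return False
        else:
            raise ValueError(f"unknown token kind: {kind!r}")
    # Any tag still on the stack was opened but never closed.
    return not stack
```

In a bison grammar the same nesting rule would normally fall out of recursive productions rather than an explicit stack; the stack here just makes the rule visible.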

As a further restriction, our validator will require the use of html, head, title and body tags and will require that these tags be non-empty (balanced).

Make sure that your validator can handle a web page that contains tables in the body of the document. Of course, you should be able to skip comments, since most web pages use them and this is an obvious use of states in flex.
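
The idea behind using scanner states for comments can be sketched in Python as a two-state loop. This is an illustration of the technique only, not flex code; the function name strip_comments and the state names are assumptions chosen for the example (flex's default start condition happens to be called INITIAL).

```python
def strip_comments(text):
    """Remove <!-- ... --> comments using two explicit scanner states."""
    state = "INITIAL"
    out = []
    i = 0
    while i < len(text):
        if state == "INITIAL":
            if text.startswith("<!--", i):
                # Comment opener: switch state and consume the opener.
                state = "COMMENT"
                i += 4
            else:
                out.append(text[i])
                i += 1
        else:  # state == "COMMENT"
            if text.startswith("-->", i):
                # Comment closer: return to normal scanning.
                state = "INITIAL"
                i += 3
            else:
                # Everything inside a comment is discarded.
                i += 1
    if state == "COMMENT":
        raise ValueError("unterminated comment")
    return "".join(out)
```

Note that tags inside a comment are ignored, which is why the comment state must be handled in the scanner before any tag matching takes place.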

Your program should also accommodate attributes within tags, and parsing attributes in tags may be the most difficult part of this assignment. Attributes, in HTML, can take the following forms:

  1. Identifier
  2. Identifier = Number
  3. Identifier = Identifier
  4. Identifier = quoted text

However, XML and XHTML demand that all attribute values inside tags be enclosed in quotes; thus, our HTML validator will disallow the second and third forms and require that all attribute values inside tags be enclosed in quotes. Of the following two tags, the first is correct and the second is incorrect:

<table cellspacing="0" class="address">
<table cellspacing=0 class="address">
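
The quoting rule can be sketched with a regular expression in Python. This is a rough illustration under stated assumptions, not the scanner or grammar the assignment asks for: the pattern names and is_strict_start_tag are hypothetical, and the lexical form assumed for an identifier (a letter followed by letters, digits, '_', ':' or '-') is a guess rather than something the assignment specifies.

```python
import re

# Assumed identifier form; the assignment does not define it precisely.
NAME = r'[A-Za-z][A-Za-z0-9_:-]*'
# One attribute: a bare identifier, or an identifier = "quoted text".
ATTR = rf'\s+{NAME}(?:\s*=\s*"[^"]*")?'
# A lowercase tag name followed by zero or more strict attributes.
STRICT_START_TAG = re.compile(rf'<[a-z][a-z0-9]*(?:{ATTR})*\s*>')


def is_strict_start_tag(text):
    """Return True if `text` is a start tag whose attribute values are all quoted."""
    return STRICT_START_TAG.fullmatch(text) is not None


print(is_strict_start_tag('<table cellspacing="0" class="address">'))
print(is_strict_start_tag('<table cellspacing=0 class="address">'))
```

The first call accepts the quoted form and the second rejects the unquoted one, because nothing in the pattern can consume the characters =0.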

Finally, since XML is case-sensitive, your HTML validator should also be case-sensitive. Note that the XHTML standard requires all tag names to be lowercase.

Make sure that you comment, in your README, about how you handled attributes in tags. Also, I would like to place some student solutions on my web page, since they are frequently interesting and often provide informative alternatives to my solution. Please comment in your README about whether or not I may place your solution on my web page.

I will provide a grammar and some code to get you started and I will explain this initial grammar during lecture. Please feel free to use any code that I provide on my web page. Also, I will discuss the HTML table facility. I will also provide a Python regression test script so that you (and I) can automate regression testing, and I will show you how to use the debug facility in both flex and bison.

Your submission must include a README, a Makefile, the regression test script written in Python, and a testsuite directory containing positive and negative test cases. Please state in your README your name, the course title and number, a few sentences describing the assignment, the strengths and weaknesses of your validator, and anything extra that you have incorporated into the application. Please feel free to comment on the assignment in any meaningful manner.
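
For orientation, a regression test script for such a test suite might look roughly like the Python sketch below. It is not the script provided for the course. It assumes a hypothetical layout in which the testsuite directory has positive and negative subdirectories of .html files, that the validator reads the page on standard input, and that it exits with status 0 only for a valid page; none of these conventions is stated in the assignment.

```python
import subprocess
from pathlib import Path


def run_suite(validator, suite_dir):
    """Run `validator` on every test page and return the failing cases.

    `validator` is a command list such as ["./validate"] (hypothetical name).
    Each failure is a (path, expected_ok, actual_ok) tuple.
    """
    failures = []
    for kind, expected_ok in (("positive", True), ("negative", False)):
        for page in sorted(Path(suite_dir, kind).glob("*.html")):
            with page.open("rb") as source:
                result = subprocess.run(
                    validator,
                    stdin=source,
                    stdout=subprocess.DEVNULL,
                    stderr=subprocess.DEVNULL,
                )
            # Assumed convention: exit status 0 means the page is valid.
            actual_ok = result.returncode == 0
            if actual_ok != expected_ok:
                failures.append((str(page), expected_ok, actual_ok))
    return failures
```

A call such as run_suite(["./validate"], "testsuite") would then return an empty list when every positive case is accepted and every negative case is rejected.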

You do not need to heavily comment your scanner or parser; a few comments at the beginning of each should do it. Also, please place a few comments at the beginning of your main program that identify you and describe the assignment.

The handin command:

        handin.829.1 1 *




Brian Malloy 2007-01-29