FIT3084: Behind the (World Wide) Web


In the previous lecture:

In this lecture:


What is the Internet?

Internet Use Statistics

Selected Internet use statistics from 1990 to 2006 (from Gap Minder).

Have a look at Gap Minder in your own time to answer the following questions.

  1. Which country currently has the largest percentage of Internet users?

  2. Which country has the largest number of Internet users? How long has this been the case?

  3. Which country has the lowest percentage of Internet users?

  4. What do these statistics mean for you as a web site publisher?

  5. How does wealth relate to Internet usage around the world?

What is TCP/IP?

Transmission Control Protocol/Internet Protocol (TCP/IP) is the suite of low-level protocols by which computers of different makes, models and operating systems communicate over the Internet.


How is information retrieved from the Internet?

Use one of the (many) high-level protocols and its software user interface.

As well as Gopher, WAIS... find out what these are (were) by doing a little web surfing!


File formats: storing information on the Internet

There are thousands of different file formats.

A file format is a particular way of storing or ordering information in a file.

The specification of a file format includes information regarding what goes into a file, and the order it is written/read.

Here are some you might find on the web:

  • PostScript / EPS

  • RTF

  • LaTeX

  • troff

  • SGML

  • PDF

  • Plain text

  • Proprietary word-processor formats
  • AIFF, MP3

  • GIF, JPEG, PSD, PICT, PIC, PNG, RGB, SGI, TGA, BMP, RAW, SUN, TIFF... these are a few of many file formats for images.

  • Quicktime

  • VRML, XHTML



...the list goes on and on.

Special software is needed to view, hear, play, read, interpret or edit any file format.
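
To make the idea of a file format concrete: many formats begin with a fixed sequence of bytes (a signature) that software checks before reading any further. The sketch below guesses a format from those first bytes. The function name guess_format and the small signature table are illustrative assumptions; the table covers only a handful of the formats listed above and is nowhere near complete.

```python
# Illustrative only: a tiny table of well-known leading byte signatures.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
    (b"\xff\xd8\xff", "JPEG"),
    (b"%PDF-", "PDF"),
]


def guess_format(data: bytes) -> str:
    """Return a format name if the data starts with a known signature."""
    for signature, name in SIGNATURES:
        if data.startswith(signature):
            return name
    return "unknown"


print(guess_format(b"%PDF-1.4 ..."))        # PDF
print(guess_format(b"GIF89a" + bytes(10)))  # GIF
print(guess_format(b"Hello, world"))        # unknown
```

Plain text has no signature at all, which is one reason software often has to rely on other clues (such as the file name) to decide how to read a file.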


The problems for Internet information retrieval.

  1. Q. Where do I look?

  2. Q. What software do I use to look for and retrieve the files?

  3. Q. How do I use that software?

  4. Q. What file formats do I need to be able to interpret/decode to find the answer?

How does the WWW relate to the Internet?


The WWW began in 1989 at the CERN laboratory to help simplify the retrieval of information from the Internet.

See the WWW's 20th birthday celebration page.

The idea underlying the WWW is that a user is able to jump transparently around the global Internet retrieving information without worrying about the four problems posed above.

Now, to answer the questions above...

  1. Q. Where do I look?
    A. The WWW

    The Web glosses over the hundreds of individual computers, directories etc.

  2. Q. What software do I use?
    A. A Web browser

    Only a single piece of software! The browser communicates using several high-level protocols and eliminates the need to master numerous pieces of software.

  3. Q. How do I use the software (web browser)?
    A. By clicking the mouse on hyperlinks or selecting them from a menu.

    What could be simpler? Previously, software was used by typing cryptic commands into command-line user interfaces.

  4. Q. What file formats do I need to decode?
    A. None, the web browser handles that for you!

    The modern browser will (sometimes with the help of plugins and helper applications) display images, play sounds, lay out text and interpret a diversity of file formats without you needing to lift your finger from the mouse button! You take this for granted now, but things were not always this simple.

The size of the indexed WWW


Figure: number of web pages in the indexed Web (Size of the Web statistics).

GYWA = Sorted on Google, Yahoo!, Windows Live Search (Msn Search) and Ask
YGWA = Sorted on Yahoo!, Google, Windows Live Search (Msn Search) and Ask

The Indexed Web contains at least 22.53 billion pages (Monday, 20 July, 2009)
See World Wide Web Size for details of their estimation technique


Making a personal mark on the WWW.

Originally, the WWW contained information posted by a few companies, research organisations and university academics who had, or could hire, the resources and skills to build a web site and set up a web server.


Personal homepage: a homepage was the original way to make a personal mark on the WWW. These were always "under construction" and often out of date due to the amount of time it took to maintain them.

Nevertheless, they are still popular due to their flexibility and the availability of software to edit HTML web pages easily and in a WYSIWYG fashion.


Social networking.
Sites such as MySpace, and lately Facebook, have become the most popular ways to make a mark and to interact with other like-minded people. These sites allow users to quickly and easily establish and maintain an online presence as long as they are happy with the restrictions the sites prescribe.


Blog.
Rather than just having a space online, people with something to say use a web-log (blog) to make posts that may include text, images, music, and links to other information. Readers follow these like they'd read a daily newspaper, or by subscribing to a feed. Readers can also comment on the posts.


Micro-Blog.
Sites like Twitter permit people to publish even the most mundane aspects of their lives in short snippets called tweets.


Content sharing.
Del.icio.us, Flickr and YouTube are content-sharing websites to which users can upload links, images and movies for others to view and comment on.


Other Notable Applications of the WWW.

Amazon

Find and buy goods from large retailers anywhere around the world and have them shipped to your door.

eBay

Find and buy (especially second-hand) goods from small retailers anywhere around the world and have them shipped to your door.

PayPal

Pay for things securely over the Internet using a credit card.

Google

Find places, look at street views and aerial photographs of (nearly) anywhere!
Find web pages, images, scholarly papers, books online.

Commonwealth Bank

Do your banking and pay your bills online.

Bureau of Meteorology

Receive current data on currency exchange rates, stock prices, traffic flow, weather, sporting results...


Identifying files on the Internet.


The Internet is a global network (of networks) of computers.

Every computer on it has a unique numerical address (an IP address) and a people-friendly equivalent. You can find out the IP address of a machine using the UNIX host command (type man host at a UNIX prompt to see how it works).

130.194.64.140 ...is the numerical address for our department's old web server... shelob.csse.monash.edu.au
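
The same lookup can be done from a program. This is a minimal sketch using Python's standard socket module; the function name lookup is an assumption, and the result for any real host name depends on the network and name servers at the time it is run (the old server name above may no longer resolve).

```python
import socket


def lookup(hostname: str) -> str:
    """Ask the system resolver for an IPv4 address for hostname."""
    return socket.gethostbyname(hostname)


# A dotted-quad string is already an address, so it is returned unchanged
# without contacting a name server.
print(lookup("127.0.0.1"))

# A real name needs a working network connection, for example:
#     print(lookup("shelob.csse.monash.edu.au"))
```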

 

The Internet is divided into domains, and subdomains.

shelob

is the machine name.

csse

is the Computer Science and Software Engineering subdomain.

monash

is the Monash University domain.

edu

indicates the address is educational.
What other extensions are there for different types of institutions?

au

indicates the address is Australian.
What other extensions are there for different countries?
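
The breakdown above can be reproduced by splitting the name at its dots and reading the domains from right to left, most general first. A small sketch; the variable names are illustrative.

```python
hostname = "shelob.csse.monash.edu.au"

# Split at the dots: the leftmost label is the machine, the rest are domains.
labels = hostname.split(".")
machine, domains = labels[0], labels[1:]

print("machine:", machine)
# Read the domains right to left, from most general to most specific.
for label in reversed(domains):
    print("domain:", label)
```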

Every file on a computer has a path name that is unique on that machine. Combined with the address of its host computer, every file on the Internet therefore has a unique name.


Steps for Retrieving Documents from the Web.

Computers on the Internet called name servers keep lists of numerical IP addresses & people-friendly names and translate between them.

1) A web browser (client) sends a request using HyperText Transfer Protocol (HTTP) for a document, specified by its unique name, to a remote (server) machine.

The unique file name is specified within a Uniform Resource Locator (URL)...

Protocol://server_domain_name/file_path

The protocol may be omitted within some web browsers in which case HTTP is assumed.
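
To see the three parts of this pattern pulled apart, here is a short sketch using urlsplit from Python's standard urllib.parse module, applied to one of the URLs used in these notes. Python calls the protocol the scheme and the server name the netloc.

```python
from urllib.parse import urlsplit

parts = urlsplit("http://www.csse.monash.edu.au/~aland/index.html")

print(parts.scheme)  # http
print(parts.netloc)  # www.csse.monash.edu.au
print(parts.path)    # /~aland/index.html
```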

Absolute URLs

http://www.csse.monash.edu.au/~aland/index.html

ftp://ftp.cs.monash.edu.au/pub/

are absolute because they include a protocol, a domain name and a path.

Relative URLs

index.html

../index.html

are relative because they leave out the protocol and domain name. These, and the starting directory for the path, are taken from (usually) the URL of the file currently open in the browser (often referred to as the base).
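
How a relative URL is resolved against a base can be sketched with urljoin from Python's standard urllib.parse module. The base here is assumed to be the absolute URL used earlier in these notes.

```python
from urllib.parse import urljoin

# The base: the URL of the document currently open in the browser.
base = "http://www.csse.monash.edu.au/~aland/index.html"

print(urljoin(base, "index.html"))
# http://www.csse.monash.edu.au/~aland/index.html
print(urljoin(base, "../index.html"))
# http://www.csse.monash.edu.au/index.html
```

The first relative URL stays in the base document's directory; the second climbs one directory up before looking for index.html.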

Locations within documents

http://www.csse.monash.edu.au/~aland/index.html#chapter

index.html#fred

The text after the # symbol indicates a location within the document specified by the URL.

These locations are named whilst the document is being created. The #location is an optional part of a URL. When would it be useful?
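
The #location can be separated from the rest of a URL with urldefrag from Python's standard urllib.parse module, as in this short sketch. The browser keeps the location to itself: only the part before the # is used to request the document, and the location is applied once the document has arrived.

```python
from urllib.parse import urldefrag

url, fragment = urldefrag("http://www.csse.monash.edu.au/~aland/index.html#chapter")

print(url)       # http://www.csse.monash.edu.au/~aland/index.html
print(fragment)  # chapter
```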

2) A web server program on the remote machine listens on a well-known port (port 80 for HTTP) for incoming requests.

3) The web server checks the client's access privileges. If all is well, it sends the requested document.

4) The browser displays the document retrieved from the server on the client machine in human-readable form.

A web document is anything accessed with a single request from a client to a server.
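
Steps 2 to 4 can be sketched from the server's side. The toy function below is not a real web server: it takes the first line of a request, checks it against a made-up table of documents, and returns a status line and a body. The names DOCUMENTS, PRIVATE and handle_request, and the document contents, are all hypothetical; a real server also reads headers from a network connection and sends response headers back.

```python
# Hypothetical document store: path -> contents.
DOCUMENTS = {
    "/index.html": "<html><body>Welcome</body></html>",
    "/private/marks.html": "<html><body>Marks</body></html>",
}
# Hypothetical access rule: paths under this prefix are refused.
PRIVATE = "/private/"


def handle_request(request_line):
    """Return (status line, body) for a request line such as
    'GET /index.html HTTP/1.0'."""
    parts = request_line.split()
    if len(parts) != 3 or parts[0] != "GET":
        return "HTTP/1.0 400 Bad Request", ""
    path = parts[1]
    if path.startswith(PRIVATE):      # step 3: check access privileges
        return "HTTP/1.0 403 Forbidden", ""
    if path not in DOCUMENTS:
        return "HTTP/1.0 404 Not Found", ""
    return "HTTP/1.0 200 OK", DOCUMENTS[path]  # step 3: send the document


status, body = handle_request("GET /index.html HTTP/1.0")
print(status)
print(body)
```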


Try this in your own time...

Commands to type.

Explanation.

telnet www.csse.monash.edu.au 80

Telnet to the school's WWW server (on port 80)

GET /index.html HTTP/1.0

Request the web page "index.html" using the GET command, which the browser would normally send for you. Follow your command with two carriage returns (press Enter twice).

>> The server should send you the HTML of file "index.html"

See? The protocol isn't magic: you can participate in it manually.
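
The same manual exchange can be scripted. This is a minimal sketch using Python's standard socket module: build_request produces the text you would type into telnet, and fetch sends it and collects the reply. Both function names are assumptions, and fetch needs a network connection and a server that still answers plain HTTP on port 80, so it is left uncalled here.

```python
import socket


def build_request(path: str) -> bytes:
    """The GET line followed by a blank line (the 'two carriage returns')."""
    return f"GET {path} HTTP/1.0\r\n\r\n".encode("ascii")


def fetch(host: str, path: str, port: int = 80) -> bytes:
    """Send the request to host:port and return everything the server sends."""
    with socket.create_connection((host, port), timeout=10) as conn:
        conn.sendall(build_request(path))
        chunks = []
        while True:
            chunk = conn.recv(4096)
            if not chunk:  # server closed the connection: the reply is done
                break
            chunks.append(chunk)
    return b"".join(chunks)


print(build_request("/index.html"))
# To try it against a live server:
#     print(fetch("www.csse.monash.edu.au", "/index.html").decode("latin-1"))
```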



This lecture's key point(s):



©Copyright Alan Dorin 2009