Operating system guide

All about Operating Systems

A / B / C / D / E / F / G / H / I / J / K / L / M / N / O / P / Q / R / S / T / U / V / W / X / Y / Z

Once integral part - even often the main part - of an OS, today more and more considered as an exchangable part of the OS. More and more OS's come with more than one file system on the market. So it's up to the user which one he uses. Nevertheless is the file system essential for the response times of the system - even in times of huge system-memories and fastest hard-discs. Although the file-system has undergone a considerable evolution, the perfect file system is yet to be found.

Before Shugart brought in the mid-70th his 8-zoll Floppy-disk-drives on the market only the high-priced mainframes had disk-systems. Shugart's floppy-drives were first one-sided(128 kByte), later double-sided(256 KByte), then in the beginning 80th the density doubled with the MFM-technique to 512 KByte(later 1024 and even 2048KBytes). Mid-range to low-range systems like the PDP's most often had at that time still tapes(DECTape). Even many VAX's had only tapes.

Although we can't go into much detail here, let's look at some stages: One early system was the Intel-ISIS system(mid 70th development system for 8008 and 8080). The Isis monitor - not really an OS - used a variant of the linked list for accessing the data on disc: before every block on disc was a header solely for the link data. This technique, which is an excellent technique for accessing data in memory, is the most awkward technique for accessing data on a disc: with this sort of layout you got to travers the whole chain of blocks to arrive at the position where you want to read or write. Random access was by design excluded. And since the processor was too slow to read a whole track in one go, it took multiple revolutions to get all blocks done. No wonder, the system and with it the processor had a bad reputation as beeing extremely slow.

CP/M was the first mass-system with estimated more than 4 million installations. CP/M had a very simple directory structure: at the beginning of the disc there was for every file a fixed entry in the blocktable, reserving a certain amount of blocks, if this wasn't enough, extension entries were needed. CP/M shared with this design the design-flaw of the Unix inode file system: you had to radically overdimension the blocktable or the number of creatable files was unecessarilly restricted once the filesystem was dimensioned. CP/M didn't know subdirectories, the fixed directory space could only be separated in several user areas.

Qdos, the predecessor of MSDOS, was a re-invention of CP/M ( this is often stated in literature, although it is not quite correct: The CP/M file system had much more in common with the Unix inode system than with MS-Dos)- with some (fortunate) misunderstandings. Tim Paterson -the one who wrote Qdos- laid the whole block-allocation-area as one piece at the beginning of the disc, with the pointers to the files following, which later opened the oportunity to have as many subdirectories as one liked. Because directories were now only normal files, no implicit limitation was given.

Today the world of file systems is more or less divided in three big parts: in the professional world the Unix file system - which is in it's essential parts still the inode system of the very first Unix - second there is the NTFS file system of Windows NT and in the semi-professional league and amateur-league you find the DOS/Windows FAT-System, today mostly in it's 32-bit variant.

File systems today

Many  file-systems are in discussion right now.  Transaction-
oriented (journaling-) file-systems which note every step they take and
which can therefore reverse/repeat their actions if necessary(eg  NTFS ), file systems
which behave much like data-base-systems or systems known as 'version 
control systems' from programming environments and extensions or 
modifications of other known systems. There are two topics which have
dominated and will dominate the discussion: speed and data-integrity. With
some minor modifications even systems with bad reputation in the latter
can be changed to well behaving systems as we will see below.(the term
security/safety  is used in the following in the sense of data-integrity or 
'integrity of the filesystem', not in the sense of security from intruders,hackers etc.)

Fault tolerance is one of the main issues of modern file system design. This has 
effects on all levels of the design of the file system. Redundant copies of allocation 
information, provisions for gracious recovery or even the multiple storage of 
the same file and/or versions of the same file are issues in this concept. One 
such file system is eg. the Elephant system. Even one step further go systems 
which do not overwrite older versions of the same file but keep them in difference 
files much like 'version control systems' known from programming environments.

 Despite the fact that todays hard disks are much more reliable than earlier ones
- to mention just one reason: the defect management of todays hard disks makes
bad spots nearly invisible - redundancy in allocation information and directory
information is still an issue. Not only power failure, system crash, program 
development demand for redundant information.

 But the ideal filesystem has yet to be invented. Eg the journaling filesystem
has inherent problems and brought new problems ( a journal itself can be
affected by errors,  integrity of a jfs in the millisecond to microsecond 
domain, journaling to ram, to NVRAM, what is if  NVRAM fills 
up.) and suffers from old problems(fragmentation, if  power fails during 
a write half-written files etc). Since journaling and database-systems 
are normally build on top of normal filesystems(eg the Unix-inode system)  
they inherit all the problems of the underlaying system. Journaling  makes 
a slow filesystem even slower:synchroneous I/O, when many files are 
created, renamed or changed cause additional write actions.

 A general preface may be in order: todays OS's are much more dependant
on a fast file system than earlier ones. The normal user associates with a file-
system only opening and closing of user-programs and reading and writing
by these programs. But a lot of hidden disc accesses are today associated with this
seemingly simple task: reading and writing a sometimes huge registry, loading dll's,
reading and writing (multiple) configuration files, macro files, password files and graphic 
symbol files, establishing links, writing log files, loading fonts, textures, drivers, 
reading/writing the cache-file.... Just to give you an impression: Simply starting a 
program like the Microsoft Developer UI causes 11.645 different OS-file-system-calls 
(open/read/write/close). And there are even more complex programs.  
Sequential reading and writing as well as random access are equally important. 
Fragmentation algorithms which help reduce fragmentation only partially  
are therefore only a half-way solution.

How slow todays file systems really are you get to know if you write programs
which must reference every file on  disc: indexing programs, virus checkers,
cataloging programs... It takes half an eternity to scan a 300 GByte hard disk.
Compact directory structures as in NTFS show the biggest advantages here. (Please
don't mix this up with fschk or scandisk: they can and they do circumvent performance 
critical parts! Nevertheless these too need half an eternity on a 300 GByte disk.) One of
the most heard questions I get on my pages:"where can a get fast grep?". Or "why
grep is so slow?".

  File-sytems get obsolete easily. I don't mean by this the kind of
systems which could be found till the beginning of the 80's, which were
based on linked-lists and therefore were incappable of random-access(Eg
ISIS).

  The High-Performance-Filesystem HPFS (OS/2) was obsolete right at the time of
its introduction into the market. It had the directories organized so that the
physical ways for the heads on the disk were optimized in
directory-operations. Directories were placed in bands on the disk so that the
mean way between data area and directory area was optimized.  The 3 main
time consuming factors which slow down disk access should get optimized
this way: step-time,head-settle-time and disk-latency(this is the time
the disk needs to rotate until the desired sector is under the head).
Although it gets easily forgotten that every cache has to be filled at
least once before it can function as a cache(which means on the other
hand that operations which reference an item only once have no benefit
from caching), the upcomming of note-worthy caches but more a new type
of disks made this system old-fashion: disks which show to the outside
an geometry which is (very) different from their internal real geometry.
This started when harddisks became intelligent and the disk-manufacturers
could implement a variable-sector scheme to improve the capacity of the
disk. Translation operations(LBA/CHS/...) used today can even turn the effect into 
its contrary. It is clear that a system, which tries to optimize physical ways
gets fooled by a disk which behaves in such a way. It's main design flaw
was to overemphasize the administration on disk, while at that time it was
already possible to administer the disk in Ram - with the right administration
algorithms naturally.  (The sentence with the cache-topic is not to be taken 
verbatim: caches are a complex subject. With read-ahead caches the 
statement may be false!)

  Since file-systems were treated in the course of discussion 
of the micro-kernel only as an exchangeable part of the system,
file-systems seem to get neglected what concerns their evolution: 'if
you don't like type x, take type y !' In reality this means often to
exchange one poorly engeneered file-system by another. The end-user, who
doesn't know anything about the internals of file-systems is left alone
with a decission he can't make. This leads to some monstrous
scientific discoveries in some Linux-Books eg, that Linux is not
affected by such earthly things as fragmentation , which
are reserved for the bad and ugly filesystems like the Dos-fat
or is by nature much safer etc etc.

  As an example lets take the file-system of  Plan9 (also see INFERNO),
the succesor of Unix at Bell-labs. It is essentialy the file-system of Unix with 
some minor modifications. (you may ask: which is the filesystem of Unix? Since Unix
behaves as a chameleon also in this subject, the question is not so far
away. But for our discussion here they got enough in common - up to Ext2fs and
Ext3Fs if we neglect journaling, in a restricted sense  even XFS, JFS, ReiserFs, 
since they are all i-node based -  to generalize the topic. If you want to 
read further, read eg MJ Bach, "The design of the Unix operating system" 
by Prentice-Hall, or take a look at the description of Ext2fs at sunsite.unc.edu 
or any other description which describes too the essentials.)

  The Unix file-system had an educative attitude: Since it was believed
at the time Unix was built that executables over 5kbyte could never get
debugged, every attempt to build exe's greater than 5-10k was punished.
(Am I  the only one to remember this statement?)
This was one of the intentions behind the poor direct adressability of
10 blocks (0-9) in the inode(with Linux Ext2fs 12, which makes no
difference).  Blocks were 512 byte big at that time, which gives a
filesize of 5k. If the file was bigger then 5k, indirect addressing
takes place which means that adress 10 adresses a block somewhere on the
disk which contains the next 128*  adresses. If the file was even bigger
than the so addressable 70 k, double indirect adressing took place:
adress 11 adresses a block somewhere on the disk which itself adressses
128*  blocks somewhere on the disk which contain the real adresses of the
blocks of the file. Guess what happens with adress 12 ? Indirect-indirect-indirect
addressing means that the first block points to 128*  blocks somewhere on
the disc which themselves point to 128* blocks somewhere on the disc which
at good last contain the real addresses. If you wondered
about the many 'somewhere's', the word 'somewhere' is the important
point. Since there is no dedicated part of the disk for the
adress-blocks nor for the directory-blocks, these blocks are really
totally scattered over the disk.  This means a real punishment took (and
takes ) place: since the big 14" hard-disks (and now forgotten drum-disks)
at those times were nearly as slow as floppies nowadays, the stepping time
plus the head settle time plus the disk-latency time get multiplied by
the indirect or indirect-indirect or even indirect-indirect-indirect
address finding process. Now remeber: we haven't read a single byte of
our file, this has also to happen in between. What's more, in the Unix file-system 
even the file-name is separated from the other two parts of the
directory-information in extra blocks. And last but not least, if all of
this isn't catastrophic enough, the directory-information as well as the
adress-blocks are totally normal file-blocks on the disk, meaning they
are subject to fragmentation as all normal files on the disk. A good
principle - non dedication - is turned here into its parody.

(* todays Unix systems changed from 128 to 256, which makes no difference)

 Caching can't cope with this scattering.  Absolutely no benefit can 
be expected from eg read ahead caching. One could - compared to the 
HPFS-filesystem described above - describe this way of getting 
the needed information: " Is there any  farther way we can take?"
or "is there any deviation we have omitted?" and the whole system as LPFS, 
or "least performing file system". This whole system misses any feeling for
the subtleties of a mechanical device such as the hard disk( to mention just one
fact: large swings of the head are much slower than short swings or single 
steps (or even no step), etc,etc..). Harddisk manufacterers could mark their 
harddisks which stand the Unix-test: "Unix approved".

  The file-name was separated from the other two parts to have an
elegant way to implement aliases in the form of hard-links which could
reference the same i-node-entry. The latter is the only change in the
Plan9-filesystem. File-names are now part of the i-node.  This seems to
be partly due to the late recognition that hard-links are essentially
worthless: they can only reference files in the same filesystem
(filesystem crossing references must be implemented with soft-links
which can easily be implemented in every filesystem) Many later systems
had shown: Links can be implemented at much less costs, even in systems
which originaly know nothing about links.(Even a link counter)

  A short repetition: The i-node is first placed on the disk (along with
some other information, boot, superblock, sometimes bitmaps for i-node
and data-area). It follows the data-area comprising data, adressblocks
and directory-information ( the latter beeing part of the i-node in
Plan-9). This means that once the i-node is dimensioned to, say 40.000
entries, this is fixed eternally. If someone has many small files, he
won't be able to fill the disk because there are no more i-node entries.
If some other have only really big files, much of the disk-space is
wasted because the over-dimensoned i-node eates valuable
no more available disk-space.

What's wrong with this file-system ? All and everything. It stands on
its head, it is at sixes and sevens. Since a disk (or a partition) has a
fixed number of sectors or blocks (or more generally capacity), this has to be the fixed
directory-information, the place which takes in unix the i-node. Since
files - especialy the number of files on a disk - is the variable part
which depends on the user and his applications and may change with time,
this has to be the extensible part of the directory-information, which
is fixed in the case of Unix in the inode. To fix this variable part
means that you unnecesarilly have to over-dimension the number of
permitted entries or the user gets at a certain point the message: 'Out
of i-node-entries' - despite the fact that the disk still has plenty of room.

Whats more the Unix-filesystem has absolutly no advantages
in beeing quick - quite the contrary as we have seen - or
beeing safe - there is no build-in redundancy whatsoever - although some
filesystems as Ext2fs try to implement some rudimentary redundancies.
With todays blocking factors of 4-8 kbyte and more - else the system
would show its slowness too soon - every little file of 200 byte
consumes 8kbyte(think of the many icons or the many scripts...) and
every, yes every file on the disk wastes 4 kbyte(with 8kblocking) at
least on the disk - the last block of every file is statistically always only half filled.
This means if you got a 50Gbyte disk with 800thousand
medium sized files on it you waste 800.000x4000= 3.2Gbyte or nearly 10%.
This was a short while ago a harddisk. Not included the wasted space by small
files(and a overdimensioned i-node). The BSD-file system uses disk-
space much better, at the price of even lower execution speed. XFS, JFS,
ReiserFS have realized that there is something wrong with the
whole scheme and implement extensions to the i-node and use B-trees
for addressing. If you want to read further, start reading
with eg Linux Gazette Nr55 "Journal File Systems".

  So - after all - if we put the Unix-filesystem on its feet, where do
we arrive at ? Essentially at the web-wide ( and in all books and 
magazines ) much blamed Dos- or Fat-filesystem. It 
has at the beginning of the disk in a fixed part for every block on the
disk exactly one entry, which tells to what file (if at all) this block
belongs. The filename/attributes and a pointer into this block-table is
the variable part, which finds a place somewhere on the disk (except the
root-dir/FAT16). So if Microsoft wouldn't have done such an ugly job in
patching in long filenames in the fat-system and if there would be only
a little bit more redundancy (although Dos-Fat has with it's second Fat 
more redundancy than many other FS) and/or at least error-checking, 
the fat-system could  at least in a true 32bit* or even 64bit implementation 
and with some minor modifications and improvements survive many of it's 
much praised competitors. *(only 28bits are used in 32-bit fat.)

 Back to Plan9: the Unix-filesystem had a certain logic: all this catastrophic
layout was a result of the wish to implement hard links. Now that this
prerequisite is given up in Plan9, the whole layout of the Plan9-filesystem
makes absolutely  no sense.

By the way: The ext2fs - blockgroups are a step in the right direction. But they
can't eliminate fragmentation altogether. I don't know if they go back on my 
proposal, but I made a very similar proposal (going even further, see below) 
in my below referenced book. This principle is easily adaptable to every file 
system.

A last word to the Unix file system. An investigation in the 80ths  showed that
the Unix file system delivers 5-7 percent of the maximum possible throughput. The 
results of this investigation were published everywhere. It seems nobody
has read it. And one thing should be clear: file systems have nothing in common 
with good old vine, which gets better the older it is.

And ah well, the FAT-FS has many problems as it is by now, but none that
couldn't be solved. This industry works with the foresight of a mole, one of 
the latest examples for this beeing the layout of the 32bit FAT system. 
Because it was absolutely not foreseeable that harddisks will ever surmount 
the 1-10 Gigabyte barrier, 28 bits out of the 32bits were considered to 
be enough (EIDE's support of only 28 bit* is a totally different problem - although 
not in the sense of farsight... And there is also SCSI.)  Now  there are 
the same problems as  4 years ago. And nobody  would say a word if 
such slips would happen to a one-man company... Or should this all be intention??


Finally...

So after our little tour through the world of file systems we can finally say
that all highly appraised systems till today have the same design flaw: they
interchange the fixed and the variable part of the administration information.
What did the Unix  developpers hope to gain by this scattering of  information 
which belongs together? It seems it is a psychological question: the amount of 
information looks smaller if you write it by and by. But in the end your disk doesn't 
get smaller by this ! You only have scattered  information which belongs together 
all over the disk! And this means you have produced an additional - unnecessary - 
cause of  fragmentation!

The argumentation that you can read concerning the 28bit Fat is of similar
naivety. At first: a 28-bit number consumes not a single bit less space than a 32-bit number.
At second: how stupid do you think we programmers are, that you suppose we would 
put the whole FAT 1:1 in memory? There are 1000 and 1 different possibilities
to compress or otherwise intelligently handle this information. In times of
CP/M we had to manage 32k of directory information in 64k of memory! And
still do some caching etc (and the OS and the programs too had to reside in 
these 64k). Compared to this the relationship between administration 
information and memory is today totally relaxed: a 8 MB FAT compared
to the 256 or 512 MB which you can find in the normal PC today is a 
relationship of 32:1 or 64:1 !

But also in other regards it seems when it comes to file systems normally very 
intelligent people get problems and escape to mythological explanations. To 
cite just one  of these often heard allegations : the Fat-system is unreliable 
because it has all information on one place (= the Fat)(you can read this 
allegation on many places, so please take this one only as an example. This 
is only a variant of the allegation that the Fat-system is unreliable because it has
all information on the first track.). Normally not necessary to argue against, but 
since you hear it that often: please, what's with  the MFT or the i-node? Are 
they protected by some mythological gods? Is there any proof that the first track(s)
accidentally more often gets overwritten than any other track?

Modern filesystem

Since at least twenty years it is clear how a modern filesystem should
look, no matter how organized. The first demand is: a filesystem
has to be fast. This means that the whole administration has to be done
as far as possible in Ram. A modified caching algorithm, which keeps the
interesting parts of the allocation map and directory information permanently
in Ram could secure this. There was and there is no problem in keeping this
information in Ram- normal caching is too slow and should be reserved for
the data area. (Since harddisk capacity has grown continously and ram
prices have fallen continously the ratio of allocation information to ram
capacity has been almost constant over the last 20 years. ) (Just to prevent any
misunderstanding: keeping this information in ram has no effect on the updating
of disk-data ! Allocation data on disk has always to be up to date by write through.)
Further this means in the case of block-oriented systems that the allocation-
information and directory information have to be compact. Besides minimizing
head movement this way full benefit from read ahead caching can be taken. Although stepping
times, head settle times and disk latency all have been reduced in the last years, they
all have been reduced only proportionaly to the grown capacities of todays harddisks.
And at good last this means a decission which is overdue since years and renders irrelevant many
of the above mentioned pseudo-new concepts: a buffered power-supply is needed,
which costs in mass-production a mere ten dollar or twenty more than todays
power-units, since buffering is needed only in the millisecond range (at least
in countries with weak energy supply).

 All other cases of file system inconsistency eg in program development are easy 
to handle in a well designed fault tolerant - "forgiving" - file system.  A good file 
system should be invisible, transparent to the average user, like ram or the cpu.

Fragmentation prevention

I was asked to give the fragmentation prevention algorithmen, so here it is: a good 
fragmentation prevention algorithm tends to produce by itself bigger and bigger 
holes of free allocation blocks. At  the same time it has to be fast, not consuming 
too much time for calculations. Now if you always simply take the biggest free hole in the 
allocation map and preallocate in this free hole a system specific medium sized 
area of allocation blocks for every  file to be written, you will observe that this 
algorithm tends to produce by itself bigger and bigger continuous free areas. Why is easy to 
see: since small free blocks are avoided (at least till the disk is nearly full) they 
accumulate and finally melt to bigger free blocks. This algorithm is thus quite the 
contrary to the 'dumb' allocation algorithm as it is implemeted in different variants in
most OS's today such as DOS, UNIX etc which simply take the next free block (or 
next 10,20,50.. free blocks).(preallocation is necessary because not only in a multitask/multithread
environment many files can be open which concurrently demand free blocks. After closing
a file the not used space is given back to the pool of free blocks.) This is a proven
algorithm that works (since the 80ths!). You simply need no more defragmenters  if 
your system defragments while writing.

Although fragmentation is on the normal user system no more that big an issue as it was - due to 
the big harddisks today - no fragmentation is better than little fragmentation. And in systems in
which even the administration information is subject to fragmentation a really good fragmentation 
prevention counts double and triple!


 So if you're going to construct a better file system,  always remember the topmost rule of 
every good engineer : Kis, keep it simple! A filesystem is no end in itself, it's academic beauty or 
correctness doesn't count.  It has to deliver two services: persistence and extension of the always 
too small system ram. Nothing more. But that fast.

(slightly actualized excerpt of the book:"Multi task,Multi user, Multi processor systems",R.Cooper-Bitsch,1989
Frankfurt/Thun.Written 1987 now 15 years ago.)

Journaling file system

In the case of abnormal system situations provisions must be taken for a gracious recovery of the system. While early systems used tools like scandisk or fsck to guarantee the safe state of the file system, most modern systems use some sort of Journaling or logging of metadata. While the process of scanning a disk is an undirected operation - the whole disc is scanned - the process of finding the fault in a journal is a directed operation and thus very much faster: we immediately find the place of error. Time - in this case uptime of the system - is the gained resource.

But we pay a high price for this gained time: Journaling - if not done to (battery backed-up) ram (or nvram or..) - can slow down a file system considerably - and thus the response time of the whole system in its every day usage. If the underlaying file system is already by design slow, this can be crucial. (For newbie's it may be necessary to mention that a journaling file system does not make superfluous tools like scandisk or fschk alltogether! The journal catches only file system inconstancy errors. Hardware errors eg must get detected as usual with fschk.)

To get a better insight into the problems of journaling we should first distinguish clearly the different reasons for recovery as there are: system crash, power failure, data consistency/integrity failure. Second we must distinguish the underlaying systems: professional usage, which most often have a power bakup (uninteruptable power suply) and "normal" systems with no provisions for power failure - lacking even a buffered power supply which could prevent in case of power failure the worst.

Most systems today log only meta data changes. File system consistency after a crash is guaranteed by these systems because all of the metadata updates are in the journal and can be applied to the file system for committed transactions. On the other hand this means for not finished transactions that the actual data is lost - eg the file to be written. Writes should therefore only get acknowledged after the data is written to its permanent position on the disk. But this means a further degradation in response time of the system.

After power failure or a system crash the actual data is lost - no metadata journal in the world will bring it back. In a multi task/multi user environment this can be quite a lot of data. This is the reason why there are by now systems emerging which even write the data itself into the journal (to write it afterwards by normal file system actions on disc). One reason for this strange double writing is the slow file system: a special area on disc is reserved for the journal which means nothing less than a circumvention of the slow Unix file system (the justification for this procedure is unmasking: the perceived!! write time should get lowered). One silly question: what is if power fails/ system crashes while we write our data to the journal?

Journaling to NVRAM does not suffer from the described problems - better said, it minimizes these problems - but has others. Especially if we want to use the just described intermediate saving method, our NVRAM soon will show that it is not endless. Another major drawback of writing the journal to NVRAM lies in the complications when sharing the state of the file system with another system is required....