System calls are the fundamental API provided by an operating system
to application programs. For example, the read()
and
write()
system calls are the basic mechanisms for doing file
input and output in Linux. All other file I/O routines, such as C++'s stream
operators (<<
and >>
) or Python's
file methods (file.read()
and file.write()
)
are built upon these C-interface system calls.
To make sure we never take these high-level interfaces for granted, we're going to work with Linux's system calls directly! We will use Linux's file manipulation system calls to write a program that compresses and decompresses files.
In this lab, you will:
open()
, close()
, read()
,
and write()
system calls to do file I/O
This assignment can be completed in groups of three, though you may work by yourself or in a group of two if you wish.
rle
- compress or decompress a file with run-length encoding
rle <input file> <output file> <compression length> <mode>
input file
: the file to compress/decompress
output file
: the result of the operation
compression length
: the base size of candidate runs
mode
: specifies whether to compress or decompress- if mode=0, then compress the input file, if mode=1 then decompress the input file
rle
implements run-length encoding for the compression of files.
You must use the read()
and
write()
system calls (documented at man 2 read
and
man 2 write
, respectively) to read the input file and write to
the output file. Additionally, your program should only output runs of
up to length 255, this is so the run-length specifier is guaranteed to
always be one byte.
Run-length encoding is a compression technique that identifies "runs" of
repeated characters and represents these compactly. The length of each run
is counted, and the base is stored along with the number of repetitions
of that basis. For example, the string AAABBBBBB
consists of
nine bytes, but it could be instead represented as 3A6B
, where
"A" and "B" are the basis of each run and the numbers give how many times
each base is repeated.
The base does not need to be length-1, and this is what the third program
parameter above specifies. For example,
the string ABABABCDCDCDCD
compresses very poorly with length-1
encoding to 1A1B1A1B1A1B1C1D1C1D1C1D1C1D
, which is an expansion
from 14 bytes to 28 bytes. However, if we allow our base to be length-2 then
we can represent the above as 3AB4CD
, which is a reduction from
14 bytes to 6 bytes.
Each run should be a maximum of 255 repetitions. This is so the run-length specifier can always be represented as a single byte (recall that an 8-bit unsigned integer can store values from 0-255). For example, if you had the character "A" stored 300 times in a row, then with a length-1 encoding your program should produce "255A45A" rather than "300A".
One can imagine many versions of RLE that are optimizations of the above principle. For this assignment I ask you to implement the simplest algorithm that mimics the behavior given above. In particular, given a run-length parameter K, then starting at the beginning of the input file:
Decompressing files is much easier:
rle
detects the following errors and quits gracefully:
strtol()
)
open()
, close()
,
read()
, or write()
- use the
function perror()
to print useful error messages
Upon encountering any error, print a useful message and exit()
with a negative status code.
If no error is encountered then the program should not produce any output to standard output, and should only modify the output file.
You can download these files to your local machine with the
wget
program from the Linux command line. See
man wget
for details.
You can use the xxd
program to inspect your program output.
Many text editors do not handle non-printable characters gracefully, but
xxd
prints the underlying binary data in hexadecimal. For example,
the test1
test case contains the sequence
AAAABBBBBBBBCCCCCCCCCCCCAAAAAAAAAAAAAAAA
followed by a new-line
character. The xxd
program generates the following output:
[dferry@hopper lab1]$xxd test1 0000000: 4141 4141 4242 4242 4242 4242 4343 4343 AAAABBBBBBBBCCCC 0000010: 4343 4343 4343 4343 4141 4141 4141 4141 CCCCCCCCAAAAAAAA 0000020: 4141 4141 4141 4141 0a AAAAAAAA.
The left column gives the location of the displayed data within the file. The middle section displays sixteen bytes of hexadecimal data (remember that one hex number describes four bits, so two hex digits describes one byte). The right section displays the printable-text equivalent of the hexadecimal data. Looking from left to right we see byte 0x41 repeated four times (ASCII code for 'A'), then 0x42 repeated eight times (for 'B'), then 0x43 repeated twelve times (for 'C') and then 0x41 repeated sixteen times (for 'A' again). The last byte in the file, 0x0a, is the New Line character: since this isn't a printable character, it is represnted with a period in the right hand side display.
Thus, compressing the first file acording to the specifications above gives the following outputs:
[dferry@hopper lab1]$./rle test1 compressed 1 0 [dferry@hopper lab1]$xxd compressed 0000000: 0441 0842 0c43 1041 010a .A.B.C.A..
Notice we have the value 4 followed by the code for 'A', then the value 8 followed by the code for 'B', then the value 12 (hexadecimal code 0x0c) followed by the code for 'C', and the value 16 followed by the code for 'A' again. Continuing:
[dferry@hopper lab1]$./rle test1 compressed 2 0 [dferry@hopper lab1]$xxd compressed 0000000: 0241 4104 4242 0643 4308 4141 010a .AA.BB.CC.AA..
[dferry@hopper lab1]$./rle test1 compressed 4 0 [dferry@hopper lab1]$xxd compressed 0000000: 0141 4141 4102 4242 4242 0343 4343 4304 .AAAA.BBBB.CCCC. 0000010: 4141 4141 010a AAAA..
[dferry@hopper lab1]$./rle test2 compressed 1 0 [dferry@hopper lab1]$xxd compressed 0000000: 0142 0341 0242 0141 0242 0143 010a 0141 .B.A.B.A.B.C...A 0000010: 0142 0241 0142 0141 0142 0341 010a 0541 .B.A.B.A.B.A...A 0000020: 0143 0441 010a .C.A..
[dferry@hopper lab1]$./rle test3 compressed 1 0 [dferry@hopper lab1]$xxd compressed 0000000: ff41 2d41 010a .A-A..
read()
and
write
system calls with studios 1 and 2 before attempting
this lab assignment.
open()
carefully. The output
file may not exist, or if it does exist you should overwrite it starting
at the beginning. Thus, you should specify the O_CREAT and O_TRUNC flags, and
because you specify O_CREAT you should also specify (at a minimum) the
S_IRUSR and S_IWUSR file permissions.
diff
program to compare before/after files and it highlight any
differences. This is an easy way to detect whether or not a decrypted file
is identical to the original source, especially when the files are too large
to inspect visually!
ls -l
command to see how many bytes are
in each file.
read()
system call returns how many bytes it has read.
This is useful info- you might ask the read()
function to read
in four bytes, but it might not be able to, such as if you're at the end
of a file. This also tells you when there is no more input to be read: you can
keep reading until read()
returns a 0 (end of file) or -1 (error).
The following man pages may be useful:
open
(2)
close
(2)
read
(2)
write
(2)
strncmp
(3)
memcpy
(3)
atoi
(3)
strtol
(3)
perror
(3)
errno
(2)
exit
(3)
xxd
(1)
diff
(1)
Which file do you think will compress best? Compute the compression ratio for each file with a compression length of one.
time
, which records how
long a command runs. Record how long it takes to compress the file test4
.
The syntax in this case is "time ./rle test4 output 1 0
".
This will report three measurements: real, user,
and sys. How long does your program take to run in real time?
The compression algorithm you have implemented is a form of lossless compression, meaning that the entire uncompressed file can be reconstructed and no data is lost. This approach works well for highly structured data with lots of repitition. This can happen naturally in a case like slu_logo.bmp, where the whole image has just a few colors and lots of large same-color regions. Compute the compression ratio of your algorithm with slu_logo.bmp. Then, open the file in a photo editor (such as Microsoft Paint, Photoshop, or GIMP) and save it as a .png file (lossless compression) and a .jpg file (lossy compression). Compute the compression ratio for each method.
Please submit your lab to your course git repository. Your code should be
entirely contained in a file called rle.c
, and your
question responses should be included in an appropriately named text file.