============== File Archives ============== **Due: Wednesday, March 13th at 11:59pm** (Notice the slightly later-than-usual deadline.) *You must work alone on this assignment.* The goal of this assignment is to give you practice with class inheritance and compressing data. Data compression ----------------- Data Scientists have to work with data of varying sizes. Data may come from a wide-range of places such as on the internet or from physical storage devices. As we all know, computers can take significant amount of time to download, transfer, and/or upload a large data file. *Data compression* is one way to help ease this burden. Compression takes some input data and encodes it using fewer bits than the original file. Essentially, it is a way of reducing the amount of space a file takes up on a storage device. Once compressed, the data must then be decoded before you can interact with it again. This process of compressing and decompressing a file can vastly reduce the storage and bandwidth resources necessary to acquire, view and manipulate the data. The trade-off in reducing data size via compression is that we spend additional computational resources to encode and decode the data. Compression works by using a formula or algorithm to determine how to reduce the size of the data. For example, text compression can be implemented simply by removing all unnecessary characters, or inserting a single repeat character to indicate a string of repeated characters. Most compression techniques fall into two categories: *lossy* and *lossless*. When a file is compressed using a lossy algorithm, the data is compressed approximately rather than exactly. Once decompressed, the data will roughly be the same but not necessarily identical to the original data. This technique is fine for multimedia applications transferring compressed video on the web. The subtle video quality difference is often not noticeable by the human eye. Lossless algorithms ensures that when a compressed file is decompressed, it matches the data exactly as it was before compression was performed. These algorithms are usually used for text and database files. For this assignment, we will focus on implementing a lossless algorithm since we are solely working on textual data. Run-length encoding ------------------- Run-length encoding is a lossless algorithm that identifies repeated *runs* of data (that is, contiguous sequences of the same data value) and represents those runs as a single instance of the value and the length of the run. For example, given a string containing the following characters:: WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW The run-length encoding for the above text is:: 12W1B12W3B24W1B14W This is interpreted as a sequence of - twelve Ws, followed by - one B, followed by - twelve Ws, followed by - three Bs, followed by - twenty-four Ws, followed by - one B, followed by - fourteen Ws The algorithm compressed the original data of 67 characters into only 18 characters. Run-length encoding is heavily used in multimedia data, where this repetition occurs more frequently. Here are a few more examples:: DDDDDDDCCCCCCCCCODDDDDEEEE -> 7D9C1O5D4E 337444MMMMMABC -> 2317345M1A1B1C As you can see with the last example :: 337444MMMMMABC -> 2317345M1A1B1C it can be quite difficult to differentiate the counts and actual data when the data contains numbers. Another method of the run-length algorithm is to encode run lengths for runs of two or more characters only, using an "escape" symbol to identify runs. Most implementations use the character itself as the escape; therefore any time a character appears twice it denotes a run. Using this method, the above example :: WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW can be represented as:: WW12BWW12BB3WW24BWW14 Most implementations using this method tend to separate the data and escape symbols from the run lengths, so that the two can be handled independently. The algorithm produces two results. The encoded string (minus the run lengths):: WWBWWBBWWBWW and the run lengths separately (e.g., as a list):: [12,12,3,24,14] For this assignment, you will implement run-length encoding where the run-lengths are separated from the encoded string. Decompression ~~~~~~~~~~~~~ Given the encoded string and run lengths separately:: encoded_str = "WWBWWBBWWBWW" run_lengths = [12,12,3,24,12] decoded_str = "" If you encounter a sequence in the ``encoded_str`` with two characters that are the same then pull the first number ``N`` off the ``run_lengths`` list and duplicate the character ``N`` times inside the ``decoded_str`` string. Otherwise, insert single characters directly into the ``decoded_str`` string. For example, starting from the beginning of the ``encoded_str``: - ``"WW"`` matches to the number at ``run_lengths[0]``: ``"WWWWWWWWWWWW"`` - ``"B"`` is a single character. Insert ``"B"`` into the ``decoded_str``: ``"B"`` - ``"WW"`` matches to the number at ``run_lengths[1]``: ``"WWWWWWWWWWWW"`` - ``"BB"`` matches to the number at ``run_lengths[2]``:``"BBB"`` - ``"WW"`` matches to the number at ``run_lengths[3]``: ``"WWWWWWWWWWWWWWWWWWWWWWWW"`` - ``"B"`` is a single character. Insert ``"B"`` it into the ``decoded_str``: ``"B"`` - ``"WW"`` matches to the number at ``run_lengths[4]``: ``"WWWWWWWWWWWW"`` Concatenating the strings above will then produce the original data:: decoded_str = "WWWWWWWWWWWW" + "B" + "WWWWWWWWWWWW" + "BBB" + "WWWWWWWWWWWWWWWWWWWWWWWW" + "B" + "WWWWWWWWWWWW" is equal to:: "WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW" Repeated Words -------------- Another type of lossless algorithm is one that replaces repeated words in a given text with an unique encoding. For this assignment, we will define a word to be any sequence of characters, where each sequence is separated by a whitespace. For example, let's say you had the following text: :: text = "If you fall I will catch you, I will be waiting" The unique words in this text are the following: :: words_in_text = {"If","you","fall","I","will","catch","you","be","waiting"} If any of the unique words in the given text are repeated then we will replace them with an encoding as follows: 1. Define a unique counter and set it to zero (``unique_counter = 0``). 2. Define a delimiter value and assign it to any single character (e.g., ``delimiter = ":"``). 3. For each unique word and if the word is repeated then - The encoding will be the ``chr`` of the current value of the unique counter (``encoding = chr(unique_counter)``). - Increment the ``unique_counter`` by 1. This allows for the next repeated word to have an unique encoding. - Replace the repeated word in the text with the ``delimiter`` followed by ``encoding``. The above example using ``text`` will have the following encoding: :: text = "If you fall I will catch you, I will be waiting" encoded_text = "If you fall :\x00 :\x01 catch you, :\x00 :\x01 be waiting" Note you **must** process the words in the order they appear in the text. For example, if ``"I"`` and ``"will"`` are the first repeated words in ``text`` and ``"I"`` comes before ``"will"``, then ``"I"`` encodes as ``encoding = chr(0)`` and ``"will"`` encodes as ``encoding = chr(1)``. In order to later expand this encoding, you must also build a dictionary that maps the encoding for a repeated word to the original word, :: encoded_text = "If you fall :\x00 :\x01 catch you, :\x00 :\x01 be waiting" encoded_dict = {":\x00": "I", ":\x01": "will", } When decompressing the ``encoded_text``, you will use the ``encoded_dict`` to replace the encoding with its original word. Task 1: Compressors ------------------------------------- Your first task is to implement two of classes that will inherit from a base class called ``Compressor``. These compressor classes should provide methods that compress and decompress text data using the lossless algorithms described earlier. Specifically, you will implement the following classes: - ``RLECompressor`` - a class implementing the run-length encoding algorithm. - ``WordCompressor`` - a class implementing the repeated words algorithm. - ``Compressor`` - the base class for the two above classes. You will finish the remaining unimplemented methods and properties. We have seeded your ``pa6`` directory with a file named ``compressors.py``. You will find stub versions of the ``Compressor``, ``RLECompressor``, and ``WordCompressor`` that you will finish implementing. Compressor ~~~~~~~~~~~ All compressors contain an internal buffer for holding text data that will eventually be compressed or decompressed. Think of this "buffer" as a simple string where new data will be appended to it via a method called ``read``. For example, :: compressor = Compressor() # buffer = "" compressor.read("WWWWWWWWWWWW") # buffer = "WWWWWWWWWWWW" compressor.read("BB") # buffer = "WWWWWWWWWWWWBB" As stated earlier, ``Compressor`` represents the base class that all other compressor classes will inherit from. We have already implemented the following methods, - ``__init__`` - the constructor defines the internal buffer attribute and sets it to be empty - ``read`` - this method adds on text to the internal buffer as shown in the above example. - ``clear`` - this methods resets the internal state of the compressor. Think of this method as clearing out the compressor to allow for new data to be compressed/decompressed. The base ``Compressor`` class acts as a do-nothing compressor. It does not actually compress or decompress any of its internal data, but rather the ``compress`` and ``decompress`` methods should return **exactly** the contents of internal buffer. For example, :: compressor = Compressor() # buffer = "" compressor.read("WWWWWWWWWWWW") # buffer = "WWWWWWWWWWWW" compressor.read("BB") # buffer = "WWWWWWWWWWWWBB" s = compressor.compress() print(s) # prints "WWWWWWWWWWWWBB" s = compressor.decompress() print(s) # prints "WWWWWWWWWWWWBB" Finally, you will implement the ``ratio`` property. This property determines the ratio between the uncompressed size and compressed size of the internal buffer. :: ratio = uncompressed size / compressed size Usually, this computation is done at the file level in terms of bytes, megabytes, etc. For this assignment, the size is represented by the number of characters in the compressed and uncompressed buffer. Finish the implementation of the ``Compressor`` class by completing the ``compress`` and ``decompress`` methods and ``ratio`` property. RLECompressor ~~~~~~~~~~~~~ The ``RLECompressor`` class is a compressor that contains an implementation of the run-length encoding algorithm. This class **must** inherit from the ``Compressor`` class and contain the following constructor: :: def __init__(self): ''' Constructs a run-length compressor that initializes the internal buffer and clears its run_lengths property. ''' The ``RLECompressor`` class **must** define a property called ``run_lengths``. This property represents the list of run_lengths once the buffer becomes compressed. For example, :: compressor = RLECompressor() compressor.read("WWWWWWa111BB") print(compressor.run_lengths) # prints [] s = compressor.compress() print(s) # prints "WWa11BB" print(compressor.run_lengths) # prints [6,3,2] The run-length algorithm is implemented inside the ``compress`` and ``decompress`` methods: :: def compress(self): ''' Computes the run-length encoding on the internal text buffer. Only consider runs of length greater than or equal to the class constant RUN_COUNT (i.e., 2) Update the run_lengths attribute when computing the runs for text data. Output: A string of the compressed data ''' def decompress(self): ''' Restores the uncompressed string (i.e, original data) from the encoded text data buffer. Use the run_lengths property to retrieve the run_lengths for the buffer. Output: The restored original data as a string ''' The following is an example of using the same ``RLECompressor`` object to compress text data and then uses the ``decompress`` method to restore the original data. :: # Perform compression compressor = RLECompressor() compressor.read("WWWWWWa111BB") encoded_str = compressor.compress() encoded_run_lengths = compressor.run_lengths print(encoded_str) # prints WWa11BB # Perform decompression compressor.clear() compressor.read(encoded_str) compressor.run_lengths = encoded_run_lengths print(compressor.decompress()) # prints "WWWWWWa111BB" Note that when compressing a file using run-length encoding *both* the run lengths and compressed data are saved to the file. When performing decompression, the run lengths are read in from the file, followed by the compressed data. This data is then passed to a compression object (i.e., in our case ``RLECompressor``) to perform decompression. WordCompressor ~~~~~~~~~~~~~~ The ``WordCompressor`` class is a compressor that contains an implementation of the repeated word algorithm. This class **must** inherit from the ``Compressor`` class and contain the following constructor: :: def __init__(self, delimiter): ''' Constructor a compressor with the given delimiter and sets the encoded dictionary to be initially empty. ''' The ``WordCompressor`` class **must** define a property called ``encoded_dict``. This property represents the encodings for the original words in the text data. For example, :: compressor = WordCompressor(":") compressor.read("If you fall I will catch you, I will be waiting") print(compressor.encoded_dict) # prints {} encoded_str = compressor.compress() print(encoded_str) # prints "If you fall :\x03 :\x04 catch you, :\x03 :\x04 be waiting\n" print(compressor.encoded_dict) # {':\x00': 'I', ':\x01': 'will'} Note you may not actually see the encodings (e.g. ``:\x03``) inside the text when using ``print``. The repeated word algorithm is implemented inside the ``compress`` and ``decompress`` methods: :: def compress(self): ''' Compresses the text buffer using the repeated word algorithm. Update the encoded_dict attribute with the encodings for the repeated words Outputs: A string of the encoded buffer text ''' def decompress(self): ''' Restores the uncompressed string (i.e, original data) from the encoded text data buffer. Use the encoded_dict property to retrieve the encodings for the original words. Output: The restored original data as a string ''' The following is an example of using the same ``WordCompressor`` object to compress text data and then uses the ``decompress`` method to restore the original data: :: # Perform compression compressor = WordCompressor(":") compressor.read("If you fall I will catch you, I will be waiting") encoded_str = compressor.compress() # "If you fall :\x03 :\x04 catch you, :\x03 :\x04 be waiting\n" encoded_dict = compressor.encoded_dict # {':\x00': 'I', ':\x01': 'will'} # Perform decompression compressor.clear() compressor.read(encoded_str) compressor.encoded_dict = encoded_dict print(compressor.decompress()) # prints "If you fall I will catch you, I will be waiting" **Testing Compressor, RLECompressor, and WordCompressor** We have provided test code for this task in ``test_compressors.py``. Task 2: File Archives --------------------- A file archiver is a program compresses a number of files together into one output file (i.e., an archive file). File archivers use a lossless data compression algorithm to reduce the size of the archive file. For this task, you will determine the best compressor to use when trying to generate an archived file. We have seeded your ``pa6`` directory with a file named ``archiver.py``. This file includes the ``ArchivedFile`` class that represents an uncompressed archived file. Your task is to implement ``find_best_compressor``: :: def find_best_compressor(archived_files, compressors): ''' Updates each compressor attribute inside the list of archived files to the compressor with the greatest compression ratio inside the list of compressors. The function does not update the compressor attribute if all the compressors inside the list of compressors produces a compression ratio greater than the compressor already assigned to the archived file. Inputs: archived_files ([ArchivedFile]) : a list of archived files compressors ([Compressors]): a list of compressors ''' This function does not return anything. The ``find_best_compressor`` searches through the ``ArchivedFile`` objects inside the ``archived_files`` list and reassigns the ``compressor`` attribute to a compressor inside the ``compressors`` list if the current ``compressor`` does not produce the greatest compression ratio. The compression of an ``ArchivedFile`` object is done by reading in and combining the text data of all files assigned to the ``files`` attribute. If two compressors produce the same compression ratio then choose the compressor that comes first inside the ``compressors`` lists. If the current ``compressor`` attributes is greater than or equal to one of the compressors inside the ``compressors`` list then *do not* modify the ``compressor`` attribute. **Testing Archiver** We have provided test code for this task in ``test_archiver.py``. Grading ------- Programming assignments will be graded according to a general rubric. Specifically, we will assign points for completeness, correctness, design, and style. (For more details on the categories, see our `PA Rubric page <../rubric.html>`__.) The exact weights for each category will vary from one assignment to another. For this assignment, the weights will be: * **Completeness:** 65% * **Correctness:** 10% * **Design:** 15% * **Style:** 10% Obtaining your test score ~~~~~~~~~~~~~~~~~~~~~~~~~ The completeness part of your score will be determined using automated tests. To get your score for the automated tests, simply run the following from the **Linux command-line**. (Remember to leave out the ``$`` prompt when you type the command.) :: $ py.test $ ../common/grader.py Notice that we're running ``py.test`` without the ``-k`` or ``-x`` options: we want it to run *all* the tests. If you're still failing some tests, and don't want to see the output from all the failed tests, you can add the ``--tb=no`` option when running ``py.test``:: $ py.test --tb=no $ ../common/grader.py Take into account that the ``grader.py`` program will look at the results of the last time you ran ``py.test`` so, if you make any changes to your code, you need to make sure to re-run ``py.test``. You can also just run ``py.test`` followed by the grader on one line by running this:: $ py.test --tb=no; ../common/grader.py Continuous Integration ---------------------- Continuous Integration (CI) is available for this assignment. For more details, please see our `Continuous Integration <../../ci/index.html>`__ page. We strongly encourage you to use CI in this and subsequent assignments. Submission ---------- To submit your assignment, make sure that you have: - put your name at the top of your files, - registered for the assignment using chisubmit, - added, committed, and pushed your code to the git server, and - run the chisubmit submission command. .. code:: $ chisubmit student assignment register pa6 $ git add compressor.py $ git commit -m "final version of PA #6 ready for submission" $ git push $ chisubmit student assignment submit pa6