Alphabetic String Calculation (Lintcode)
Problem
In the world of the moon, ordinary people will solve a problem. This problem is a string of English letters. Each time you can delete an English letter. Try to delete as few times as possible to make the various letters’ frequency appear differently. The number of deletion is the answer of the calculation that needs to be returned.
Thinking about the problem
Quite intuitively, we should straight away think of histogram of the letters and their frequency. The question gave the example of the string “aaabbb”. Below diagram put them in histogram form. Immediately, we could understand that to ensure the two letters “a” and “b” here to have different frequencies, we could either delete one “a” or one “b”. Hence, the answer of the least number of deletion is one.
Corner case: Zero Frequency
In the next example “abcd”, you are told that you could delete each of the three letters “a”, “b”, “c” once. You should be able to sense that to keep to the spirit of the question, zero frequencies of letters in the alphabetic string are allowed. Why (you may ask)? Well, at zero frequency, the letter does not appear. So this is the corner case to note.
Comparison of Letters and Their Frequencies
Preparatory phase: To delete the letters, we need to compare their frequencies. We do so by first lining up the characters from highest frequencies to lower ones. A good data structure to “line” up the frequencies is the priority queue. The frequencies of each of the characters are first fed into the priority queue. To get the frequencies of each character, we first go through each characters of the alphabetic string and store their frequencies in a hash-map/dictionary.
Step 1: Then we take the values of the dictionary representing the frequency and insert them into the priority queue. In Python, you can use the “heapq” queue which is implemented as a minimum priority queue.
Step 2: Next, we pop the highest frequency from the priority queue and call this the reference frequency. Then we pop another frequency and call it the current frequency. We compare the two frequencies.
Step 3A: If the current frequency is not equal to the reference, we don’t have to delete the frequency of the current frequency. Instead, we replace the reference frequency with this current frequency. Then we pop the next frequency from the priority queue and refer to it as the new current frequency and compare the frequencies (repeat steps 3A or 3B).
Step 3B: If however the current frequency is equal to the reference frequency, we delete the “character” by subtracting the current frequency by 1. Record and accumulate the deletion count. If the adjusted current frequency does not decrease to zero, we push this adjusted current frequency back into the priority queue, else we are done with this adjusted current frequency. Repeat Step 2.
Further Notes: (1) By pushing the non-zero adjusted current frequency back into the priority queue, we are allowing further “deletion, if necessary” to take place in the aforementioned algorithm steps. (2) Alphabet characters of zeroed adjusted current frequency already doesn’t “appear”, so we can ignore them.
To aid visualisation and understanding, you can see the example of one more alphabetic string “aaabbccd”.
Python Code for Reference