This section is a continuation of Sequential Lookup and Dichotomy. This section is relatively self-contained, so if you only want to learn about hash tables and hash lookups, just read this section. If you want to have a systematic understanding of the searching algorithm, we recommend reading the previous section first.
The first three sections of this article are all basic introductions, there are no specific algorithm questions, if you have a certain understanding of this part of the knowledge, you can skip directly to the fourth part.
As we mentioned earlier, the time complexity of using a sequential lookup method to confirm the existence of an element in an unsorted array iso(n)
。It's basically similar to a brute force lookup, there is no special skill to speak of, and the specific time depends on luck, but on average, it is linear with the size of the array. And if the array is sorted, we can use a binary lookup with a time complexity ofo(log(n))
。But sorting itself comes at a cost. As we'll see in the next few sections, sorting is quite an expensive process. Therefore, it is very uneconomical to sort the array just for the purpose of making it easier to find it.
So is there a search method that does not require the array to be sorted well, but can also achieve a high search efficiency?
As the title suggests, this approach exists.
For the two lookups we talked about earlier, given an arbitrary value, such as an integer 4, it is possible to exist anywhere in the array. For example, for an array of length 8a[8]
,4 This value may not exist, or it may be stored ina[0]
a[7]
in between. This makes it difficult for us to find it: we don't know where it might be beforehand, so we need to use all sorts of stupid methods to enumerate, probe and narrow it down.
And this gives us a new idea: if we can establish a relationship between a value and where it might be stored, then when we want to find a value, we just need to glance at the possible locations. If not, then that element is not in the array. In this way, we can complete the lookup without having to iterate through the entire array.
Let's take the simplest example: let's say we have an array of size 5a[5]
Used to store integers, the array is empty at first.
At this point, a new value, 38, comes in, which needs to be stored. This time we didn't put it on a first-come, first-served basisa[0]
Instead, we agree on a rule in advance: the incoming value is stored at a position equal to the remainder of this number divided by 5. 。So we put 38 in
a[3]
A few more numbers came later, respectively, we follow the rule of taking the remainder of 5 and put them separately
a[1] a[0] a[4] a[2]
The storage task is complete.
What should I do if I want to find it?Let's say we want to see if 4 is in an array, we don't have to iterate through the entire array. We know that the position where each number is stored is equal to the remainder of its division by 5, so we just need to follow this rulea[4]
Just take a look there. Yes is yes, no is no. Time complexity, mind you, iso(1)
!This means that, in an ideal world, no matter how big the array is, we can just knock it out because we know the rules of storageaThe corresponding door, just open this door we can judge the result. The time loss of this process is constant and independent of array size.
This is the simplest hash table. among othersx%5
It's its hash function, and through the hash function, we map each incoming value to where it should go. This makes it very convenient for us to find it: now that we know where it is going, we just need to go to this location and take a look.
The above is an ideal rudimentary hash table. You've probably found a lot of problems:
If two numbers come in, the remainder divided by 5 is equal, for examplewith
How to allocate rooms at this time?
Is dividing by 5 a whim, and why not divide by 6?
Well, convenience doesn't come without a price. One was choseno(1)
way, it is necessary to pack and bear all the problems it brings.
The biggest problem with the hash functions described above is that they cause collisions. When the remainder of two numbers to 5 is equal, they compete for the same position. Therefore, how to map each incoming number to a unique location is a big problem in designing a hash function.
One possible solution would be to make the remainder divisor very large, so that conflicts can be avoided to some extent. Suppose that if the divisor is set to 10000000, then the whole number within 10000000 can get a unique value by taking the remainder of it, and they will all be assigned to different rooms, so there will be no conflict.
But at the cost, you have to prepare 10,000,000 rooms accordingly. This is a huge challenge for memory. Unless we have a lot of elements, you can imagine that this memory utilization is extremely low.
Another way is to disassemble a large number, for example, if we want to store a ** number 135-234-232-10. We can take it apart in a trinity and sum it up:。In this case, if the array size is 12, then the remainder is 11.
Another commonly used method is the so-called mid-square method. Assuming the number we want to store is 35, find it squared, take the middle two digits to get 22, and then take the remainder according to the size of the array.
If we are not storing an integer, but rather a character, then we can also get the value of the value by getting the ASCII code and then design the hash function.
In conclusion, there are many common ways to design hash functions, but there is no systematic one-size-fits-all approach. The main points here are:
Try to avoid conflicts, try to improve the utilization of space, and design different hash functions according to the nature of the storage objectFor example, if the numbers we store are all multiples of 5 (such as the score of a tennis game, then it is not appropriate to use the remainder) the design of some hash functions can be referred to here. Basically, no matter how cleverly designed the hash function is, conflicts are always inevitable to some extent. So, what if there is a conflict?If the array is full, and there are people in each position, then of course there is no other way. This is true even for a general array, so I won't discuss it here. The case to consider is that if there are still empty spaces in the array, then the natural choice is to find a suboptimal position for the new element.
The question is how to find it
We useempty
to indicate that there is no one in an empty slot of the array. Suppose a surplus of 7 is taken, and an array currently looks like this:
a[7] = [35, empty, empty, 3, 11, 19, 34];
If we were to insert a new element, 14, since the remainder is 0, it would be supposed to be put in the first place, but since 35 is already there. We're going to find another location.
One way is to store nearby, if there is already someone in the location with a remainder of 0, save it to the place with a remainder of 1, if it still doesn't work, save it to the next location, and so on. This method is called linear probing.
The advantage of this approach is that it is simple, but the potential problem is that if there are many conflicting values in a certain position, all elements will be postponed except for the one that occupies first. Note that this linear probing is ***. Like the example above, 7 is saveda[1]
, which also makes those numbers with a remainder of 1 have to find another position, which can have a chain effect: ina[0]
In the vicinity of this block, all the numbers are not where they should be.
One way to do this is to move a little bit more than one position at a time to make sure the entire array is evenly distributed. This way, it doesn't get tangled up in one corner (clustering). So, instead of adding one, we can also add three, add five. This process is expressed in the following ways:
new_position = (original_positon + offset) %array_size
offset
It's the value we add each time, and if it doesn't work once, we add it again until we find an empty space. It is important to note here that we need to make sure that every position in the array has a chance to be accessed, otherwise some empty slots will never be used, which will actually reduce the utilization of space. A simple way to do this is to make sure that the size of the array is a prime number (prime number).
Another way is to set a fixed one with itoffset
We can change this value, for example, at the beginning by adding one to find a vacant position, if it doesn't work, add 4, if it doesn't work, add 9, 16, 25...
Another way to deal with conflicts is chain addressing. This means that instead of storing a specific value, the original array puts a map (pointer) in each room that points to the set of elements that result in the same remainder. This sounds complicated, but let's illustrate it with the following diagram (divide by 5):
The first line is used to store the pointer, and if a corresponding element comes, put it into the corresponding chain. The advantage of this approach is that we can know for sure which chain an element is in, for example, unlike heuristics, we don't have to worry about numbers with a remainder of 2 running to 3. However, the shortcomings are also conceivable, as the chain length increases, the difficulty of finding also increases, and the storage of the map itself also occupies a certain amount of memory.
So which of these two methods is better?This issue will not be discussed here, if you are interested, you can refer to here. Example 1A startup has always adhered to the concept of small but beautiful, and the number of employees has been kept to less than 11. Design a data type to store each employee's work number and salary, and require that when a certain employee number is given, it can be used in theo(1)
(or return: the employee number does not exist). The format of the job number is:yyyymmdd
, the date of entry, such as, wages are measured in dollars, such as:
。The specific required implementation classes are as follows:
class employee info: def init (self,size): Define the size of the **, and two items: the work number and the salary selfsize = size self.id = [none]*self.size self.salary = [none]*self.size def set(self,id,salary): This function is used to enter data, enter the employee number and salary, and save it in the data table def get(self,id): This function returns the salary of the employee with the employee ID def hash function(self,id): Define the hash function def collision resolve(self, origin, offset): Define the conflict resolution, whichever approach can be used here
(High-energy warning ahead: The following is the reference *** It is recommended that you write one yourself).
class employee_info: def __init__(self,size): self.size = size self.id = [none]*self.size self.salary = [none]*self.size def set(self,id,salary): Find the hash value corresponding to the job number, and the remainder method used here is the employee number %selfsize hash_value = self.hash function(id) If there is no one in the corresponding room, it's very simple, just assign a value to the corresponding position if selfid[hash_value] == none: self.id[hash_value] = id self.salary[hash value] = salary If there are people in the room, conflict resolution should be carried out else: time = 1 This is used to test the address, until there is a vacancy while selfid[hash_value] != none: Quadratic probing offset = time**2 is used to determine the next tentative room through the conflict resolution function hash value = selfcollision resolve(hash value,offset) time += 1 to selfid[hash_value] = id self.salary[hash value] = salary def get(self,id): Or find the corresponding hash value = selfhash_function(id) if self.id[hash_value] == none: return 'The job number does not exist!'Otherwise, start looking for it, if it goes well, find it once, otherwise you have to follow the possible position of quadratic probing until the value of the work number matches time = 1 while selfid[hash_value] != id: offset = time*2 hash_value = self.collision resolve(hash value,offset) time += 1 is found, return the corresponding salary return selfsalary[hash value] def hash function(self, key): remainder return key%self.size def collision resolve(self, origin, offset): Conflict resolution return (origin+offset)%self.size
According to the above note, it should not be difficult to understand the various functions, basically, the difficulty is in the input data, if the hash value corresponds to the oneid[hash_value]
Inside stillnone
, it is relatively simple, directly assign, otherwise you have to find one according to the addressing methodnone
Until.
We enter a set of data to perform a test:
t = employee_info(11)t.set(20131225,180000)t.set(20140825,90000)t.set(20141110,112300)t.set(20141121,112300)print t.idprint t.salaryprint t.get(20131225)print t.get(20111220)
The outputs are:
[20141110, 20140825, none, none, 20131225, 20141121, none, none, none, none, none, none, none][112300, 90000, none, none, 180000, 112300, none, none, none, none, none] 180000 does not exist
Worth noting here is the fourth set of data we entered
, the remainder of the work number to 11 is equal to 0, which should be placed first, but the front has been
Occupied, so according to quadratic probing, it can only find the next one, but the next one is also
occupy, so it can only be setoffset=2*2=4
and run straight toid[5]
This time there was no one, so the data was stored there.
The above implementation is a minimized class for illustrative purposes only. Even if it's just for Xi practice, there are still some basic things to consider.
XiIn the ** above, suppose someone raises their salary, and the salary of the employee with job number 20140824 rises from 96,000 to 112,200, how do we update it at this time?with existing onesset
Does the function solve the problem?If not, modifyset
function makes it possible to meet this need. Or add one moreupdate
Function.
For example, if someone leaves the company and the data is naturally deleted, then should one be defined?del
Functions like that?How does this work?
Also, if we define a table of 11 and we enter 12 pieces of data, what happens at this time?How to modify the ** above to ensure that the entered data is not more than the size of the data table?
These questions are left for everyone, and if you write the corresponding ** (no matter what language it is), please remember to share it with me!
Python comes with itselfdict
This data type can be used handy to store data such as key-value. In the example below, we'll use it. Of course, you can also use it instead of it, but slightly change the hash table you wrote in the previous section, and you can also use it to solve the following problems. dict
The usage is very simple, you can see here for details. Example 2Given an array and a target value, find the two numbers from the array so that their sum equals the target value, returning the subscript of the two numbers in the array (from the beginning).
Like, givenwith
, then two subscripts 1 and 3 are returned, note that the subscripts here start from 1.
This is question 1 of leetcode and is of medium difficulty. The assumption here is that the answer must exist and is unique. A naturally stupid way to do this is that every time we poll for a number, we iterate through the other elements of the array to see if there are any elements that can be matched. But in this case, time complexity can be reached in the worst-case scenarioo(n^2)
。A better approach would be to harness the power of hash tables and reduce complexity by quickly finding the presence or absence of an element. So a possible flow would be like this:1Use a hash table to store the elements and subscripts of the array; 2.Iterate through the newly created hash table, starting with the first element to see if there is another element that can be matched, until you find it. Since we have hash tables, the second step of the process is at worst time complexo(n)
Thinking about it further, the first step and the second step can actually be combined: each time an element comes in, we first see if the hash table already has a matching element, and if so, we can directly return the result. If not, store the new element in a hash table. If you think of it this way, it could be something like this:
num is the array, target is the target value def two sum(num, target): Create a new python dict data type d = {}for i in range(len(num)):if target-num[i] in d: If the matching value is already in the hash table, then just return two subscripts return i+1, d[target-num[i]] else: Take the numeric value as the key, and the following is marked as the value, so that we can quickly find d[num[i]] = i + 1 by the numeric value
Example 3Given a string of strings, find the longest substring length without any repetitive characters. For example,abcabcbb
, 3 is returned.
This is question 3 of leetcode and is of medium difficulty. It is not difficult to see that this is another one that we need to judge frequentlyif a is in b
The title. In general, hash tables are likely to come in handy in this case. The basic idea is that since it's a substring, we need to have two pointers (or labeled variables) to identify the start and end of the substring. The basic process is that we iterate through the entire string, and if we encounter new characters and other duplicates in the current substring, we stop and look at the current string length, compared to the current longest length, if it is longer than the longest, we update it, otherwise we continue to go down.
def lengthoflongestsubstring(self, s): if s == '': If it's an empty string, don't bother return 0 slist = list(s) Define a hash table to record the closest position where a character appears dict = {} Record the longest length max len = 0 Record the start subscript for each substring start = 0 for i in range(len(slist)): Iterate through all strings if slist[i] in dict: The current character has appeared at least once. But if it's not in the current string, don't care, if it's in the current string Compare the length of the current substring with max len which length is if(dict[slist[i]] = start): max len = max(max len, i-start) Reset the start of the new substring Note that it can't be set to i+1 here, but should be dict[slist[i]]+1, why?start = dict[slist[i]]+1 if it doesn't exist, record dict[slist[i]] = i max len = max(max len,len(slist)-start) return max len
One of the difficulties here is how to reset the position of start when a duplicate character is found. Consider this example:abcdefgfhijkrtewop
, it's not hard to see that the longest substring isgfhijkrtewop
, but note that the first time we found the duplicate element was in the secondf
, and at this time found a duplicatef
What we should do is set start to the previous onef
behind, not the current onef
of the back. This should be noted.