In programming, often we find ourselves working with large sets of data. A lot of this work is reliant on matching up data, and searching the data for specific values. A good understanding of different search techniques is essential in order to minimize processing time.
To start, let’s consider a small data set, which we will call testSet. I will define the data set as follows:
Suppose I want to search this list for a specific value. One strategy might be to do the following:
We call this a linear search. This search will iterate the list one element at a time, looking for the value that is being searched for. If it is found, the index of the element is returned. If it is not found, -1 is returned to indicate the value is not in the list.
Breaking down this algorithm further, we can see that in the worst case we need to look at every single element in the list. So, if our list has n elements in it, we need to search n times, meaning the algorithm efficiency is O(n).
To see how we can do better than this, we introduce the concept of a binary search. In order for a binary search to work, the list must be sorted before we search it. The binary search algorithm looks like this:
To illustrate how this algorithm works, we can look at an example. Suppose we want to find the number 2 in the list below. We start at the middle of the list:
First, we check if the number at the middle is the one we are looking for. It isn’t so we need to decide where we want to search next. Since we started in the middle, and the list is sorted, we know that everything on the left will be less than 4, and everything on the right will be greater than 4. We are looking for 2, so it must be on the left of 4. Knowing this, we can take all the values to the left of 4, and use that as our new list.
With this new list, we go to the middle and repeat the same logic. Of course, the middle element is currently 2, so we found what we are looking for. If it wasn’t we would just apply the same logic as before to make a new list and search again.
As you can see from this small example, it took us less comparisons to find the element compared to the linear search algorithm. To figure out the efficiency, we need to do a little bit of math. Let’s look at our example above, and see how the list size changes on each iteration.
We start with 7 elements. Each time we iterate, we halve the size of the list. In general, for a list of size n, each iteration k will give us a list size of:
The worst case would be where we get down to a single element remaining in the list. If we set the equation equal to 1, we can solve for the worst case efficiency.
So, our efficiency is O(log(n)). If we graph the efficiency of linear search compared to binary search, we will see that binary search always outperforms linear search
A great follow up question is to ask how things compare if we needed to sort the list first. Let’s suppose we use a merge sort to sort our list, which has a worst case of O(n*log(n)). This would give us a total efficiency of
n*log(n)+ log(n). Graphing this, we can see that linear search actual beats binary search.
So then when is binary search a better option than linear search? Of course, if the list comes sorted binary search is always better, but we can also consider cases where we need to search more than once.
If we search the list twice, for instance, linear search takes O(n + n) = O(2*n) time. Binary search takes O(n*log(n) + 2*log(n)) time. Comparing these graphs, we see binary search comes out more efficient.
The key takeaway from this is that multiple searches makes sorting the list and using a binary search worthwhile. To conclude, an understanding of different search techniques can make a large impact on the efficiency of code. Looking at the growth of the graphs, the difference between the two algorithms becomes larger as the data set size increases. It is important to make the best choices for searching in order to optimize efficiency.