Wednesday 13 September 2017

Binary Search Tree Algorithms


Just developed a reasonably well-commented code for algorithms involving Binary Search Trees. You can find it on Github with the following link:





It includes basic algorithms like insert, search and delete. Also including slightly advanced algorithms like height-balancing the tree in O(n) and post-order, pre-order and in-order traversals. Since some interview questions include slightly funky questions like "bottom-top view of BST", I am planning to include few of those as well. As always, all code is documented and the time and space complexity are included wherever possible. Any doubts/comments/questions/suggestions, please feel free to reach out in the comment section.

Sunday 8 May 2016

Translate Numbers to Their English Phrases


Problem Statement:
                    Write a program that takes an integer between 0-99,999 as input and output its English phrase.
For example :

             Input: 77890
             Output: Seventy Seven Thousand, Eight Hundred, and Ninety

Breaking down the problem:
              Such programs are commonly used in NLP to communicate with a user. They rarely use complex data structures or the programs are rarely complicated by themselves. They mostly are a chain of if-else conditions that handle particular cases. However, the sharpness of the programmer is tested here in the sense that he/she should handle all the possible inputs.

              If one observes above output, the commas and the word "and" are not associated with any digit in the number. However, according to the problem statement, they are very important. Such elements of the program, which need to be added for the sake of making the output "look better" or "closer to English" are often the main hurdles.
             
              Hence, there are 2 main sub-problems here. The first is to interpret the digits and convert them into their English words (taking care of their place in the number like the unnits place, tens place, etc.) and second is to take into account all the commas and the word "and" along with their placement.

Approach:
              The main hint here is the range of inputs - from 0 to 99,999. If one observes carefully, the last two digits and the first two digits always make one phrase in English. In the above input, the last digits "90" form Ninety and the first two digits "77" form Seventy Seven (along with thousand - their place value, but thats easy to predict). Hence, these will be considered in pairs. The third digit i.e. the digit in the hundred's place, is always alone. There is always a "Seven Hundred" or a "Three Hundred". There is never a "Seventy Three Hundred". Hence, logically, this is to be considered alone.

               Our first step should be to teach our program the English words, starting with the basics. The program needs to know that 0 becomes "Zero" and 1 becomes "One" and so on. Additionally, it also needs to know that 2 in tens place (or in the higher position whenever pairs are considered by the above logic) becomes "Twenty", 3 in similar position becomes "Thirty" and so on.

                English is a funny language. It is widely regarded as the toughest language for a program to learn, because of its sudden and seemingly illogical variations. For example, 11 has to become "Eleven" and not "Onety One". There is no reason why it cannot be "Onety One" for the sake of symmetry. But, such is life, and hence our program has to now learn about the phrases for the numbers from 11-19.

Trivia: The oldest Indian language, Sanskrit, is widely regarded as the easiest language for a program to learn owing to its symmetry.

                The next step is to implement the above approach. This is easy enough to do with a nested if-else tree. However, identifying the boundary conditions and all the cases and accomodating them was a pain to the author himself. For example, it is easy to overlook a small flaw in the above approach. If we take digits in pairs and check for words for them in an array, it becomes difficult for numbers with 0 as the lower number. For example, 77 becomes "Seventy Seven" by 70 is just "Seventy" and not "Seventy Zero". Hence, we need to make the 0th index of our array "unitsPlace" as an empty string. However, another issue now arises - how to handle if the input itself is just 0? This can be handles by adding yet another condition which checks if the input is 0, if it is - it simply prints "Zero" and exits. Any other 0s that might come will be translated to the null string.

                  The final step for ", and" is added to check if a phrase is not left hanging with those words as its last. There needs to be some word after "and" or else the program has some error.

Sunday 1 May 2016

Cosine Similarity


Description:
              Cosine similarity is a metric to get an idea of how close any 2 entities are based on the difference between their corresponding dimensions. It is intuitively opposite to a distance calculation metric like, say, Euclidean Distance or Manhattan Distance. While such metrics calculate how different two vectors are, cosine similarity measures how similar two vectors are. Entities with high similarity will yield high results when plugged in to the cosine similarity function.

              The important feature that makes cosine similarity so popular inspite of being a crude, basic form of similarity measure is that it is independent of the number of dimensions present. If the vectors are two-dimensional, say x and y, cosine similarity can be determined over those two dimensions. If the vectors have more than two-dimensions, cosine similarity still works exactly the same way, just traversing over each dimension. Moreover, cosine similarity will work over vectors with different number of dimensions as well. For example, if one vector has x and y as its two dimensions, and other has only x, cosine similarity considers y to be 0 for the latter and works normally. This is practically correct, since any n-dimensional vector will not be present in the (n+1)th dimension and hence will have a value 0 in the (n+1)th dimension.

The Formula:

                         \text{similarity} = \cos(\theta) = {\mathbf{A} \cdot \mathbf{B} \over \|\mathbf{A}\| \|\mathbf{B}\|} = \frac{ \sum\limits_{i=1}^{n}{A_i  B_i} }{ \sqrt{\sum\limits_{i=1}^{n}{A_i^2}}  \sqrt{\sum\limits_{i=1}^{n}{B_i^2}} }
                                                     Image courtesy: Wikipedia

Where Ai and Bi are the dimensional components of A and B respectively

Range and Analysis:
                     The range of cosine similarity, like that of cosine function, is [-1, +1] inclusive. A value of -1 means that the two entities are completely opposite of one another. A value of +1 means they are exactly similar. A value of 0 means that they are "perpendicular" to one another i.e. they exist in different planes altogether.


Uses:
                  Cosine similarity is widely used in data mining techniques to determine the similarity between two vectors. I recently used this metric in a machine learning project to calculate similarity between users. Each user had Age, Latitude and Longitude of their address as their component dimensions. This was done so that we can predict book ratings for users. For example, if user U has rated a book B as 5, and user V is similar to U, then user V will also rate the book B as 5. For interested users, I also employed uncertainty principle. This is just a modification to the above case. using uncertainty, if user V is very similar to user U (who has given B 5 rating) and very very similar to, say user W (who has given B 8 rating), then we can say that it is more uncertain that V will give a rating around 5 than around 8. This also means that V is more probable to give a rating close to 8 than 5.

Conclusion:
                      Other similarity metrics like Jaccard index and Tanimoto score also exist, but they are mostly rooted in set theory. Meaning, they are designed to deal with sets more than vectors or raw data. Arguably, Jaccard and Tanimoto can be applied to any vector as well, but the inherent operations that need to be performed are intersection and union (AND and OR for Tanimoto). Hence, we recommend cosine similarity (either original or modified as per the problem) for raw vector data.

Friday 15 April 2016

Triplets of Ones

Problem Statement:

There is a String s which has a few (maybe none) '1's in it. Get all triplets of such '1's which are equidistant to each other. Output the triplet with the maximum distance 'k' between them.

Solution:

We first need to find the maximum value of distance 'k' between triplets of '1's present in the String. Obviously, the maximum possible value of k will be half the string length. So we will iterate over the value of k from s.length()/2 down to 1. We then iterate over the string to check for '1's and checking whether the characters at distances of k and 2k are also '1's.

Clearly, there are 2 variables whose iterations will determine the time complexity of the problem viz. 'k' and the string itself which needs to be traversed. Thus, one solution could be to decide on a value of 'k' in the outer loop and then traverse the string to see if this 'k' contains triplets of one.

Algorithm1{
              for k = string.length/2 down to 1 :
                   for i = 0 up to string.length-2k :
                       if string[i]=string[i+k]=string[i+2k]=1 :
                             output k;

}

Another solution could be to decide on an index in the string and then traverse over the string using different values of 'k'.

Algorithm2{

              for i = 0 upto string.length-2 :
                     for k = string.length/2 down to 1 :
                          if i+2k < string.length AND string[i]=string[i+k]=string[i+2k]=1 :
                               maxK = k; break;
              output maxK;
}

Both the solutions are asymptotically equally fast and have a time complexity of O(n2) in the average case.

However, if one observes carefully, the former method might yield faster results in cases where the answer is closer to the length of the string. This is because 'k' is decremented steadily in the former algorithm and is oputputted as soon as the condition is satisfied. This results in lesser conditional checks and in turn lesser array elements accesses leading to lesser running time.This fact, however, will not be reflected in the asymptotic performance of the algorithm, since asymptotically, it will remain O(n2) owing to the nested for loops.

Below is the Java code for the former algorithm:
 

Thursday 31 March 2016

Mapping in Java

Mapping:
                 The basic idea behind mapping is to establish a relation between two or more seemingly unrelated set of data structures. A common example of mapping is when a word string is mapped to its integer equivalent based on the ASCII values of it's characters. At a first glance, it might make little sense to develop a relationship between a string and an integer. However, mapping is very useful in programming and is, in fact, very common. 
                The main attraction to mapping is the reduced search time. While a linear search requires O(n) in the average case, search in mapping takes O(n) in worst case, only if the hashing function is not carefully chosen.


A mapping overview - buckets (ranges/numbers) mapped to keys (Strings)
(Image courtesy - Wikipedia)

Types of Mapping and Data Structures in Java:

1) One-to-one Map - java.util.hashmap:
                    HashMap is the most commonly used data structure that maps an object with a key. Both the object and the key can be of any data type, provided that the data type extends Object. Thus, primitive data types like int, float, double, etc. are not used as object and keys. Their corresponding wrapper classes like Integer, Float, Double, etc., however, can be used since they all extend Object.
                    HashMap maps any object using a key. Both the object and the key can be determined by the programmer, provided that the key is unique. Every key in the HashMap maps to one object only. Thus, this is a one-to-one mapping kind of data structure.
                     Already hashed objects can be retrieved from a key by the Map.get() method. Map.put() assigns an object to a key.

2) One-to-one bidirectional Map - org.apache.commons.collections.BidiMap:
                      One shortcoming of java.util.HashMap is that one cannot retrieve a key from its value. This is logical in the sense that the value is derived from the key, and hence the value should depend on the key and not the other way around. If, we try to search a key from a value, we may need to traverse the entire HashMap, which takes O(n) and defeats the purpose of a HashMap.
                       However, there may be times when we need to establish a bidirectional relationship between 2 objects whilst preserving the fast searching of a HashMap. Fortunately, BidiMap becomes useful in such cases. Note that in the case of BidiMap, the words "key" and "value" do not hold their semantic value. Since, both depend on each other, neither is a key. Both are simply values, strictly speaking.
                       BidiMap gives a convenient function getKey() which returns the key for the value argument. Both the key and value can be anything that extends the Object class. Again, "key" and "values" should not be taken literally.


3) One-to-many Map using java.util.hashmap:
                         Although generally used for one-to-one mapping, we can implement one-to-many mapping using java.util.hashmap. As mentioned earlier, the value can be anything that extends Object, we can use a List, or an ArrayList, or any other Collection for that matter as value, thus making the map one-to-many.
                          One key will thus map to one list of values. These values may now be arranged in any order as possible. If there is a specific order in the values, then an ArrayList is preferred, since it provides arbitrary access via indexes. In any case, the primary point of consideration here should be that the values must be arranged in a way that the main advantage of using a Map i.e. O(1) average search time should not be compromised to a large extent.

Traversing through a Map in Java:
                           There are a few ways to traverse through a map in Java, but the one we suggest is using an Iterator via a Map.Entry of the map. After assigning an iterator to a map, it traverses through the map via its next() method. A Map.Entry will create an Entry object from which the key and the value of the Entry can be easily obtained via the getKey() and getValue() methods. Both HashMap and BiDiMap can be traversed in the same way.

 

Friday 11 December 2015

The N Transmitters Problem

The Problem:
                 There is a one-dimensional road of infinite length, which extends from -∞ to +∞. There are "n' one-dimensional transistors placed on the road at various points. Each transistor emits a signal. For every transistor, the signal can be transmitted across a region "r" which may differ from transmitter to transmitter. Since the road is one-dimensional, by radius "r", we mean that the signal can be received up to "r" distance to its left and "r" distance to its right.
                 There is a group of one-dimensional beings that wish to stay on this one-dimensional road. A region on this road is said to be habitable if and only if it can get at least "k" signals from "k" different transmitters. Given the road and the transmitters and their signal radius and the value of "k", we need to find out all the regions that can be considered habitable.

Assumptions: 
                 1) There is no concept of signal strength in our world. A signal will have full strength up to its radius "r" and beyond that, it will have zero strength.
                 2) We have considered, without the loss of generality in the algorithm, the coordinates and the ranges of the transmitters to be purely integers. There will be no change in the algorithm in case we take them as floats, or doubles.
                 3) There will not be more than one transmitter on a point. If that is the case, the transmitter whose signal can travel farthest will be considered.

Brute Force Approach:
                The brute force approach would be to keep a set of each n transmitters and the regions to which each can provide signals. In this case, the required answer would be the intersection of any k sets of the n total sets just formed. The time complexity of this approach is O(nCk)

O(nLogn) Approach:

                 Algorithm Axiom: Solutions become much more efficient when the data is arranged in some particular order.
                Imagine us being a part of the one-dimensional world mentioned above. We have an infinite road in front of us. Beginning our search from somewhere in the middle is difficult because we do not know about the transmitters to our left and right. So we can make no sound decision about which direction to start looking for habitable regions in, If however, say, someone tells us that the region we are currently in is the left-most end of the range of the left-most transmitter, we know for a fact that there can be no habitable regions to our left. This makes our next step clear - to go to the right in the search of habitable regions.
                 Coming back to our algorithm, if we sort the input transmitters in the non-decreasing order of their coordinates. If we start from the first transmitter, we will traverse the list of transmitters in one direction, along with keeping a track of whether the region we are currently in habitable or not. The only question that now remains is how to keep track of habitable regions. It is impossible to keep a count of all the signals available at every point on the road, as there are infinite points.
                  Point to be noted: All habitable regions will begin from a transmitter's left-most signal limit and end at (possible another) transmitter's right-most signal limit.
                  Our task is simplified if we take into consideration the above point. To identify a region, we only require its start and end points, and these points will only be the let and right limits of some transmitters. Hence, instead of analyzing all points, we analyze only those points where a transmitters range ends or begins. There are maximum "2n" such points for "n" transmitters.

Algorithm NTransmitters (transmittersList, Integer k):
1) sortAscending(transmittersList) on their coordinates;

2) beginsEnds := list of all the left-limits and right-limits of all the transmitter signals;
3) Traverse only begin and end lists from beginsEnds[0]; //beginsEnds[0] will hold the left-most end of                                                                                           the road where a signal is being received. The                                                                                           traversal will not include all the points, just the                                                                                          "begin" and "end" points of the list.
4) If beginsEnds[i] = "begin":
               count := count+1
    Else:
               count := count-1   
5) If count >= k :
               output region as habitable; 

Following is a well-commented Java code for the above algorithm. The only other noteworthy thing is that we have maintained a TreeMap instead of the usual HashMap. TreeMap extends HashMap, but sorts the data on their key value. This is needed so that we start from the left-most begin point. (it will be the one with the least coordinate value). Also, we have maintained 2 separate lists of begin and end to optimize memory requirements, As always, anyone is free to ask any doubts or share any comments in the comments section.