# Презентация на тему: ADVANCED ALGORITHMS & DATA STRUCTURES                                       1/39
Средняя оценка: 4.4/5 (всего оценок: 11)
Код скопирован в буфер обмена
1

## Первый слайд презентации: ADVANCED ALGORITHMS & DATA STRUCTURES

Lecturer: Karimzhan Nurlan Berlibekuly nkarimzhan@gmail.com Lecture-13 Tries String algorithms Suffix trees

2

## Слайд 2: Tries

Tries are an extremely special and useful data-structure that are based on the  prefix of a string. They are used to represent the “Re trie val” of data and thus the name Trie.

3

## Слайд 3: Tries

Prefix : What is prefix: The prefix of a string is nothing but any  n  letters  n≤|S|  that can be considered beginning strictly from the starting of a string. For example, the word “ abacaba ” has the following prefixes: a ab aba abac abaca abacab

4

## Слайд 4: Tries

A Trie is a special data structure used to store strings that can be visualized like a graph. It consists of nodes and edges. Each node consists of at max 26 children and edges connect each parent node to its children. These 26 pointers are nothing but pointers for each of the 26 letters of the English alphabet A separate edge is maintained for every edge. Strings are stored in a top to bottom manner on the basis of their prefix in a trie. All prefixes of length 1 are stored at until level 1, all prefixes of length 2 are sorted at until level 2 and so on.

5

## Слайд 5: Tries

For example, consider the following diagram :

6

## Слайд 6: Tries

The pseudo code for insertion of a string into a tire would look as follows: void insert ( String s) { for ( every char in string s) { if ( child node belonging to current char is null ) { child node = new Node (); } current_node = child_node ; } }

7

## Слайд 7: Tries

The pseudo code to check whether a single word exists in a dictionary of words or not is as follows: boolean check(String s) { for (every char in String s) { if (child node is null ) { return false ; } } return true ; }

8

## Слайд 8: Tries

// C++ implementation of search and insert // operations on Trie #include <bits/ stdc ++.h> using namespace std ; const int ALPHABET_SIZE = 26; // trie node struct TrieNode { struct TrieNode *children[ALPHABET_SIZE]; // isEndOfWord is true if the node represents // end of a word bool isEndOfWord ; };

9

## Слайд 9: Tries

// Returns new trie node (initialized to NULLs) struct TrieNode * getNode (void) { struct TrieNode * pNode = new TrieNode ; pNode -> isEndOfWord = false; for ( int i = 0; i < ALPHABET_SIZE; i ++) pNode ->children[ i ] = NULL; return pNode ; }

10

## Слайд 10: Tries

// If not present, inserts key into trie // If the key is prefix of trie node, just // marks leaf node void insert( struct TrieNode *root, string key) { struct TrieNode * pCrawl = root; for ( int i = 0; i < key.length (); i ++) { int index = key[ i ] - 'a'; if (! pCrawl ->children[index]) pCrawl ->children[index] = getNode (); pCrawl = pCrawl ->children[index]; } // mark last node as leaf pCrawl -> isEndOfWord = true; }

11

## Слайд 11: Tries

// Returns true if key presents in trie, else // false bool search( struct TrieNode *root, string key) { struct TrieNode * pCrawl = root; for ( int i = 0; i < key.length (); i ++) { int index = key[ i ] - 'a'; if (! pCrawl ->children[index]) return false; pCrawl = pCrawl ->children[index]; } return ( pCrawl != NULL && pCrawl -> isEndOfWord ); }

12

## Слайд 12: Tries

// Driver int main() { // Input keys (use only 'a' through 'z' // and lower case) string keys[] = {"the", "a", "there", "answer", "any", "by", "bye", "their" }; int n = sizeof (keys)/ sizeof (keys); struct TrieNode *root = getNode (); // Construct trie for ( int i = 0; i < n; i ++) insert(root, keys[ i ]); // Search for different keys search(root, "the")? cout << "Yes\n" : cout << "No\n"; search(root, "these")? cout << "Yes\n" : cout << "No\n"; return 0; } Output : the --- Present in trie these --- Not present in trie their --- Present in trie thaw --- Not present in trie

13

## Слайд 13: String Algorithms

Basics of String Manipulation

14

## Слайд 14: String Algorithms

Basics of String Manipulation A string is a sequence of characters. In other words, a string is an array of character data type. An instance of a string is called a string literal. For instance in C++: string s = " HackerEarth "; s is a string literal. String Manipulation is a class of problems where a user is asked to process a given string and use / change its data. An example question would be a great way to understand the problems that are usually classified under this category. Given a string S of length N, shift each character of the string by K positions to the right, where K <= N. For example : Say S = " hacker " and K = 2. Here N = 6. Shifting each character in S by 2 positions to the right would result into erhack. Note that S i.e. 'h' is moved by 2 positions to the S. Also, S i.e. 'r', which is the last character in S comes round-about back to S as there is no space for 'r' to go beyond the limits of string length.

15

## Слайд 15: String Algorithms

Basics of String Manipulation Approach: Declare another auxillary string  shiftedS  that is of the same size as S. Copy  ith  element of S to the ( K+i ) th  position in  shifted_S. This means,  shifted_S [ i+K ]=S[ i ] where 0≤i<N. Make sure that  i+K  never exceeds N, because that will try to access a memory location which is not declared in  shifted_S. There's a simple trick to ensure that - use ( i+K ) mod N  .

16

## Слайд 16: String Algorithms

Basics of String Manipulation void shiftByK (char S[], char shiftedS [], int N, int K) { // Iterate through the length of given string for( int i =0; i <N; i ++) { // Find the index for this current character in shiftedS [] int idx = ( i+K ) % N; // Copy that character at the found index idx shiftedS [ idx ] = S[ i ]; } // Add a NULL character to mark the end of string shiftedS [N] = '\0'; } Every character array in C/C++ ends with a '\0' (NULL) character. It marks the end of the string. If it is not added in the end, then the code may produce garbage characters after the string.

17

## Слайд 17: String Algorithms

String Searching String Searching by KMP algorithm (Knuth Morris Pratt algorithm ) Motivation Problem:  Given 2 strings P(pattern) and T(text), find the number of occurrences of P in T. Basic / Brute Force solution: One obvious and easy to code solution that comes to mind is this: For each index of T, take it as a starting point and find if T i,i+1,..., i +|P|−1  is equal to P.

18

## Слайд 18: String Algorithms

String Searching for i = 0 to length(T)-length(P): Found = true for j = i to i + length(P) - 1: if P[j- i ] not equal to T[j]: Found = False if Found = True answer = answer + 1 This brute force takes  O(|P|⋅|T|)  time in the worst case, which is obviously too slow for large strings.

19

## Слайд 19: String Algorithms

Z Algorithm Z algorithm  is a linear time string matching algorithm which runs in  O(n)  complexity. It is used to find all occurrence of a pattern  P  in a string  S, which is common string searching problem.

20

## Слайд 20: String Algorithms

Z Algorithm Given a string S of length n, the Z Algorithm produces an array Z where Z[ i ] is the length of the longest substring starting from S[ i ] which is also a prefix of S, i.e. the maximum k such that S[j] = S[ i  + j] for all 0 ≤ j < k. Note that Z[ i ] = 0 means that S ≠ S[ i ]. For easier terminology, let's refer to substrings which are also a prefix as prefix-substrings. The algorithm relies on a single, crucial invariant. Iterate over the letters in the string (index  i  from 1 to n−1) and maintain an interval [L, R] which is the interval with maximum R such that 1≤ L ≤  i  ≤ R and S[L...R] is a prefix-substring (if no such interval exists, just let L = R =  − 1). For  i  = 1, simply compute L and R by comparing S[0...] to S[1...]. Z is obtained during this process.

21

## Слайд 21: String Algorithms

Z Algorithm Now, suppose the correct interval [L, R] for i−1 and all of the Z values up to i−1. Compute Z[ i ] and the new [L, R] by the following steps: If  i  > R, then there does not exist a prefix-substring of S that starts before  i  and ends at or after  i. If such a substring existed, [L, R] would have been the interval for that substring rather than its current value. Thus "reset" and compute a new [L, R] by comparing S[0...] to S[ i...] and get Z[ i ] at the same time (Z[ i ] = R − L + 1). Otherwise,  i  ≤ R, so the current [L, R] extends at least to  i. Let k =  i −L. It is known that Z[ i ] ≥ min(Z[k], R −  i  + 1) because S[ i...] matches S[k...] for at least R −  i  + 1 characters (they are in the [L, R] interval which is known to be a prefix-substring). If Z[k] < R −  i  + 1, then there is no longer prefix-substring starting at S[ i ] (or else Z[k] would be larger), meaning Z[ i ] = Z[k] and [L, R] stays the same. The latter is true because [L, R] only changes if there is a prefix-substring starting at S[ i ] that extends beyond R, which is not the case here. If Z[k] ≥ R −  i  + 1, then it is possible for S[ i...] to match S[0...] for more than R −  i  + 1 characters (i.e. past position R). Thus, there's a need to update [L, R] by setting L =  i  and matching from S[R + 1] forward to obtain the new R. Again, Z[ i ] is obtained during this process. The process computes all of the Z values in a single pass over the string, so we're done. Correctness is inherent in the algorithm and is pretty intuitively clear.

22

## Слайд 22: String Algorithms

Z Algorithm Time Complexity The algorithm runs in  O(n) time. Characters are never compared at positions less than  R, and every time a match is found,  R  is increased by one, so there are at most  n  comparisons. Implementation Note that the optimization  L = R =  i  is used when  S ≠ S[ i ] ( it doesn't affect the algorithm since at the next iteration  i  > R  regardless).

23

## Слайд 23: Z Algorithm

int L = 0, R = 0; for (int i = 1; i < n; i++) { if (i > R) { L = R = i; while (R < n && s[R-L] == s[R]) { R++; } z[i] = R-L; R--; } else { int k = i-L; if (z[k] < R-i+1) { z[i] = z[k]; } else { L = i; while (R < n && s[R-L] == s[R]) { R++; } z[i] = R-L; R--; } } }

24

## Слайд 24: Z Algorithm

Applications One application of the Z Algorithm is for the standard string matching problem of finding matches for a pattern  T  of length  m  in a string  S  of length  n. This can be done in  O(n + m)  time by using the Z Algorithm on the string  T Φ S  (that is, concatenating  T,  Φ, and  S ) where  Φ  is a character that matches nothing. The indices  i  with  Z[ i ] = m  correspond to matches of  T  in  S.

25

26

## Слайд 26: Suffix Trees

Pre-requisite : Trie Suffix tree is a compressed trie of all the suffixes of a given string. Suffix trees help in solving a lot of string related problems like pattern matching, finding distinct substrings in a given string, finding longest palindrome etc.

27

## Слайд 27: Suffix Trees

Before going to suffix tree, let's first try to understand what a compressed trie is. Consider the following set of strings: {"banana"," nabd "," bcdef "," bcfeg "," aaaaaa "," aaabaa " } A standard trie for the above set of strings will look like:

28

## Слайд 28: Suffix Trees

And a compressed trie for the given set of strings will look like:

29

## Слайд 29: Suffix Trees

Before going to construction of suffix trees, there is one more thing that should be understood, Implicit Suffix Tree. In Implicit suffix trees, there are atmost   N  leaves, while in normal one there should be exactly  N  leaves. The reason for atmost   N  leaves is one suffix being prefix of another suffix. Following example will make it clear. Consider the string " banana " Implicit Suffix Tree for the above string is shown in image below:

30

## Слайд 30: Suffix Trees

To avoid getting an Implicit Suffix Tree we append a special character that is not equal to any other character of the string. Suppose we append \$ to the given string then, so the new string is " banana\$ ". Now its suffix tree will be

31

## Слайд 31: Suffix Trees

The pseudo code for the brute force approach is given below: function insert(T, string, start_index, length ){ i = start_index while (i < length ) if T.arr [ string [i]]) is NULL T.arr [ string [i]] = new node () T = T.arr [ string [i]] while (i < length ) T.label = T.label+string [i] i = i+1 endwhile return endif j=0, x=i temp = T.arr [ string [i]] while j < temp.label.length and i < length && temp.label [j] = string [i] i = i+1 j = j+1 endwhile if i = length return endif if j = temp.label.length cnt = 0 for k = 'a' to 'z' if temp.arr [k] = NULL cnt = cnt+1 endif endfor if cnt = 27 while i < length temp.label = temp.label + string [i] i = i+1 endwhile else T = temp endifelse

32

## Слайд 32: Suffix Trees

else newInternalNode = new node() k=0 while k < j newInternalNode.label = newInternalNode.label + temp.label [k] k = k+1 endwhile newInternalNode.arr [string[j]] = new node() while k < temp.label.length newInternalNode.arr [string[j]].label+= temp.label [k] k = k+1 endwhile for k = 'a' to 'z' newInternalNode.arr [string[j]]. arr [k] = temp.arr [k] endfor T.arr [string[x]] = newInternalNode T = T.arr [string[x]] endifelse endwhile } function makeSuffixTree (string, length){ Trie T string = concatenate(string, "{") for i = 0 to length insertInTrie (T, string, i, length) } So as mentioned previously the above code will not be correct choice for large values of  N. Ukkonen's Algorithm comes to the rescue here. Ukkonen's Algorithm constructs the suffix tree in a worst case time complexity of  O(N).

33

## Слайд 33: Ukkonen's Algorithm

Ukkonen's Algorithm divides the process of constructing suffix tree into phases and each phase is further divided into extensions. In  ith  phase  ith  character is introduced in the trie. In  ith  phase, all the suffixes of the string S[1..i] are inserted into the trie, and inserting  jth  suffix in a phase is called  jth  extension of that phase. So, in  ith  phase there are  i  extensions and overall there N such phases, where N is the length of given string. Right now it must look like a  O(N^2 ) task, but the algorithm exploits the fact that these are suffixes of same string and introduces several tricks that bring down the time complexity to O(N).

34

## Слайд 34: Ukkonen's Algorithm

Let's see how to perform the  jth  extension of  ith  phase. In  jth  extension of  ith  phase string S[j... i ] is to be inserted. Before going for phase  i, i−1 phases are already complete, that means we have a trie having all suffixes of string S[1...i−1]. So search for the path of string S[j..i−1] in the trie. Now there are 3 possibilities and each of those correspond to one rule, that has to be followed. The complete string is found and path ends at a leaf node. In that case the  ith  character is appended to last edge label and no new node is created. The complete string is found and path ends in between an edge label, and the next character of edge label is not equal to S[ i ] or the path ends at an internal node. In this case new nodes are created. If the path ends in an internal node then a new leaf node is created, and if the path ends in between an edge label then a new internal node and one new leaf node is created. Complete suffix S[j..i−1] is found and the path ended in between an edge label and the next character of that edge label is equal to  ith  character. In that case do nothing.

35

## Слайд 35: Ukkonen's Algorithm

For every extension we need to find the path of a string S[j... i ] in the trie built in previous phases. If we go with brute force approach time complexity will be O(N3), for that we use suffix links, which are explained below: Suffix Links:  Suppose a string X is present in the trie, and its path from root ends at a node v, and string  aX  is also present in the trie where a is any character, and its path from root ends at a node w, then a link from w to v is called a Suffix Link.

36

## Слайд 36: Ukkonen's Algorithm

Now how does a suffix link help? When we have to perform  jth  extension of phase  i, we have to look for end of path of string S[j...i−1], and in the next phase look for end of path of string S[j+1...i−1], but before coming to phase  i, we have performed i−1 phases, that means we have inserted strings S[j..i−1] and S[j+1...i−1] in the trie. Now clearly S[j...i−1] is nothing but S[j]S[j+1...i−1], so we will have a suffix link from node ending at path S[j..i−1] to node ending at path S[j+1...i−1], so instead of traversing down from root for (j+1) th  extension of  ith  phase, we can make use of the suffix link. Use of suffix link makes processing of a phase an O( numberofnodes ) process, and number of nodes in a compressed trie are of O(N). So right now each phase is done in O(N) and there are N such phases, so overall complexity right now is O(N2).

37

## Слайд 37: Ukkonen's Algorithm

Before going further there is one more problem, and that is the edge labels. If the edge labels are stored as strings space complexity will turn out to be O(N2), no matter the what the number of nodes are. So for that instead of using strings as edge label, use two variables for each label and those will denote the start index and end index of the label in the string. That way each label will take constant space and overall space complexity will be O(N).

38

## Слайд 38: Ukkonen's Algorithm

There are several more tricks that help in bringing down the complexity to linear. In any phase, the extension rules are applied in the order. In first few extensions, rule 1 is applied, in the next few extensions rule 2 is applied and in the rest rule 3 is applied. If in  ith  phase rule 3 is applied in extension j for the first time, then in all the extensions after that i.e. in extensions j+1 to  i, rule 3 will be applied, so its ok to halt a phase as soon as rule 3 starts applying.

39

## Последний слайд презентации: ADVANCED ALGORITHMS & DATA STRUCTURES: Ukkonen's Algorithm

Once a leaf node is created it will always remain a leaf node, only edge label of the edge between itself and its parent, will keep on increasing because of application of rule 1, and also for all the leaf node the end index (discussed earlier) will remain same, so in any phase rule 1 can also be applied in a constant time by maintain a global end index for all the leaf nodes. New leaf nodes are created when rule 2 is applied, and in all the extensions in which rule 2 is applied in any phase i−1, in the next phase  i, rule 1 will be applied in all those extensions. So a maximum of N times rule 2 will be applied as there are N leaves, so this means all the phases can be completed in complexity O(N).