
Recursive join processing in big data environment


Abstract

- A join operation is one of the most common operations appearing in many data queries.
- In particular, a recursive join is a join type used to query hierarchical data, but it is considerably more complex and costly.
- The evaluation of a recursive join in MapReduce consists of several iterations, each comprising two tasks: a join task and an incremental computation task.
- In this study, we thus propose a simple but efficient approach for Big recursive joins based on reducing by half the number of the required iterations in the Spark environment.
- This improvement leads to significantly reducing the number of the required tasks as well as the amount of the intermediate data generated and transferred over the network.
- Our experimental results show that the improved recursive join is more efficient and faster than the traditional one on large-scale datasets.
- Keywords: recursive join.
- Managing, storing, querying, and processing Big Data has been and still is an issue of concern in order to take advantage of the value of the existing data sources.
- The MapReduce model has become one of the most popular big data processing models today [12].
- One of the most common operations involved in data analysis and data retrieval is a join query.
- A recursive join is a complex operation with a huge cost and it is not directly supported by the MapReduce model.
- This paper is an extended and edited version of a report selected from those presented at the 13th National Conference on Fundamental and Applied Information Technology Research (FAIR'13), Nha Trang University, Khanh Hoa.
- [28] proposed an optimization of the Semi-Naive algorithm for recursive queries in Hadoop MapReduce environment.
- In Hadoop, the implementation of the Semi-Naive algorithm requires three MapReduce task groups: one for joining tuples (Join), one for calculating and deleting duplicate tuples (Projection), and one for combining the result with the previous results (Difference).
- A recent study [24] has proposed a solution to optimize recursive joins in the Spark environment.
- The research team improved the Semi-Naive recursive join by using an Intersection Bloom Filter to remove redundant data that does not participate in the join operation, significantly reducing the costs involved.
- In the studies mentioned above, the authors take advantage of iterative processing, cache, and filter mechanisms to be able to optimize recursive join.
- These studies focus on redundant data filtering or applying Spark’s utilities to support recursive join without directly improving the Semi-Naive algorithm.
- In the Semi-Naive algorithm for recursive join, at each iteration we perform a join on dataset K, and this dataset is unchanged.
- Thus, if we perform one more join with dataset K within the same iteration, the number of iterations of the algorithm is halved.
- Section 3 provides our proposal to effectively process recursive join in Big Data environment.
- The conclusion of the paper is presented in Section 5.
- The core of Apache Hadoop consists of the storage unit, called the Hadoop Distributed File System (HDFS), and the processing unit, the MapReduce processing model.
- That is the reason users often install Spark on top of the Hadoop platform to use Hadoop’s HDFS.
- The choice of k is determined by the intended false positive rate f of the filter.
- In contrast, if any of the bits is 0, then z is definitely not in the set.
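A minimal Bloom filter sketch can make these membership guarantees concrete. The sizing formulas used below, m = −n·ln f/(ln 2)² and k = (m/n)·ln 2, are the textbook choices for a target false-positive rate f; the paper's exact equation is not reproduced here, so treat this as an illustrative stand-in:

```python
import hashlib
import math

class BloomFilter:
    """Minimal Bloom filter sketch: m bits, k hash functions."""

    def __init__(self, n_items, f_rate):
        # Textbook sizing: m = -n*ln(f)/ln(2)^2, k = (m/n)*ln(2).
        self.m = max(1, int(-n_items * math.log(f_rate) / math.log(2) ** 2))
        self.k = max(1, int(round(self.m / n_items * math.log(2))))
        self.bits = [0] * self.m

    def _positions(self, item):
        # Derive k bit positions from one digest (double hashing).
        h = hashlib.sha256(str(item).encode()).digest()
        h1 = int.from_bytes(h[:8], "big")
        h2 = int.from_bytes(h[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def might_contain(self, item):
        # If any bit is 0, the item is definitely not in the set;
        # if all bits are 1, it is only *probably* in the set.
        return all(self.bits[p] for p in self._positions(item))

bf = BloomFilter(n_items=1000, f_rate=0.01)
bf.add("tuple-42")
print(bf.might_contain("tuple-42"))   # True (no false negatives)
```

Because added items always set all of their k bits, a Bloom filter never produces false negatives, which is why it is safe to use for discarding non-joining tuples.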
- The key-value pairs of the same key will be returned to the same Reducer to produce the results for that key with a Reduce function.
- To illustrate the join operation in the MapReduce environment, we consider a join processing of two datasets R(k1, v) and S(w, k2), joined on k1 = k2.
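Such a reduce-side (repartition) join can be simulated in a few lines. This is a single-machine sketch assuming the join condition k1 = k2; the function and variable names are illustrative, not from the paper:

```python
from collections import defaultdict
from itertools import product

def reduce_side_join(R, S):
    """Simulate a reduce-side join of R(k1, v) and S(w, k2) on k1 == k2."""
    # Map phase: tag each tuple with its source and emit (join_key, payload).
    mapped = [(k1, ("R", v)) for (k1, v) in R] + \
             [(k2, ("S", w)) for (w, k2) in S]
    # Shuffle phase: key-value pairs of the same key go to the same group.
    groups = defaultdict(lambda: {"R": [], "S": []})
    for key, (tag, payload) in mapped:
        groups[key][tag].append(payload)
    # Reduce phase: each key's reducer emits the local cross product.
    return [(key, v, w)
            for key, g in groups.items()
            for v, w in product(g["R"], g["S"])]

R = [(1, "a"), (2, "b")]
S = [("x", 1), ("y", 1)]
print(reduce_side_join(R, S))
# [(1, 'a', 'x'), (1, 'a', 'y')]
```

The grouping step mirrors the shuffle: all pairs sharing a key end up at the same reducer, which then produces the joined results for that key.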
- There are some types of join queries such as two-way join, multi-way join, or recursive join.
- Of the three types of join above, recursive join is a fairly complex query with a large cost and is applied in many fields that require repeated calculations such as PageRank [17], Graph mining [27], and Friend graph [20, 26] in social networks.
- Recursive join is also known as “fix-point” join, with repeated operations until the iteration no longer produces new results.
- A good example of a recursive join is in a friend-of-friend query for a person in social networks.
- Given an initial v × v logical matrix of elements a_ij on a v-node graph, a_ij is 1 if there is an arc from node i to node j, and 0 otherwise.
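As a small worked example, the boolean product of this adjacency matrix with itself marks exactly the node pairs connected by a two-arc path, which is the building block of transitive closure computation:

```python
def bool_mult(A, B):
    # Boolean matrix product: C[i][j] = OR_k (A[i][k] AND B[k][j]),
    # so (A*A)[i][j] = 1 iff there is a 2-step path i -> k -> j.
    n = len(A)
    return [[int(any(A[i][k] and B[k][j] for k in range(n)))
             for j in range(n)] for i in range(n)]

# Arcs 0 -> 1 and 1 -> 2 on a 3-node graph.
A = [[0, 1, 0],
     [0, 0, 1],
     [0, 0, 0]]
print(bool_mult(A, A))  # [[0, 0, 1], [0, 0, 0], [0, 0, 0]]
```

Repeating the product and OR-ing the results accumulates paths of every length, which is what the iterative algorithms below compute tuple-by-tuple.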
- The second type is the iterative algorithm developed in the context of recursive queries.
- Initialize F with K; F will contain all tuples in the transitive closure of the graph by the end of the algorithm.
- These pairs cannot be detected in the previous loop because the shortest path between them is of length n + 1.
- At the end of iteration n, each tuple in ∆F represents a new tuple discovered in the transitive closure.
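The loop just described can be sketched in a few lines. This is a minimal in-memory sketch of semi-naive transitive closure, not the distributed implementation; K is simply a set of arcs (i, j):

```python
def transitive_closure(K):
    """Semi-naive evaluation: only the tuples discovered in the last
    iteration (delta) are joined with K, so old work is never repeated."""
    F = set(K)          # accumulated transitive closure
    delta = set(K)      # tuples discovered in the last iteration
    while delta:
        # Join the new tuples with K: (i, j) ⋈ (j, k) -> (i, k).
        joined = {(i, k) for (i, j) in delta for (j2, k) in K if j == j2}
        # Keep only genuinely new tuples (the "difference" task).
        delta = joined - F
        F |= delta
    return F

K = {(1, 2), (2, 3), (3, 4)}
print(sorted(transitive_closure(K)))
# [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
```

Each pass of the while-loop corresponds to one MapReduce iteration: the set comprehension is the join task, and the difference against F is the incremental computation task.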
- 3. RECURSIVE JOIN ALGORITHM ON LARGE DATASETS
- 3.1. Recursive join in MapReduce
- Semi-Naive is a popular linear transitive closure algorithm and is suitable for deployment in the MapReduce environment.
- The Semi-Naive algorithm for recursive join (briefly called RJ) in the MapReduce environment can be represented as follows.
- Recursive join with Semi-Naive - RJ: F_0 = K, ΔF_0 = K, i = 1.
- Semi-Naive algorithm for recursive join in MapReduce will be performed by two task groups.
- The join tasks perform the join operation on the tuples, and the incremental computation tasks remove the duplicated tuples found in the previous iterations.
- In line (3) of the algorithm presented above, we can see that dataset K is constant in each iteration.
- The most important job in the Semi-Naive algorithm is to join dataset K and dataset ΔF.
- In other words, by performing a three-way join with dataset K within the same iteration, the number of required iterations is halved.
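A minimal in-memory sketch of this idea follows: each iteration performs the three-way join ΔF ⋈ K ⋈ K, so one pass covers what two passes of the classic semi-naive loop would discover (illustrative only, not the distributed Spark implementation):

```python
def closure_prj(K):
    """Sketch of the PRJ idea: join the new tuples with K *twice*
    per iteration (a three-way join delta ⋈ K ⋈ K), so one iteration
    here covers two iterations of the classic algorithm."""
    F = set(K)
    delta = set(K)
    iterations = 0
    while delta:
        # First join: paths extended by one arc of K.
        step1 = {(i, k) for (i, j) in delta for (j2, k) in K if j == j2}
        # Second join in the SAME iteration: extended by one more arc.
        step2 = {(i, k) for (i, j) in step1 for (j2, k) in K if j == j2}
        delta = (step1 | step2) - F
        F |= delta
        iterations += 1
    return F, iterations

K = {(1, 2), (2, 3), (3, 4), (4, 5)}
F, n = closure_prj(K)
print(n)  # 3 iterations on this 5-node path, vs 4 for the classic loop
```

Because both joins reuse the unchanged dataset K, no extra input is shuffled beyond the newly discovered tuples, which is what makes the saving worthwhile.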
- Proposed approach for recursive join
- The proposed approach using the Semi-Naive algorithm for recursive join (briefly called PRJ) in the MapReduce environment is represented as follows.
- Below, we illustrate our proposal for recursive join.
- The results of this iteration will be used to compute a cartesian product with K in the next iteration.
- Proposed Recursive Join with Semi-Naive - PRJ.
- The results of the recursive join with the RJ approach in each iteration are provided in Table 3.
- At each iteration, the join results will be used to join with two original input datasets K in the next iteration.
- The results of the recursive join with the PRJ approach in each iteration are provided in Table 4.
- It is clear that the number of iterations of the proposed algorithm is halved in comparison with the original one.
- In other words, one iteration of this algorithm will be equivalent to two iterations of the original Semi-Naive algorithm for recursive join.
- In fact, recursive join of dataset K is to calculate the transitive closure for graph K.
- With the PRJ approach, the new tuples found in the transitive closure in one iteration equal those found in two iterations of the RJ approach.
- Thus, after about n/2 iterations, all tuples in the transitive closure of K are found with the PRJ approach.
- Equation (2) shows the relation between the iteration counts of the two approaches.
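Concretely, if the RJ approach needs $n_{\mathrm{RJ}}$ iterations, the relation presumably takes the form (a reconstruction from the surrounding discussion, not the paper's exact statement):

```latex
n_{\mathrm{PRJ}} \;=\; \left\lceil \frac{n_{\mathrm{RJ}}}{2} \right\rceil
```

The ceiling accounts for an odd number of RJ iterations, where the final PRJ iteration only has one "productive" join step left to perform.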
- The key-value pairs of R and T are sent to the rows and columns in the Reducers matrix, and a key-value pair of S is sent to a reducer to perform join processing..
- In the study [23], the authors compared the one-round and cascaded approaches for three-way joins in MapReduce.
- The size of the intermediate data is expressed in terms of the number of Reducers r and the size |R| of dataset R, on the order of (|R|·α)².
- This conclusion is made by considering the size of the intermediate dataset generated by each of the two solutions.
- The study [23] has shown that the amount of intermediate data generated by the one-round three-way join is smaller than the other; we therefore use this method to optimize the recursive join operation.
- To avoid the limitation of sending a large number of key-value pairs to rows and columns of the reducers matrix, we will calculate the positions required to send data and save in partitionTable..
- When generating the key-value pairs of the R and T datasets for the rows or columns of the reducers matrix, we first check which positions of the matrix to transmit data to, based on the partitionTable (illustrated in the figure).
- This significantly limits the number of key-value pairs emitted.
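The partitionTable idea might be sketched as follows for a one-round three-way join R(a, b) ⋈ S(b, c) ⋈ T(c, d) on a √r × √r reducer matrix. The layout, names, and hash choices here are assumptions for illustration, not the paper's actual design:

```python
# Hypothetical partitionTable sketch: S tuples each land on exactly one
# cell of the reducer matrix; R and T tuples would normally be replicated
# to a whole row or column, so we record the occupied cells and send
# R/T tuples only where a matching S tuple can actually be.
def build_partition_table(S, side):
    # Record which matrix cells actually receive an S tuple.
    return {(hash(b) % side, hash(c) % side) for (b, c) in S}

def targets_for_R(b, table, side):
    # Without the table, R(a, b) goes to every cell of row hash(b) % side;
    # with it, only to cells of that row that hold a matching S tuple.
    row = hash(b) % side
    return [(row, col) for col in range(side) if (row, col) in table]

side = 4                      # a 4 x 4 matrix of 16 reducers
S = [(1, 7), (2, 9)]
table = build_partition_table(S, side)
print(targets_for_R(1, table, side))   # [(1, 3)] -- one cell, not a row of 4
```

The saving grows with the matrix: the fuller a row is of empty cells, the more replicated key-value pairs the table lets us skip.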
- Figure: Using partitionTable to send tuples to reducers in the one-round three-way join.
- The pseudocode of our proposal for recursive join is shown below.
- In addition, to reduce the amount of unnecessary data involved in the join operation, we built two Bloom filters for filtering redundant data.
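A toy sketch of such pre-filtering is shown below: before shuffling, each tuple's join key is probed against a filter built from the other side's keys, and tuples that cannot possibly join are dropped. The filter here is a simplified hash-based stand-in, not the paper's actual Bloom filter construction:

```python
def make_filter(keys, m=1024, k=3):
    """Tiny stand-in Bloom filter over the join keys of one dataset
    (illustrative only; a real deployment would use proper hash functions)."""
    bits = [0] * m
    for key in keys:
        for i in range(k):
            bits[hash((i, key)) % m] = 1
    return bits

def might_join(bits, key, m=1024, k=3):
    # All k probed bits set -> the key *may* appear on the other side.
    return all(bits[hash((i, key)) % m] for i in range(k))

K = [(1, 2), (2, 3), (3, 4)]
delta = [(0, 1), (5, 9), (7, 2)]
# Drop delta tuples whose second component matches no first component of K:
# such tuples can never contribute to the join delta ⋈ K.
f = make_filter(j for (j, _) in K)
pruned = [t for t in delta if might_join(f, t[1])]
print(pruned)  # (5, 9) is almost surely dropped: no tuple in K starts with 9
```

Because the filter has no false negatives, pruning is always safe; the rare false positive only means a non-joining tuple is shuffled anyway, never that a result is lost.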
- Experiments were conducted to evaluate the two approaches, RJ and PRJ, for recursive join.
- The number of iterations corresponding to the two approaches was recorded.
- The use of the one-round three-way join in the Semi-Naive algorithm for recursive join greatly reduces the execution time.
- However, with a small dataset, the processing speeds of the two approaches are nearly the same.
- Figure 8 shows the amount of intermediate data to be transmitted over the network for a recursive join of the two approaches.
- The decrease in the number of iterations reduces the number of MapReduce jobs, which in turn reduces the amount of intermediate data.
- Besides, the filters help considerably in reducing redundant data that does not participate in the join operation, further optimizing the recursive join.
- The study has fully analyzed the recursive join in the big data processing environment with MapReduce and proposed important improvements to significantly reduce the costs involved.
- To avoid generating an excessive number of key-value pairs sent to entire rows and columns of the reducers matrix, we construct a partitionTable that partially reduces the amount of unnecessary data.
- Besides, the filters are also used to remove redundant data that does not participate in the join operation.
- In brief, this study has come up with a new approach to effectively optimize recursive join in MapReduce environment.
- The experiments show the effectiveness of improvement for Semi-Naive in recursive join in MapReduce.
- Ullman, “Transitive closure and recursive datalog implemented on clusters,” in Proceedings of the 15th International Conference on Extending Database Technology, ser.
- Ullman, “Map-reduce extensions and recursive queries,” in Proceedings of the 14th International Conference on Extending Database Technology, 2011, pp.
- Ullman, “Optimizing joins in a map-reduce environment,” in Proceedings of the 13th International Conference on Extending Database Technology, ser.
- Jagadish, “Direct algorithms for computing the transitive closure of database relations,” in Proceedings of the 13th International Conference on Very Large Data Bases, ser.
- Bloom, “Space/time trade-offs in hash coding with allowable errors,” Communications of the ACM, vol.
- Ghemawat, “Mapreduce: simplified data processing on large clusters,” Communications of the ACM, vol.
- Luo, “Theory and network applications of dynamic bloom filters,” in Proceedings of IEEE INFOCOM 2006.
- Ioannidis, “On the computation of the transitive closure of relational operators,” in Proceedings of the 12th International Conference on Very Large Data Bases, ser.
- Venkatesh, “Three-way joins on mapreduce: An experimental study,” in IISA 2014, The 5th International Conference on Information, Intelligence, Systems and Applications, July 2014, pp.
- Rigaux, “Toward intersection filter-based optimization for joins in mapreduce,” in Proceedings of the 2nd International Workshop on Cloud Intelligence, ser.
- Rigaux, “A theoretical and experimental comparison of filter-based equijoins in mapreduce,” in Transactions on Large-Scale Data- and Knowledge-Centered Systems XXV - Volume 9620.
- Trieu, “Efficient processing of recursive joins on large-scale datasets in spark,” in Advanced Computational Methods for Knowledge Engineering, H.
- Lam, “Socialite: Datalog extensions for efficient social network analysis,” in 2013 IEEE 29th International Conference on Data Engineering (ICDE), 2013, pp.
- Suciu, “Optimizing large-scale semi-naïve datalog evaluation in hadoop,” in Datalog in Academia and Industry, P.
- Warren, “A modification of warshall’s algorithm for the transitive closure of binary relations,” Communications of the ACM, vol.
