Statistical regeneration and scalable clustering of big data using MapReduce in the Hadoop ecosystem : a case study of competence management in the computer science career

Bohlouli, Mahdi

Citation Link: https://nbn-resolving.org/urn:nbn:de:hbz:467-10628

Publication Type

Doctoral Thesis

Author

Bohlouli, Mahdi

Institute

Institut für Wissensbasierte Systeme und Wissensmanagement

Subjects

Big Data

MapReduce

Evolutionary Algorithms

Statistical Analysis

Clustering

DDC

004 Informatik

GHBS-Classes

QAT

TUH

TVUK

Issue Date

2016

Abstract

Any adaptive analysis of domain-specific data demands fully generic, sophisticated, and customizable methods. A mathematical representation and model of the domain-specific requirements helps achieve this goal. In the era of talent analytics and job knowledge management, such a model should resolve person-job-fit and skill mismatch problems as well as under-qualification concerns about workers and job seekers. The challenge grows for large job centers and enterprises that must perform data-intensive matching of talents against many job positions at the same time. In other words, the model should support the large-scale assignment of best-fit talents with the right expertise to the right positions at the right time. The diversity of the human resource management domain produces large volumes of data. Hence, extending existing approaches to speed up analytical processes is essential.

The main focus of this dissertation is the efficient and scalable modelling, representation and analysis of career knowledge through a hybrid approach based on big data technologies. To this end, three types of data have been prepared through profiling: talent profiles, job profiles and competence development profiles. The work addresses three matching problems: (a) scalable matching of talent profiles with job profiles towards person-job-fit, using evolutionary MapReduce-based K-Means (EMRKM) clustering and the TOPSIS method; (b) matching the competence goals of under-qualified talents, prioritized using the Analytic Hierarchy Process (AHP), with competence development profiles in order to improve the competitiveness of job seekers, using the K-Means and TOPSIS algorithms; (c) matching competence development profiles with job profiles. To evaluate the achievements of this work, the hybrid approach is applied to the computer science academic career.

To this aim, a generic Career Knowledge Representation (CKR) model is proposed to cover the required competences in a wide variety of careers. The CKR model is the basis for setting up profiles and has been evaluated through a careful survey analysis with domain experts. The volume of data collected from the web is so large that any type of analytics demands the use of big data technology. Accordingly, the original data of 200 employees, collected from the web and through assessments, have been statistically analyzed and rescaled to 15 million employee records using the uniform distribution. To find the best-fit employee and thereby resolve the skill mismatch challenge, the talent profiles are first clustered using the EMRKM algorithm. The cluster whose centroid has the smallest Euclidean distance to the desired job profile is regarded as the talent cluster. The talents in this cluster are then ranked using the TOPSIS method in order to select the best-fit candidate. Similar methods are used to recommend competence improvement programs, such as Vocational Education and Training (VET), for under-qualified talents.
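
The selection step described above (pick the cluster whose centroid is nearest to the job profile, then rank its members with TOPSIS) can be illustrated with a minimal sketch. It is not the implementation from the thesis: the function names, the toy competence scores, the criterion weights, and the assumption that every criterion is a benefit criterion (higher is better) are all hypothetical. The centroids are taken as given rather than computed with EMRKM.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equally long score vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def closest_cluster(centroids, job_profile):
    """Index of the centroid with the smallest distance to the job profile."""
    return min(range(len(centroids)),
               key=lambda i: euclidean(centroids[i], job_profile))

def topsis_rank(candidates, weights):
    """Rank candidates by TOPSIS closeness (best first).

    Assumes every criterion is a benefit criterion (hypothetical choice).
    Returns (order, scores), where order lists candidate indices.
    """
    n = len(weights)
    # Vector-normalize each criterion column, then apply the weights.
    norms = [math.sqrt(sum(row[j] ** 2 for row in candidates)) or 1.0
             for j in range(n)]
    weighted = [[weights[j] * row[j] / norms[j] for j in range(n)]
                for row in candidates]
    ideal = [max(col) for col in zip(*weighted)]       # best value per criterion
    anti_ideal = [min(col) for col in zip(*weighted)]  # worst value per criterion
    scores = []
    for row in weighted:
        d_best = euclidean(row, ideal)
        d_worst = euclidean(row, anti_ideal)
        total = d_best + d_worst
        scores.append(d_worst / total if total else 0.0)
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return order, scores

# Toy data (hypothetical): three cluster centroids and one desired job profile,
# each described by three competence scores.
centroids = [[2, 2, 2], [8, 7, 9], [5, 5, 5]]
job_profile = [8, 8, 8]
talent_cluster = closest_cluster(centroids, job_profile)

# Hypothetical members of the selected cluster and criterion weights.
talents = [[7, 6, 9], [9, 8, 9], [8, 7, 8]]
weights = [0.5, 0.3, 0.2]
order, scores = topsis_rank(talents, weights)
best_fit = order[0]
```

In this toy data the second talent scores highest on every criterion, so it coincides with the TOPSIS ideal solution and is ranked first. With cost-type criteria, the ideal and anti-ideal values for those columns would have to be swapped.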

An analysis of the results shows that 78% of domain experts consider the proposed CKR model beneficial for their industries and are interested in integrating it into their workforce development strategies. The use of the uniform distribution in regenerating the data showed a success rate of 94.27% at the significance level of 0.05 and 97.92% at the significance level of 0.01. The proposed EMRKM algorithm clusters the large-scale data 47 times faster than traditional K-Means clustering and 2.3 times faster than existing MapReduce-based clustering methods such as the one provided in Apache Mahout. Further investigation into metrics for other domains such as nursing, politics and engineering based on the proposed CKR model, as well as the discovery of career data through web crawling, will extend this work. In addition, novel text mining methods for discovering job knowledge from large volumes of streamed social media data, web and digital sources, and linked open data will improve the quality of the data in talent profiles and enrich the proposed approach.

URN

nbn:de:hbz:467-10628

URI

https://dspace.ub.uni-siegen.de/handle/ubsi/1062

License

https://dspace.ub.uni-siegen.de/static/license.txt
