PhD dissertation assistance: Constructing Social Knowledge Graph from Twitter Data

Constructing Social Knowledge Graph from Twitter Data

Yue Han Loke

1.1 Introduction

Current technology allows users to post and share their thoughts, images, and content over networks through applications and websites such as Twitter, Facebook and Instagram. As social media has become part of daily life and sharing data has become the norm for the current generation, researchers have started to study the data that can be collected from social media [1] [2]. This research is dedicated solely to Twitter data because of its wealth of publicly available data and its public Stream API. Tweets can be used to discover new knowledge, such as recommendations and relationships, for data analysis. Tweets are short microblog posts of at most 140 characters that can contain anything from normal sentences to hashtags, @ tags, short abbreviations (gtg, 2night), and variant forms of a word (yup, nope). The way tweets are posted shows the noisy, lexically short nature of these texts, which presents a challenge for Twitter data analysis. On the other hand, existing research on entity extraction and entity linking has narrowed the gap between the entities extracted and the relationships that can be discovered. Since 2014, the Named Entity rEcognition and Linking (NEEL) Challenge [3] has demonstrated the significance, for both the research and commercial communities, of automated entity extraction, entity linking and classification over event streams of English tweets, and has encouraged the design and development of systems that cope with the challenging nature of tweets and mine semantics from them.

1.2 Project Aim

This research aims to construct a social knowledge graph (knowledge base) from Twitter data.
A knowledge graph is a technique for analysing social media networks by mapping and measuring both the relationships and the information flows among groups, organizations, and other connected entities in social networks [4]. Several tasks are required to create a knowledge graph from Twitter data. The first is to extract named entities such as persons, organizations, locations, or brands from the tweets [5]. In this research, a named entity referenced in a tweet is defined as a proper noun or acronym that is found in the NEEL Taxonomy in Appendix A of [3] and is linked either to an English DBpedia [6] referent or to a NIL referent. The second component is to link those extracted entities to their respective entities in a knowledge base. For example:

Tweet: The ITEE department is organizing a pizza gettogether at UQ. #awesome

Here ITEE refers to an organization, and UQ refers to an organization as well. The annotations are [ITEE, Organization, NIL1], where NIL1 is the unique NIL referent describing the real-world entity ITEE, which has no equivalent entry in DBpedia, and [UQ, Organization, dbp:University_of_Queensland], which represents the RDF triple (subject, predicate, object).

1.3 Project Goals

The first task is to obtain the tweets. This can be achieved by crawling Twitter data using the Public Stream API[1] available on the Twitter developer website, which allows Twitter data to be extracted in real time. The second task is entity extraction and typing with the aid of a specifically chosen information extraction pipeline called TwitIE[2], which is open-source, specific to social media, and has been tested most extensively on microblog sentences. This pipeline receives the tweets as input and recognises the entities in each tweet.
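As a minimal sketch of the annotation format used in the example of section 1.2, the snippet below stores each annotation as a (mention, type, referent) record. The class name EntityAnnotation, its field names, and the convention that NIL identifiers start with "NIL" are assumptions made for illustration; they are not taken from the NEEL guidelines or from any system discussed here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EntityAnnotation:
    mention: str      # surface form as it appears in the tweet
    entity_type: str  # type from the NEEL taxonomy, e.g. "Organization"
    referent: str     # DBpedia resource, or a NIL identifier such as "NIL1"

    @property
    def is_nil(self) -> bool:
        # Assumed convention: every NIL identifier starts with "NIL".
        return self.referent.startswith("NIL")


tweet = "The ITEE department is organizing a pizza gettogether at UQ. #awesome"
annotations = [
    EntityAnnotation("ITEE", "Organization", "NIL1"),
    EntityAnnotation("UQ", "Organization", "dbp:University_of_Queensland"),
]

# Mentions that were linked to a DBpedia referent rather than to NIL.
linked = [a.mention for a in annotations if not a.is_nil]
```

Keeping the referent as a plain string lets DBpedia resources and NIL identifiers share one field, which is convenient when both kinds of entity later become nodes of the same graph.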
The third task is to link the entities mined from tweets to entities in an available knowledge base. The knowledge base selected for this project is DBpedia. If there is a referent in DBpedia, the extracted entity is linked to that referent, and the entity type is retrieved from the category given by the knowledge base. If no referent is available, a NIL identifier is assigned as shown in section 1.2. This task requires selecting an entity linking system with appropriate candidate entity generation and entity disambiguation, which receives the entities extracted from the same tweet and produces a list of all candidate entities in the knowledge base. The task is then to link each extracted entity accurately to the correct candidate.

The social knowledge graph is an entity-entity graph that combines two extracted sources. The first is the co-occurrence of entities in the same tweet or the same sentence. The second is the existing relationships or categories extracted from DBpedia. The project thus aims to combine the co-occurrence of extracted entities with the extracted relationships to create a social knowledge graph and to unlock new knowledge from the fusion of the two data sources.

Named Entity Recognition (NER) and Information Extraction (IE) are generally well researched for longer text such as newswire. Microblogs, however, are possibly the hardest kind of content to process. For Twitter, the research community has proposed several methods. [7] uses a pipeline approach that first performs tokenisation and POS tagging and then uses topic models to find named entities. [8] proposes a gradient-descent graph-based method for joint text normalisation and recognition, reaching an F1 measure of 83.6%.
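The co-occurrence source of the social knowledge graph described in the project goals can be sketched as follows. This is only an illustration under stated assumptions: the function name cooccurrence_edges, the input layout (one list of linked referents per tweet) and the sample data are hypothetical, and edge weights are taken to be simple counts of tweets in which two entities appear together.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(tweets_entities):
    """Count how often each unordered pair of entities shares a tweet."""
    edges = Counter()
    for entities in tweets_entities:
        # Deduplicate within a tweet and sort so (a, b) and (b, a) share a key.
        for a, b in combinations(sorted(set(entities)), 2):
            edges[(a, b)] += 1
    return edges


# Hypothetical linked entities per tweet (DBpedia referents or NIL identifiers).
tweets_entities = [
    ["dbp:University_of_Queensland", "NIL1"],
    ["dbp:University_of_Queensland", "dbp:Brisbane"],
    ["NIL1", "dbp:University_of_Queensland", "dbp:Brisbane"],
]
edges = cooccurrence_edges(tweets_entities)
```

Edges derived from DBpedia relationships could be kept in a second mapping over the same node identifiers, so that the two sources can be merged or compared edge by edge.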
Besides that, entity linking in knowledge graphs has been studied in [9], which uses a graph-based method that collectively infers the referent entities of all named entities in the same document by modelling and exploiting the global interdependence between entity linking decisions. However, the combination of NER and entity linking on tweets is still a new area of research, since the NEEL challenge was first established in 2013. According to the evaluation of the NEEL challenge in [10], the mention detection strategies used include lexical similarity, which exploits the popularity of entities and applies distance similarity functions to rank entities efficiently, and n-gram [11] features. Conditional Random Forest (CRF) [12] is another entity extraction strategy mentioned. For entity detection, graph distances and various ranking features were used.

2.1. Twitter crawling

[13] documents that the public Twitter Streaming API makes it possible to collect a sample of user tweets. The statuses/filter API provides a constant stream of public tweets, and several optional parameters, such as language and locations, may be specified. Using the method CreateStreamingConnection, a POST request to the API returns the public statuses as a stream. The rate limit of the Streaming API allows each application to submit up to 5,000 Twitter user IDs [13]. According to the documentation, Twitter currently allows the public to retrieve at most a 1% sample of the data posted on Twitter at a specific time. Twitter begins to return sampled data once the number of tweets reaches 1% of all tweets on Twitter. According to the research in [14] comparing the Twitter Streaming API with the Twitter Firehose, the final results of the Streaming API depend strongly on the coverage and on the type of analysis the researcher wishes to perform.
For example, the researchers found that, for a given set of parameters, the coverage of the Streaming API decreases as the number of tweets matching those parameters increases. Thus, if the research concerns filtered content, the Twitter Firehose would be a better choice despite its drawback of restrictive cost. However, since our project requires random sampling of Twitter data with no filter other than the English language, the Twitter Streaming API is an appropriate choice because it is freely available.

2.2. Entity Extraction

[15] proposed an open-source pipeline called TwitIE, which consists of components in GATE [16] dedicated to social media. TwitIE consists of eight parts: tweet import, language identification, tokenisation, gazetteer, sentence splitter, normalisation, part-of-speech tagging, and named entity recognition. Twitter data is delivered from the Twitter Streaming API in JSON format. TwitIE includes a new Format_Twitter plugin in the most recent GATE codebase, which automatically converts tweets in JSON format into GATE documents. This converter is automatically associated with document names that end in .json; otherwise, text/x-json-twitter should be specified. For language identification, TwitIE uses TextCat, a language identification algorithm. Reliable language identification makes it possible to process only those tweets written in English with the English POS tagger and named entity recogniser. Tokenisation handles different characters, class sequences and rules. Since TwitIE deals with microblogs, it treats abbreviations and URLs as one token each, following Ritter's tokenisation scheme. Hashtags and user mentions are each treated as two tokens and are additionally covered by a separate annotation. Normalisation in TwitIE is divided into two tasks: the identification of orthographic errors and the correction of the errors found. The TwitIE normaliser is designed specifically for social media.
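The tokenisation behaviour described above (URLs and abbreviations kept whole, hashtags and user mentions split into a marker and a word) can be illustrated with a small regular expression. This is a rough sketch and not TwitIE's implementation, which is built from GATE tokeniser rules; the pattern, the function name tokenize and the sample tweet are assumptions.

```python
import re

# One alternative per token class; order matters because the first match wins.
TOKEN_RE = re.compile(
    r"https?://\S+"         # a URL is kept as a single token
    r"|(?:[A-Za-z]\.){2,}"  # dotted abbreviations such as "U.S." stay whole
    r"|[#@]"                # the hashtag or mention marker is its own token
    r"|\w+(?:'\w+)?"        # words, optionally containing an apostrophe
    r"|[^\w\s]"             # any other single punctuation character
)


def tokenize(tweet):
    """Split a tweet into tokens according to TOKEN_RE."""
    return TOKEN_RE.findall(tweet)


tokens = tokenize("@UQ_News pizza 2night at UQ! #awesome http://t.co/abc")
```

A separate annotation covering both the marker and the word, as the description above mentions for hashtags, could be produced by recording the character span of each marker together with the token that follows it.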
TwitIE reuses the ANNIE gazetteer lists, which include lists of cities, organisations, days of the week, and so on. TwitIE uses an adapted version of the Stanford part-of-speech tagger, trained on tweets tagged with the Penn TreeBank (PTB) tagset. Combining normalisation, gazetteer name lookup, and the POS tagger increased performance to 86.93%, and this rose further to 90.54% token accuracy when the PTB tagset was used. Named entity recognition in TwitIE shows a +30% absolute precision and a +20% absolute performance increase compared with ANNIE, mainly with respect to Date, Organization and Person.

[7] proposed an innovative approach to distant supervision using topic models that draws on a large number of entities gathered from Freebase and a large amount of unlabelled data. Using the gathered entities, the approach combines information about an entity's context across its mentions. The T-NER POS tagging system, called T-POS, adds new tags for Twitter-specific phenomena such as retweets, usernames, URLs and hashtags. The system uses clustering to group together distributionally similar words in order to handle lexical variations and out-of-vocabulary (OOV) words. T-POS uses Brown clusters and Conditional Random Fields. The combination of the two makes it possible to model strong dependencies between adjacent POS tags and to make use of highly correlated features. The results for T-POS are reported for 4-fold cross-validation over 800 tweets. T-POS outperforms the Stanford tagger, obtaining a 26% reduction in error. In addition, when trained on 102K tokens, the error reduction is 41%. The system includes shallow parsing, which identifies non-recursive phrases such as noun, verb and prepositional phrases in text. T-NER's shallow parsing component, T-CHUNK, performs better at shallow parsing of tweets than the off-the-shelf OpenNLP chunker, with a reported 22% reduction in error.
Another component of T-NER is the capitalization classifier, T-CAP, which analyses a tweet to predict capitalization. Named entity recognition in T-NER is divided into two components: named entity segmentation using T-SEG, and named entity classification using LabeledLDA. T-SEG treats segmentation as a sequence-labelling task and uses IOB encoding to represent segmentations, with Conditional Random Fields for learning and inference. It uses contextual, dictionary and orthographic features; the in-house dictionaries include a set of type lists gathered from Freebase. Additionally, the outputs of T-POS, T-CHUNK and T-CAP, and the Brown clusters, are used to generate features. As stated in the research paper, compared with the state-of-the-art news-trained Stanford Named Entity Recognizer, T-SEG obtains a 52% increase in F1 score.

Tweets lack the context needed to identify the types of the entities they contain, and they contain a very large number of distinctive named entity types. To address these issues, the research paper presented and assessed a distantly supervised approach based on LabeledLDA. This approach models every entity as a mixture of types. This allows information about an entity's distribution over types to be shared across mentions, naturally handling ambiguous entity strings whose mentions could refer to different types. In the empirical experiments, there is a 25% increase in F1 score over the co-training approach to named entity classification suggested by Collins and Singer (1999) when applied to Twitter.

[17] proposed a Twitter-adapted version of Kanopy called Kanopy4Tweets, which interlinks text documents with a knowledge base by using the relations between concepts and their neighbouring graph structure. The system consists of four parts: Named Entity Recogniser (NER), Named Entity Linking (NEL), Named Entity Disambiguation (NED) and NIL Resources Clustering (NRC).
The NER of Kanopy4Tweets uses TwitIE, the Twitter information extraction pipeline mentioned above. For NEL, a DBpedia index is built from a selection of datasets and is used to search for suitable DBpedia resource candidates for each extracted entity. The datasets are stored in a single binary file using the HDT RDF format. This format is compact owing to its binary representation of RDF data, and it allows fast search without decompression. The datasets can be quickly browsed and scanned for a specific subject, predicate or object. For each named entity found by the NER component, a list of resource candidates is retrieved from DBpedia using a top-down strategy. One challenge is that a large volume of resource candidates has a negative impact on the processing time of the disambiguation process. This problem can be mitigated by reducing the number of candidates with a ranking method. The proposed method ranks the candidates by the document score assigned by the indexing engine and selects the top-x elements.

The NED takes as input the list of named entities together with the candidate DBpedia resources produced by the preceding NEL process, and outputs the best candidate resource for each named entity. A relatedness score is calculated from the number of paths between resources, weighted by the exclusivity of the edges of these paths, and is applied to each candidate with respect to the candidate resources of all other entities. The input named entities are jointly disambiguated and linked to the candidate resources with the highest combined relatedness.

NRC is the stage that handles named entities for which no resource in the knowledge base can be linked. Using the Monge-Elkan similarity measure, the first NIL element is assigned to a new cluster, and each subsequent element is then compared with the previous ones.
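The threshold-based grouping of NIL mentions can be sketched as below. Several details are assumptions rather than Kanopy4Tweets behaviour: the inner token similarity uses Python's difflib.SequenceMatcher as a stand-in, a cluster is represented by its first member, and the threshold of 0.8 and the sample mentions are arbitrary.

```python
from difflib import SequenceMatcher


def token_sim(a, b):
    # Inner similarity between two tokens (stand-in choice, see lead-in).
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def monge_elkan(a, b):
    """Average, over the tokens of a, of the best match among the tokens of b."""
    tokens_a, tokens_b = a.split(), b.split()
    if not tokens_a or not tokens_b:
        return 0.0
    best = [max(token_sim(x, y) for y in tokens_b) for x in tokens_a]
    return sum(best) / len(tokens_a)


def cluster_nil_mentions(mentions, threshold=0.8):
    clusters = []  # each cluster holds the mentions sharing one NIL identifier
    for mention in mentions:
        for cluster in clusters:
            # Compare with the cluster's first member (assumed representative).
            if monge_elkan(mention, cluster[0]) >= threshold:
                cluster.append(mention)
                break
        else:
            # No existing cluster was similar enough, so start a new one.
            clusters.append([mention])
    return clusters


clusters = cluster_nil_mentions(["ITEE", "itee", "ITEE department", "pizza night"])
```

Note that Monge-Elkan is asymmetric: "ITEE department" compared with "ITEE" is penalised for the unmatched token "department", so here it starts its own cluster. A symmetric variant or a lower threshold would merge such mentions.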
An element is added to an existing cluster when the similarity between the element and that cluster is above a fixed threshold, whereas a new cluster is formed if no current cluster has a similarity above the threshold.

2.3. Entity Extraction and Entity Linking

[18] proposed a lexicon-based joint entity extraction and entity linking approach, in which n-grams from tweets are mapped to DBpedia entities. A pre-processing stage cleans the initial tweets, classifies part-of-speech tags, and normalises the text by converting alphabetic, numeric, and symbolic Unicode characters to their ASCII equivalents. Tokenisation is performed on non-characters, except for special characters that join compound words. The resulting list of tokens is fed into a shingle filter that constructs token n-grams from the token stream. In the candidate mapping component, each token is mapped using a gazetteer compiled from DBpedia redirect labels, disambiguation labels and entity labels, each linked to its DBpedia entities. All labels are indexed in lowercase and linked to the list of candidate entities by exact match only. The researchers prioritized longer tokens over shorter ones to remove possible overlaps between tokens. For each entity candidate, both local and context-related features are considered via a pipeline of analysis scorers. Local features include the string distance between the candidate labels and the n-gram, the origin of the label, its DBpedia type, the candidate's link graph popularity, the level of uncertainty of the token, and the surface form that matches best. Context-related features, on the other hand, assess the relation between a candidate entity and the other candidates within a given context.
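The shingling and gazetteer lookup steps of this approach can be sketched as follows. The sketch makes several assumptions: n-grams are limited to three tokens, the gazetteer is a plain dictionary with made-up entries, and overlaps are resolved greedily in favour of the longest match. The function names are hypothetical and the scoring pipeline is not modelled.

```python
def shingles(tokens, max_n=3):
    """Yield (start, end, text) for every token n-gram of up to max_n tokens."""
    for start in range(len(tokens)):
        for end in range(start + 1, min(start + max_n, len(tokens)) + 1):
            yield start, end, " ".join(tokens[start:end])


def map_candidates(tokens, gazetteer, max_n=3):
    # Exact match on lowercased labels only.
    matches = [
        (start, end, gazetteer[text.lower()])
        for start, end, text in shingles(tokens, max_n)
        if text.lower() in gazetteer
    ]
    # Prefer longer n-grams, then drop any match overlapping one already kept.
    matches.sort(key=lambda m: (-(m[1] - m[0]), m[0]))
    kept, used = [], set()
    for start, end, entities in matches:
        span = set(range(start, end))
        if not span & used:
            kept.append((start, end, entities))
            used |= span
    return sorted(kept)


# Hypothetical gazetteer: lowercased label -> candidate DBpedia entities.
gazetteer = {
    "queensland": ["dbp:Queensland"],
    "university of queensland": ["dbp:University_of_Queensland"],
    "brisbane": ["dbp:Brisbane"],
}
tokens = "University of Queensland in Brisbane".split()
result = map_candidates(tokens, gazetteer)
```

In this example the shorter match "Queensland" is discarded because it overlaps the longer match "University of Queensland", which mirrors the longest-match priority described above.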
Examples of context-related features are direct links to other context candidates in the DBpedia link graph, co-occurrence of other tokens' surface forms in the Wikipedia article of the candidate under consideration, co-references in the Wikipedia article, and further graph-based features of the link graph induced by all candidates of the context graph, including graph distance measurements, connected component analysis, and centrality and density observations. The candidates are then sorted by a confidence score based on how well an entity describes a mention. If the confidence score is lower than the chosen threshold, a NIL referent is annotated.

[19] proposed using lexical and n-gram features to look up resources in DBpedia. The entity type is assigned by a Conditional Random Forest (CRF) classifier trained on DBpedia-related features (local features), word embeddings (contextual features), temporal popularity knowledge of an entity extracted from Wikipedia page view data, string similarity measures between the title of the entity and the mention (string distance), and linguistic features, with an additional pruning stage to increase the precision of entity linking. The system is split into five stages: pre-processing, mention candidate generation, mention detection and disambiguation (candidate selection), NIL detection, and entity mention typing prediction. In the pre-processing stage, tweet tokenisation and part-of-speech tagging are based on the ARK Twitter Part-of-Speech Tagger, together with the tweet timestamps extracted from tweet IDs. The researchers used an in-house mention-entity dictionary of acronyms. This dictionary computes the n-grams (n

[20] proposed an entity linking technique to link named entity mentions appearing in Web text with their corresponding entities in a knowledge base.
The solution described is to employ a knowledge base. The vast knowledge shared among communities and the development of information extraction techniques have made automated large-scale knowledge bases possible. Because rich information about the world's entities, their relationships, and their semantic classes can all be populated into a knowledge base, relation extraction techniques are vital for obtaining web data that promotes the discovery of useful relationships between entities extracted from text. One possible approach is to map the extracted entities to a knowledge base before they are populated into it. The goal of entity linking is to map every textual entity mention m ∈ M to its corresponding entry e ∈ E in the knowledge base. When an entity mentioned in text has no corresponding record in the given knowledge base, a NIL referent is assigned as a special label for un-linkable mentions. The paper suggests that named entity recognition and entity linking be performed jointly so that the two processes strengthen one another.

One module discussed in the paper is candidate entity generation. Its objective is to filter out, for each extracted entity, the irrelevant entities in the knowledge base and to retrieve a list of candidates that the extracted entity might refer to. The paper describes three kinds of techniques for this goal. The first is name dictionary based techniques, which draw on entity pages, redirect pages, disambiguation pages, bold phrases from first paragraphs, and hyperlinks in Wikipedia articles. The second is surface form expansion from the local document, which consists of heuristic-based methods and supervised learning methods. The third is methods based on search engines.
For candidate entity ranking, five categories of methods are described: supervised ranking methods, unsupervised ranking methods, independent ranking methods, collective ranking methods and collaborative ranking methods. Lastly, the paper describes how to evaluate entity linking systems using precision, recall, F1-measure and accuracy. Despite all the methods proposed within the three main approaches to entity linking, the paper states that it is still unclear which techniques and systems are best, because different entity linking systems perform differently across datasets and domains.

[21] proposed a new versatile algorithm based on multiple additive regression trees called S-MART (Structured Multiple Additive Regression Trees), which emphasizes non-linear tree-based models and structured learning. The framework generalizes Multiple Additive Regression Trees (MART) and adapts it to structured learning. The algorithm was tested on entity linking, focusing primarily on tweet entity linking, and was evaluated in both IE and IR settings. Non-linear models are shown to perform better than linear models in the IE setting. In the IR setting, however, the results are similar except for LambdaRank, a neural-network-based model. Adopting a polynomial kernel further improves the entity linking performance of the non-linear SSVM. The paper showed that entity linking of tweets performs better with tree-based non-linear models than with the alternative linear and non-linear methods in both IE-driven and IR-driven evaluations. In the experiments conducted, the S-MART framework outperforms the current state-of-the-art entity linking systems.

2.4. Entity Linking and Knowledge Base

[22] proposed an approach to free-text relation extraction.
The system was trained to extract entities from text in cooperation with an existing large-scale knowledge base. It learns low-dimensional embeddings of words, entities and relationships from the knowledge base with respect to score functions. It builds on the usual practice of employing weakly labelled text mention data, but with a modification that also extracts triples from existing knowledge bases. By generalizing from the knowledge base, it can learn the plausibility of a new triple (h, r, t), where h is the left-hand side entity (or head), t the right-hand side entity (or tail) and r the relationship linking them, even when this specific triple does not exist. By using all knowledge base triples rather than training only on (mention, relationship) pairs, the precision of relation extraction was shown to improve significantly.

[1] presented a novel system for named entity linking over microblog posts that leverages the linked nature of DBpedia as a knowledge base and uses graph centrality scoring as a disambiguation method to overcome polysemy and synonymy problems. The authors' motivation is the assumption that, because tweets are topic specific, related entities tend to appear in the same tweet. Since the system tackles noisy tweets, acronym handling and hashtags were integrated into the entity linking process. The system was compared with TAGME, a state-of-the-art named entity linking system designed for short text. The results showed that it outperformed TAGME in Precision, Recall and F1, with 68.3%, 70.8% and 69.5% respectively.

[23] presented an automated method to populate a Web-scale probabilistic knowledge base called Knowledge Vault (KV), which combines extractions from Web sources such as text documents (TXT), HTML trees (DOM), HTML tables (TBL), and human-annotated pages (ANO).
KV stores RDF triples (subject, predicate, object), each associated with a confidence score that represents the probability that KV believes the triple is correct. All four extractors are merged into one system called FUSED-EX by constructing a feature vector for each extracted triple, to which a binary classifier is then applied to compute this probability. The advantage of this fusion extractor is that it can learn the relative reliability of each system and build a model of those reliabilities. The benefits of combining multiple extractors include 7% higher confidence triples and a high AUC score of 0.927 (AUC being the probability that a classifier ranks a randomly chosen positive instance above a randomly chosen negative one). To overcome the unreliability of facts extracted from the Web, prior knowledge is used. In this paper, Freebase is used to fit the prior models. The paper proposes two approaches: the path ranking algorithm, with an AUC score of 0.884, and a neural network model, with an AUC score of 0.882. Fusing the two methods increased the AUC score to 0.911. Having shown the benefits of fusion quantitatively, the authors proposed a further fusion of the prior methods and the extractors to gain an additional performance boost. The result of this fusion is 271M high-confidence facts, 33% of which are new facts unavailable in Freebase.

[24] proposed TremenRank, a graph-based model for the target entity disambiguation challenge, the task of identifying target entities of the same domain. The system is motivated by the unreliability of current methods that rely on knowledge resources, the shortness of the context in which a target word occurs, and the large scale of the document collection. To overcome these challenges, TremenRank was built on the notion of collectively identifying target entities in short texts.
This reduces memory usage because the graph is constructed locally and scales linearly with the number of target entities. The graph is created locally using inverted index technology. Two types of index are used: the document-to-word index and the word-to-document index. Next, the collection of documents (the short texts) is modelled as a multi-layer directed graph that holds various trust scores obtained via propagation. The trust score indicates how likely a short text is to contain a true mention. A series of experiments was conducted on TremenRank, and the model is superior to the current advanced methods, with a 24.8% increase in accuracy and a 15.2% increase in F1.

[25] introduced a probabilistic fusion system called SIGMAKB that integrates strong, high-precision knowledge bases and weaker, noisier knowledge bases into a single monolithic knowledge base. The system uses the Consensus Maximization Fusion algorithm to validate, aggregate, and ensemble knowledge extracted from web-scale knowledge bases such as YAGO and NELL and 69 Knowledge Base Population. The algorithm combines multiple supervised classifiers (high-quality and clean KBs), motivated by distant supervision, and unsupervised classifiers (noisy KBs). Using this algorithm, a probabilistic interpretation of the results from complementary and conflicting data values can be presented to the user in a single response. Thus, using a consensus maximization component, the supervised and unsupervised data collected by the method above produce a final combined probability for each triple. The standardization of string named entities and the alignment of different ontologies are done in the pre-processing stage.
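The two inverted indexes that TremenRank uses to build its graph locally, described earlier in this section, can be illustrated with a small sketch. It is not the authors' implementation: the function names, the whitespace tokenisation and the sample texts are assumptions, and trust score propagation is left out entirely.

```python
from collections import defaultdict


def build_indexes(documents):
    """Build both indexes in one pass over a {doc_id: text} collection."""
    doc_to_words = {}
    word_to_docs = defaultdict(set)
    for doc_id, text in documents.items():
        words = set(text.lower().split())
        doc_to_words[doc_id] = words
        for word in words:
            word_to_docs[word].add(doc_id)
    return doc_to_words, dict(word_to_docs)


def neighbours(doc_id, doc_to_words, word_to_docs):
    """Documents sharing at least one word with doc_id, found via the indexes."""
    related = set()
    for word in doc_to_words[doc_id]:
        related |= word_to_docs[word]
    related.discard(doc_id)
    return related


# Hypothetical short texts.
documents = {
    "t1": "apple launches new phone",
    "t2": "apple pie recipe",
    "t3": "new phone review",
}
doc_to_words, word_to_docs = build_indexes(documents)
```

Because the neighbourhood of a document is looked up on demand from the two indexes, only the part of the graph around the documents of interest has to be held in memory, which is the property the description above attributes to local construction.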
Project plan

Semester 1

Task | Start | End | Duration (days) | Milestone
Research: | | | | 23/03/2017
Twitter Call | 27/02/2017 | 02/03/2017 | 4 |
Entity Recognition | 27/02/2017 | 02/03/2017 | 4 |
Entity Extraction | 02/03/2017 | 02/03/2017 | 7 |
Entity Linking | 09/03/2017 | 16/03/2017 | 7 |
Knowledge Base Fusion | 16/03/2017 | 23/03/2017 | 7 |
Proposal | 27/02/2017 | 30/03/2017 | 30 | 30/03/2017
Crawling Twitter data using Public Stream API | 31/03/2017 | 15/04/2017 | 15 | 15/04/2017
Collect Twitter data for training purp

Monday, October 14, 2019
