Vigi4Med Scraper: A Framework for Web Forum Structured Data Extraction and Semantic Representation

Bissan Audeh

1 University of Lyon, MINES Saint-Étienne, CNRS, Hubert Curien Laboratory, UMR 5516, Saint-Étienne, France

Michel Beigbeder

1 University of Lyon, MINES Saint-Étienne, CNRS, Hubert Curien Laboratory, UMR 5516, Saint-Étienne, France

Antoine Zimmermann

1 University of Lyon, MINES Saint-Étienne, CNRS, Hubert Curien Laboratory, UMR 5516, Saint-Étienne, France

Philippe Jaillon

2 Ecole Nationale Supérieure des Mines de Saint-Étienne, Saint-Étienne, France

Cédric Bousquet

3 INSERM, U1142, LIMICS, Paris, France


The extraction of information from social media is an essential yet complicated step for data analysis in numerous domains. In this paper, we present Vigi4Med Scraper, a generic open source framework for extracting structured data from web forums. Our framework is highly configurable: using a configuration file, the user can freely choose the data to extract from any web forum. The extracted data are anonymized and represented in a semantic structure using Resource Description Framework (RDF) graphs. This representation enables efficient manipulation by data analysis algorithms and allows the collected data to be directly linked to any existing semantic resource. To avoid overloading servers, an integrated proxy with caching functionality imposes a minimal delay between sequential requests. Vigi4Med Scraper represents the first step of Vigi4Med, a project to detect adverse drug reactions (ADRs) from social networks funded by the French drug safety agency Agence Nationale de Sécurité du Médicament (ANSM). Vigi4Med Scraper has successfully extracted more than 200 gigabytes of data from the web forums of over 20 different websites.

1 Introduction

The extraction of useful information from websites, referred to as scraping [1], is a challenging task on several levels due to the large amount of information available on the internet. First, a scraping system must access web pages efficiently by avoiding non-informative data and duplicate pages. Then, only useful data should be detected and extracted. The extracted data should be represented in an exploitable structure to facilitate data analysis. Privacy is another major concern when manipulating web data [2]. Protecting the identity and private life of the user should be taken into consideration [3, 4], particularly in sensitive domains such as health [5], because an increasing number of users today are sharing their personal information on social media such as web forums.

A web forum is a virtual platform for voicing personal and communal opinions, comments, experiences, thoughts, and sentiments [6]. Extracting data from these online communities can produce rich and diverse knowledge resources [7, 8]. The specificity of web forums is that they share a common layout. In particular, the posts are presented in chronological order and organized within threads. This well-organized structure is very useful for targeting specific data within forums. In general, data extraction from web forums involves retrieving the links that lead to threads or posts and obtaining the actual data objects of those threads and posts. A data object can be any information related to user participation in the forum, such as the publication date, the author's pseudonym, and the post title or content.

In this paper, we present Vigi4Med Scraper, a framework that extracts data objects from web forums and represents them in a semantic structure while maintaining the user's privacy. The Vigi4Med Scraper framework consists of three main blocks: data extraction from web forums, semantic data representation, and anonymization. Whereas each of these functionalities corresponds to an active research field, we combine them in a highly configurable solution. Our system generates anonymized semantic graphs from any forum-like website according to a user-defined configuration file. With this configuration file, the user can freely specify the desired segments of the forum to extract and indicate the correspondence between these segments and the desired semantic components. This flexibility in choosing the segments of data to extract allows Vigi4Med Scraper to handle any forum-like website, which positions it as a generic solution for data extraction from web forums.
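The correspondence between extracted segments and semantic components can be illustrated with a minimal sketch. It assumes a hypothetical configuration (a plain dictionary, not the framework's actual file format) that maps field names to RDF predicate IRIs, and it serializes one post as N-Triples lines. The predicates from Dublin Core Terms and SIOC, the `example.org` IRI, and the function names are illustrative assumptions only.

```python
# Hypothetical configuration: extracted field name -> RDF predicate IRI.
# The vocabulary choice is an assumption made for this sketch.
CONFIG = {
    "title": "http://purl.org/dc/terms/title",
    "date": "http://purl.org/dc/terms/created",
    "content": "http://rdfs.org/sioc/ns#content",
}


def escape_literal(text):
    """Escape the characters that N-Triples requires to be escaped in literals."""
    return (text.replace("\\", "\\\\").replace('"', '\\"')
                .replace("\n", "\\n").replace("\r", "\\r"))


def post_to_ntriples(post_iri, fields, config=CONFIG):
    """Emit one N-Triples line per configured field present in the post."""
    lines = []
    for name, predicate in config.items():
        if name in fields:
            literal = escape_literal(fields[name])
            lines.append('<%s> <%s> "%s" .' % (post_iri, predicate, literal))
    return lines


# Usage with an illustrative post (all values are made up).
triples = post_to_ntriples(
    "http://example.org/post/1",
    {"title": "Side effects?", "date": "2016-03-01", "content": 'He said "ok"'},
)
for line in triples:
    print(line)
```

Keeping the mapping in data rather than in code is what would let a user retarget the same extraction logic to another forum or another vocabulary by editing only the configuration.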

Vigi4Med Scraper was used within a pharmacovigilance project. Pharmacovigilance is defined by the World Health Organization as “the science and activities relating to the detection, assessment, understanding and prevention of adverse effects or any other drug-related problem” [9]. In this domain, analyzing web forums is an adequate way to generate new knowledge about adverse drug reactions (ADRs) [10]. The task imposes two strict requirements on the data extraction policy: protecting the privacy of forum users and preserving the performance of the targeted sites. Protecting user privacy is critical when handling private health data; however, most existing web crawling and data extraction approaches blindly gather all types of information without any consideration of privacy. In addition, medical forums are very popular and heavily used. Thus, the basic requirement of preserving the performance of the crawled websites should be rigorously fulfilled, and a particularly respectful attitude towards the servers hosting medical forums should be adopted.

This paper is organized as follows. We begin by presenting an overview of related work in Section 2. The overall structure of Vigi4Med Scraper is described in detail in Section 3. Section 4 shows how the framework was applied to the Vigi4Med project. A discussion comparing our system with previous work is provided in Section 5. Finally, the availability of Vigi4Med Scraper and future directions are presented in Section 6.

2 Related Work

Obtaining data objects from web forums involves crawling for informative pages and extracting structured data to precisely retrieve the data of interest within a page. Crawling web forums has been addressed in several studies [11, 12]. The Board Forum approach [13] simulates the natural process of navigating through a forum. It starts by collecting the links from the home page and continues through the lower levels (“board”, “thread”) until it reaches the “post” level. This approach does not extract data objects, as it does not process the structure of the collected pages. Later, iRobot was proposed by [14] to crawl web forums. This approach has an offline component that extracts the sitemap (or link skeleton) from sample pages and attempts to find the optimal traversal path from one page to another to avoid duplicate pages. The initial version of iRobot did not retrieve specific data objects from web forums, but an extension was proposed in [15, 16]. This newer approach also used an offline sampling mode to build the sitemap; however, it explicitly considered page-flipping links, allowing it to recognize posts belonging to the same thread (or threads belonging to the same forum), even if they were split across several HTML pages. Although [16] also retrieves data objects from web forums, the performance of iRobot depends heavily on the quantity and quality of the sampled pages. Furthermore, iRobot was reported to lack robustness by [17], who proposed a new approach called FoCUS (Forum Crawler Under Supervision). FoCUS [17] learns regular expression patterns to extract the main features from a sample collection and uses these expressions to guide online crawling. The authors evaluated their approach for large-scale crawling. A selected number of data objects were used as features for a Support Vector Machine (SVM) classifier. The data objects were chosen to help the classifier distinguish a board page from a thread page, but the extraction of these data objects was not the final aim of the approach.

Approaches that extract structured data from web pages have been extensively studied. The procedures implemented to achieve structured data extraction are called wrappers [18]. Several mechanisms, such as regular expressions and tree-based methods, can be used to generate a wrapper. The Document Object Model (DOM) is commonly used to extract data from web pages. A DOM tree represents the page's information in a structure that can be exploited by special queries (XPath queries). Although manual approaches allow one to specify the data of interest, they rely strongly on users with adequate technical expertise. Automatic approaches were introduced to reduce the amount of user effort required for this task. The majority of these approaches still require human intervention to label training examples (e.g., [19]). Fully automatic approaches attempt to detect nested or repeated patterns to target interesting content (e.g., [20]), but they suffer from a higher risk of extracting non-informative data and are difficult to customize.
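As an illustration of a manually specified, tree-based wrapper, the following minimal sketch queries a small page fragment with XPath-style expressions. The markup, class names, and field names are hypothetical, and the example relies on the limited XPath subset of Python's standard `xml.etree.ElementTree`, which requires well-formed input; real forum pages would generally need a tolerant HTML parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed forum page fragment used only for this sketch.
PAGE = """
<html><body>
  <div class="post">
    <span class="author">alice42</span>
    <span class="date">2016-03-01</span>
    <p class="content">First message.</p>
  </div>
  <div class="post">
    <span class="author">bob_7</span>
    <span class="date">2016-03-02</span>
    <p class="content">Second message.</p>
  </div>
</body></html>
"""

# Hand-written wrapper: one query locating each post node, then one query
# per data object, evaluated relative to that post node.
POST_XPATH = ".//div[@class='post']"
FIELD_XPATHS = {
    "author": "span[@class='author']",
    "date": "span[@class='date']",
    "content": "p[@class='content']",
}


def extract_posts(page_source):
    """Return one dictionary of data objects per post found in the page."""
    root = ET.fromstring(page_source)
    posts = []
    for node in root.findall(POST_XPATH):
        posts.append({name: node.findtext(path)
                      for name, path in FIELD_XPATHS.items()})
    return posts


posts = extract_posts(PAGE)
for post in posts:
    print(post)
```

The sketch also shows the trade-off described above: the queries select exactly the data of interest, but writing them requires knowing the page structure in advance.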

None of the previous studies focused on the semantic representation of data or on privacy. Semantic representation allows for a powerful and flexible description of knowledge. Since the emergence of the Semantic Web [21], this type of representation, which is based on concepts and semantic relations, has garnered significant interest, particularly in the medical domain [22]. Privacy is a main concern in web forum crawling. Protecting the privacy of web-collected data is a complicated issue [23]. Any information that can identify a specific user should not be easily exposed, particularly when working with health data within a medical domain such as pharmacovigilance. One way to protect privacy is to anonymize sensitive data. In the literature, in addition to basic pseudonymization (substituting an identifier with a key), several approaches exist for anonymization, such as k-anonymity [24, 25] and differential privacy [26]. The choice of an anonymization algorithm depends on the setting of the application. In particular, it depends on who has access to which part of the data, and on whether the anonymization is intended to be reversible or not [27].
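Basic pseudonymization can be sketched as follows. This is only one possible realization, assuming a keyed hash (HMAC-SHA256) as the substitution mechanism; the secret key, the `user-` prefix, and the truncation length are hypothetical choices, and nothing here is taken from the anonymization scheme of any specific system.

```python
import hashlib
import hmac

# Hypothetical secret; in practice it would be stored outside the code base.
SECRET_KEY = b"replace-with-a-secret-key"


def pseudonymize(identifier, key=SECRET_KEY):
    """Replace an identifier with a stable key derived from a keyed hash.

    The same identifier always yields the same pseudonym, so posts by one
    author remain linkable, but the pseudonym does not reveal the identifier.
    """
    digest = hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()
    return "user-" + digest[:16]


# Usage with made-up forum pseudonyms.
print(pseudonymize("alice42"))
print(pseudonymize("bob_7"))
```

A keyed hash of this kind cannot be reversed from the pseudonym alone. If reversibility were required, one option would be to store a protected lookup table from pseudonyms to identifiers instead.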

With respect to previous research, each of the aforementioned web forum data extraction approaches lacks at least one of the following elements:
