Abstract: MUCKE addresses the stream of multimedia social data with new, reliable knowledge extraction models designed for multilingual and multimodal data shared on social networks. It departs from current knowledge extraction models, which are mainly quantitative, by placing strong emphasis on the quality of the processed data, in order to protect the user not only from spam but also from an avalanche of equally (topically) relevant data. It does so through two central innovations: automatic user credibility estimation for multimedia streams and adaptive multimedia concept similarity.
Abstract: In recent years, data streams have become the main source of what we call big data today. In many important domains, such as health care, finance, sensor networks, and molecular biology, we no longer talk about very large data sets; we talk about data sets that grow continuously, at a rate of several million new entries a day. While most current data mining techniques have been successfully applied to very large databases, the open-ended nature of data streams, together with limited memory and time resources, renders traditional data mining approaches unsuitable. Consequently, we have been focusing on designing a data mining technique that, on the one hand, meets the single-pass constraint by reading the stream data only once and, on the other hand, can adapt to and take advantage of a distributed processing environment.
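The single-pass constraint mentioned above can be illustrated with a classic streaming technique (not the abstract's own algorithm, which is not detailed here): Welford's online method computes mean and variance while reading each stream element exactly once and keeping O(1) memory, regardless of stream length.

```python
class RunningStats:
    """Single-pass (Welford) mean/variance: each stream element is read
    exactly once, and memory stays O(1) regardless of stream length."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Population variance of the elements seen so far.
        return self.m2 / self.n if self.n > 0 else 0.0


stats = RunningStats()
for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(x)
print(stats.mean)      # 5.0
print(stats.variance)  # 4.0
```

Per-partition instances of such an accumulator can be merged, which is what makes this style of computation a natural fit for the distributed processing environment the abstract mentions.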
Abstract: Self-organizing construction principles are a natural fit for large-scale distributed systems in unpredictable deployment environments. These principles allow a system to systematically converge to a global state by means of simple, uncoordinated actions by individual peers. Indexing services based on the distributed hash table (DHT) abstraction have been established as a solid foundation for large-scale distributed applications. Most DHT algorithms assume explicit construction: the creation and maintenance of the overlay structure rely on the exploration and update of an existing, already stabilized structure. In this paper, we evaluate the practical interest of self-organizing principles, and in particular of gossip-based overlay construction protocols, for bootstrapping and maintaining various DHT implementations. Building on the seminal work on T-Chord, a self-organizing version of Chord using the T-Man overlay construction service, we contribute three additional self-organizing DHTs: T-Pastry, T-Kademlia, and T-Kelips. We conduct an experimental evaluation of the cost and performance of each of these designs when deployed on up to 600 nodes using a prototype implementation. Our conclusion is that, while providing equivalent performance in a stabilized system, self-organizing DHTs can sustain and recover from higher levels of churn than their explicitly created counterparts, and should therefore be considered a method of choice for deploying robust indexing layers in adverse environments.
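For readers unfamiliar with the DHT abstraction underlying Chord and its self-organizing variants, the following minimal sketch (all names hypothetical, not from the paper) shows the core idea: nodes and keys are hashed onto the same identifier ring, and a key is stored on its successor, the first node whose identifier follows the key's.

```python
import hashlib
from bisect import bisect_left

def ring_id(key, bits=16):
    """Map a string onto a 2**bits identifier ring (Chord-style)."""
    h = hashlib.sha1(key.encode()).digest()
    return int.from_bytes(h[:4], "big") % (1 << bits)

class Ring:
    """Toy DHT ring: a key is stored on its successor, i.e. the first
    node whose identifier is >= the key's identifier, wrapping around."""

    def __init__(self, nodes):
        self.ids = sorted(ring_id(n) for n in nodes)
        self.by_id = {ring_id(n): n for n in nodes}

    def successor(self, key):
        k = ring_id(key)
        i = bisect_left(self.ids, k)  # first node id >= k
        if i == len(self.ids):
            i = 0  # wrap around the ring
        return self.by_id[self.ids[i]]


ring = Ring(["node-a", "node-b", "node-c", "node-d"])
owner = ring.successor("some-key")  # always the same node for this key
```

What Chord, Pastry, Kademlia, and Kelips differ on, and what gossip-based construction replaces, is how each node discovers and maintains the routing pointers needed to find this successor without global knowledge.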
Abstract: Security guarantees are required features for cloud services. Among these, the preservation of user data privacy is critical. However, the need for service providers to perform computation over user data makes this guarantee hard to provide. This need often becomes a requirement for services that depend on the nature of the data, such as publish/subscribe (pub/sub) routing solutions. Most proposed solutions for privacy-preserving pub/sub fail to satisfactorily address scalability and dependability alongside privacy. In this short talk, we define a set of requirements that a scalable and dependable privacy-preserving pub/sub solution should meet.
Abstract: Projection Pursuit (PP) is a general framework for linear feature extraction, with wide applicability in dimensionality reduction, data visualization, classification, and outlier detection. Feature extraction algorithms such as PCA and ICA, and supervised classification algorithms such as LDA, are popular examples of algorithms that belong to the family of PP techniques. Generally, PP aims at identifying linear projections of the data that reveal interesting properties or distributions. After a review of popular PP techniques and applications, contributions in this area based on evolutionary algorithms will be described.
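The PP framing above can be made concrete with a minimal gradient-free sketch (an illustration of the general idea, not the evolutionary method the abstract announces): sample random unit directions and keep the one maximizing a chosen projection index. With variance as the index, this approximately recovers PCA's first component, showing how PCA is a special case of PP.

```python
import numpy as np

def pursue(X, index, trials=2000, seed=0):
    """Gradient-free projection pursuit sketch: try random unit
    directions w and keep the one maximizing index(X @ w)."""
    rng = np.random.default_rng(seed)
    best_w, best_val = None, -np.inf
    for _ in range(trials):
        w = rng.normal(size=X.shape[1])
        w /= np.linalg.norm(w)  # project onto the unit sphere
        val = index(X @ w)      # score the 1-D projection
        if val > best_val:
            best_w, best_val = w, val
    return best_w, best_val


# Synthetic data whose first axis carries most of the variance.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2)) @ np.array([[3.0, 0.0], [0.0, 0.5]])

# Variance as the projection index -> direction close to the first axis.
w, v = pursue(X, np.var)
```

Swapping `np.var` for a non-Gaussianity index such as kurtosis turns the same loop into an ICA-flavored pursuit; evolutionary algorithms replace the blind random search with a guided population-based one.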
Abstract: Due to rising energy costs, energy-efficient data centers have gained increasing attention in research and practice. Optimizations targeting energy efficiency are usually performed in isolation, either by producing more efficient hardware, by reducing the number of nodes simultaneously active in a data center, or by applying dynamic voltage and frequency scaling (DVFS). Energy consumption is, however, highly application dependent. We therefore argue that, for the best energy efficiency, it is necessary to combine different measures at both the programming and the runtime level.
As there is a trade-off between execution time and power consumption, we vary both independently to gain insight into how they affect total energy consumption. We choose frequency scaling to lower power consumption and heterogeneous processing units to reduce execution time. While the literature has already shown these options to be effective, the lack of energy-efficient software in practice suggests missing incentives for energy-efficient programming. Indeed, programming heterogeneous applications is a challenging task, owing to the different memory models of the underlying processors and the need to use different programming languages for the same tasks. We propose to use the Actor Model as a basis for efficient and simple programming, and we extend it to run seamlessly on either a CPU or a GPU. In a second step, we automatically balance the load between the available processing units. With heterogeneous actors we are able to save 40-80% of energy compared to CPU-only applications, while also improving programmability.
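The Actor Model the abstract builds on can be summarized in a few lines (a generic CPU-only sketch for illustration, not the authors' heterogeneous implementation): each actor owns private state and a mailbox, and processes one message at a time, so no shared-memory locking is needed.

```python
import threading
import queue

class Actor:
    """Minimal actor: a mailbox plus a worker thread that handles
    messages one at a time, so the behavior never races with itself."""

    def __init__(self, behavior):
        self.behavior = behavior
        self.mailbox = queue.Queue()
        self.thread = threading.Thread(target=self._run, daemon=True)
        self.thread.start()

    def send(self, msg):
        # Asynchronous, non-blocking: just enqueue the message.
        self.mailbox.put(msg)

    def _run(self):
        while True:
            msg = self.mailbox.get()
            if msg is None:  # poison pill: stop the actor
                break
            self.behavior(msg)

    def stop(self):
        self.mailbox.put(None)
        self.thread.join()


results = []
doubler = Actor(lambda msg: results.append(msg * 2))
for i in range(3):
    doubler.send(i)
doubler.stop()
print(results)  # [0, 2, 4]
```

Because actors interact only through messages, a runtime is free to place each actor on a CPU core or a GPU and to rebalance the load, which is the property the abstract's heterogeneous extension exploits.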
Abstract: Erasure codes have been widely used over the last decade to implement reliable data stores. They offer interesting trade-offs between efficiency, reliability, and storage overhead. Indeed, a distributed data store holding encoded data blocks can tolerate the failure of multiple nodes while requiring only a fraction of the space necessary for plain replication, albeit at an increased encoding and decoding cost. There now exist a number of libraries implementing several variations of erasure codes, which differ notably in complexity and implementation-specific optimizations. Seven years ago, Plank et al. conducted a comprehensive performance evaluation of the open-source erasure coding libraries available at the time, comparing their raw performance and measuring the impact of different parameter configurations. In the present experimental study, we take a fresh look at the state of the art of erasure coding libraries. Not only do we cover a wider set of libraries running on modern hardware, but we also consider their efficiency when used in realistic settings for cloud-based storage, namely when deployed across several nodes in a data centre. Our measurements therefore account for the end-to-end costs of data accesses over several distributed nodes, including the encoding and decoding costs, and shed light on the performance one can expect from the various libraries when deployed in a real system. Our results reveal important differences in the efficiency of the libraries, notably due to the type of coding algorithm and the use of hardware-specific optimizations.
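The space/reliability trade-off the abstract describes is easiest to see in the simplest possible erasure code, a single XOR parity block (real libraries of the kind evaluated here use Reed-Solomon-style codes that tolerate m losses, but the principle is the same): k data blocks plus one parity block survive the loss of any single block at only 1/k storage overhead, versus 100% for replication.

```python
def encode(blocks):
    """Append one XOR parity block over k equal-sized data blocks.
    Tolerates the loss of any single block at 1/k storage overhead."""
    parity = bytes(len(blocks[0]))
    for b in blocks:
        parity = bytes(x ^ y for x, y in zip(parity, b))
    return blocks + [parity]

def recover(coded, lost_index):
    """Rebuild the block at lost_index by XOR-ing all the surviving
    blocks (we skip the lost one, treating it as unavailable)."""
    out = bytes(len(coded[0]))
    for i, b in enumerate(coded):
        if i != lost_index:
            out = bytes(x ^ y for x, y in zip(out, b))
    return out


data = [b"abcd", b"efgh", b"ijkl"]
coded = encode(data)             # 4 blocks stored on 4 nodes
rebuilt = recover(coded, 1)      # node holding block 1 failed
print(rebuilt)  # b'efgh'
```

The encoding and decoding passes over every block are exactly the CPU cost the study measures, and they explain why hardware-specific optimizations (e.g., vectorized XOR and Galois-field arithmetic) separate the libraries in the end-to-end results.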