### Workshop Day: 14 August 2016

I missed the keynote because the earliest BART train reached Powell St only at 8:45. I wanted to attend the whole time-series workshop, but other interesting sessions competed for my attention. Nevertheless, anyone interested in time-series classification should look at the comprehensive evaluation by Anthony Bagnall et al. that was presented at the workshop. The key message remains the same as before (also from Bagnall): it is very hard to beat nearest-neighbor classification with DTW, which is fast and effective. But I have my reservations about the kind of “time-series” this may be applied to.

The outlier detection workshop was good. Jeff Schneider’s talk on “converting anomalies to known phenomenon” was good, particularly the part on non-parametric Renyi divergence methods for anomaly detection. Two other talks were intriguing. The first presented a custom tree algorithm (ARDT) for dealing with class imbalance: better split thresholds can be generated using Renyi entropy instead of the Shannon entropy that is commonly used for information gain in decision trees.

The other talk, “Fast and Accurate Kmeans Clustering with Outliers” by Shalmoli Gupta, was about dealing with outliers in the k-means setup. Instead of a two-step process of removing outliers and then applying k-means, two approaches were presented. The first, the sampling approach, samples each data point with probability 1/z, where z is the number of outliers in the dataset, and then runs k-means on this sample. Unless z is very large, this ensures that the resulting centroids do not depend on the outliers. The second approach solves a linear program to jointly discover centroids and outliers. They show empirically that the LP, although computationally more expensive, is not much better than the sampling approach. Hence, it is better to use the sampling approach for robust k-means. However, both approaches require knowledge of z, the number of outliers in the dataset. That is impractical.
Moreover, the experiments were performed with full knowledge of the number of outliers and the number of clusters in the dataset. In practice, neither can be known a priori.
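To make the sampling idea from the k-means talk concrete, here is a minimal sketch in Python with NumPy. It only illustrates the description above and is not the authors' code: the function names (`lloyd_kmeans`, `sampled_kmeans`, `flag_outliers`), the plain Lloyd's iterations, the toy data, and the final step of flagging the z points farthest from any centroid are all assumptions. Note that it needs both z and k as inputs, which is exactly the limitation raised above.

```python
import numpy as np


def lloyd_kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means; returns a (k, d) array of centroids."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct data points (fancy indexing copies).
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance of every point to every centroid, then nearest-centroid labels.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return centroids


def sampled_kmeans(points, k, z, seed=0):
    """Keep each point with probability 1/z, then run k-means on the sample."""
    rng = np.random.default_rng(seed)
    keep = rng.random(len(points)) < 1.0 / max(z, 1)
    sample = points[keep]
    if len(sample) < k:
        raise ValueError("sample has fewer than k points; z is too large for this dataset")
    return lloyd_kmeans(sample, k, seed=seed), sample


def flag_outliers(points, centroids, z):
    """Assumed post-processing: indices of the z points farthest from any centroid."""
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return np.argsort(dists.min(axis=1))[-z:]


# Toy data (hypothetical): three tight clusters plus 20 scattered outliers.
rng = np.random.default_rng(42)
clusters = np.vstack(
    [rng.normal(c, 0.5, size=(500, 2)) for c in [(0, 0), (10, 0), (0, 10)]]
)
noise = rng.uniform(-50, 50, size=(20, 2))
data = np.vstack([clusters, noise])

centroids, sample = sampled_kmeans(data, k=3, z=20)
suspects = flag_outliers(data, centroids, z=20)
```

With z = 20, the sample keeps roughly 5% of the points, so it is expected to contain about one outlier, which is why the centroids fitted on it are mostly driven by the real clusters.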

### Conference Day 1: 15 August 2016

The keynote from Jennifer Tour Chayes of Microsoft Research was about the need for a limiting theory for graphs, just as thermodynamics is for physics, how that led to the conceptualization of graphons, and the kinds of applications they might be useful in, especially understanding and generating large networks.

The best student paper award went to Christos Faloutsos’s team for their work on FRAUDAR, a graph-based approach for detecting fraudulent reviews and reviewers, even in the presence of camouflage, that is, when fraudsters masquerade as honest reviewers by hijacking their reviews.

In the large-scale data mining session, the talk on XGBoost, by far the most successful implementation of gradient boosted trees, highlighted improved accuracy, speed, scalability and portability over vanilla GBDTs. The improved accuracy is a result of the regularization term, while the improved speed comes from caching and sparsity-aware split finding.

Particularly interesting on the second day was the plenary panel on Deep Learning, “Is Deep Learning the new 42?”, a reference to the answer produced by the computer in Douglas Adams’ Hitchhiker’s Guide to the Galaxy. Prof. Jitendra Malik, Prof. Isabelle Guyon, Prof. Pedro Domingos, Prof. Nando de Freitas, and Prof. Jennifer Neville participated in the panel, which covered everything from interpretability, explainability, hype, data scarcity and energy consumption to other issues commonly associated with Deep Learning. Several anecdotal examples were given of both the failures and the successes of Deep Learning. A particularly notable point was that Deep Learning might be the latest craze, but it is an improvement. It may not stay in its current form, but will evolve into the future state of the art, such as Representation Learning.
Open challenges that are likely to become important in the coming years are causality, representation learning, explainability, privacy preservation and bias prevention, and my personal favorite, learning from less, aka unsupervised learning. Regarding energy consumption, it is important to note that the human brain is highly efficient and uses only about 20 watts. If we were to replicate the human brain with neuromorphic chips, though, we would need the entire energy supply of New York and San Francisco put together. We still have far to go. Some interesting comments were made about how algorithms or models are not biased, but rather stupid if they do not do the meaningful thing. A debate also arose on whether explainability is more important than accuracy or vice versa. For example, do you want a highly accurate, complex, non-interpretable algorithm that correctly detects cancer, or a less accurate, simpler and interpretable model? Interesting point. Moreover, there were suggestions of putting a decision tree on top of neural network outputs to give an impression of interpretability to someone who strongly desires it, while still doing complicated things under the hood.

The last session of invited talks on data science was not really useful to me, barring the first talk by Prof. Jeff Schneider, who highlighted the challenges of Active Optimization (also known as Design of Experiments or Bayesian Optimization).
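Going back to the XGBoost talk's point about the regularization term: in the XGBoost paper, each tree is penalized by γ times its number of leaves plus ½λ times the sum of squared leaf weights, and this changes both the optimal leaf weight and the gain of a candidate split. The sketch below computes these quantities for a single feature under squared-error loss. The function names and the toy data are made up for illustration, and the exhaustive scan ignores everything that makes the real system fast (caching, sparsity-aware split finding), so it is a sketch of the formulas and not XGBoost's implementation.

```python
def leaf_weight(G, H, lam):
    """Optimal leaf weight for gradient sum G and Hessian sum H with L2 penalty lam."""
    return -G / (H + lam)


def split_gain(G_left, H_left, G_right, H_right, lam=1.0, gamma=0.0):
    """Regularized gain of splitting a node into a left and a right child."""
    def score(G, H):
        return G * G / (H + lam)

    parent = score(G_left + G_right, H_left + H_right)
    return 0.5 * (score(G_left, H_left) + score(G_right, H_right) - parent) - gamma


def best_split(x, y, pred, lam=1.0, gamma=0.0):
    """Scan one feature; returns (threshold, gain) of the best split found."""
    # Squared-error loss 1/2 * (y - pred)^2: gradient = pred - y, Hessian = 1.
    order = sorted(range(len(x)), key=lambda i: x[i])
    xs = [x[i] for i in order]
    g = [pred[i] - y[i] for i in order]
    G_total, H_total = sum(g), float(len(g))

    best = (None, float("-inf"))
    G_left = H_left = 0.0
    for i in range(len(xs) - 1):
        G_left += g[i]
        H_left += 1.0
        if xs[i] == xs[i + 1]:
            continue  # cannot place a threshold between equal feature values
        gain = split_gain(G_left, H_left, G_total - G_left, H_total - H_left, lam, gamma)
        if gain > best[1]:
            best = ((xs[i] + xs[i + 1]) / 2.0, gain)
    return best


# Toy example (hypothetical): the target jumps from 0 to 10 between x=2 and x=3.
threshold, gain = best_split(x=[1, 2, 3, 4], y=[0, 0, 10, 10], pred=[0, 0, 0, 0], lam=1.0)
```

On this toy data the best threshold is 2.5. With λ = 1 the right leaf's weight is 20/3 ≈ 6.67, compared with 10 when λ = 0, which shows how the penalty shrinks each tree's contribution.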

### Conference Day 2: 16 August 2016

The day started with a keynote by George Papadopoulos, a researcher who in the last six years transitioned to being a venture capitalist. He gave a talk on big-data investments from the perspective of the VC community. I think the key takeaways were that funding usually follows value generation and that successful exits (going public!) are few and far between. It takes roughly 8.5 years for an investment to come to fruition. He also emphasized the importance of finding a good founding partner, as you are going to be stuck with him or her for at least 8.5 years :). He suggested that the more you take the human out of the loop, the more you are prized and valued. Merely creating analytics toolkits that involve a human, albeit needed, has the lowest value, while predictive analytics has moderate value. When asked about areas he does not invest in, he said that he does not invest in startups that go up against existing monopolies with sheer scale (for example Amazon AWS).

This was followed by a panel on VC insights in big-data investments. The panelists mostly echoed what Dr. Papadopoulos had said earlier, but made some key points. Do not worry about markets, money or funding; solve an incredibly hard problem that you see is currently unsolved, and everything else will fall into place. There was also a suggestion that technology or algorithmic prowess is hardly a differentiator anymore, given that most platforms are public or open source. It is the availability of, and exclusive access to, data that is the key differentiator.

Then I attended some talks on Deep Learning and Embedded systems, which had some interesting papers. There was the Smart Reply system from Google Gmail, which automatically suggests diverse, unique and intelligent responses on mobile and is already being used to answer 10% of all mobile emails for their users.
It was interesting to see a pipeline of simple, intuitive and proven technologies (by now, LSTM-based seq2seq models are nauseatingly common, but they work!!) being used for a concrete application that is actually deployed.

Then there was a talk by Jure Leskovec’s team on node2vec, a word2vec-like embedding generation algorithm for networks and graphs. If we can represent each node as a vector while still retaining some information about its neighborhood, many machine learning algorithms can be applied directly to such a vector representation of the graph. They had better results than spectral methods (matrix factorization of the adjacency matrix), but it is not clear to me how they get that. Reading required!

Then I attended some talks in the Unsupervised Learning and Anomaly Detection session, which again highlighted the difficulty of cracking this challenge. Two approaches were particularly of note. One used a semi-Markov model over a VAR (Vector Auto Regression)-based approach to discover phases of operation within each flight, and then used changes in the distributions of the VAR parameters to detect anomalies in flight. I did not attend the talk on the other, which received the runner-up best research paper award, but from what I could glean from my colleagues, it uses correlations between pairs of variables (the history of one variable to predict the future of another) and monitors changes to those. This is quite similar to what I have been doing, but I need to study their full method in more detail.

Finally, day 2 culminated in a much anticipated Turing lecture by Whitfield Diffie, co-inventor of the Diffie-Hellman key exchange. Starting from the 14th century, Dr. Diffie gave a whirlwind tour of the evolution of the field of cryptography and its antithesis, cryptanalysis.
He didn’t seem perturbed that quantum computing is around the corner (I do not know if it is!), and highlighted homomorphic approaches as a promising direction for cryptography. He ended with a very interesting question: “Does an individual have a right to secrecy (from the government)?” (or does the government have the authority to demand full disclosure?). Something to ponder given the current events around secrecy and the protection of personal information.
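On the node2vec question from earlier in the day (how a graph turns into word2vec-style input), the core of the published method is a biased second-order random walk: the walks play the role of sentences and are then fed to a skip-gram model. In the paper, a return parameter p and an in-out parameter q reweight the next step depending on whether it goes back to the previous node, stays within the previous node's neighborhood, or moves farther away. The sketch below covers only the walk. The function name, the adjacency-dict representation and the toy graph are assumptions for illustration, and the embedding training step is left out.

```python
import random


def node2vec_walk(adj, start, length, p=1.0, q=1.0, rng=None):
    """One biased random walk of up to `length` nodes over an adjacency dict."""
    rng = rng or random.Random(0)
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        neighbors = sorted(adj[cur])
        if not neighbors:
            break  # dead end
        if len(walk) == 1:
            # No previous node yet, so the first step is uniform.
            walk.append(rng.choice(neighbors))
            continue
        prev = walk[-2]
        weights = []
        for nxt in neighbors:
            if nxt == prev:            # distance 0 from prev: return step
                weights.append(1.0 / p)
            elif nxt in adj[prev]:     # distance 1 from prev: stay close
                weights.append(1.0)
            else:                      # distance 2 from prev: move outward
                weights.append(1.0 / q)
        walk.append(rng.choices(neighbors, weights=weights, k=1)[0])
    return walk


# Toy graph (hypothetical): two triangles joined by the edge 2-3.
adj = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}
walk = node2vec_walk(adj, start=0, length=10, p=0.25, q=4.0, rng=random.Random(1))
```

A small p with a large q keeps walks close to where they started (more BFS-like, capturing local structure), while the opposite setting pushes them outward (more DFS-like). Many such walks per node would then be passed to any skip-gram implementation to get the node vectors.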

### Conference Day 3: 17 August 2016

The last day of the conference was short but had a good start with a keynote by Nando de Freitas on recent advances in deep learning. While I was familiar with much of the work that was presented, there were many important ideas from the last two years that I did not have insight into: NPI, residuals, attention, identities, learning to learn. Nando provided examples from domains other than the usual suspect, images. The most surprising result for me was “learning to learn by gradient descent by gradient descent”, wherein an LSTM-based network was trained using gradient descent to learn the optimal way of doing gradient descent! In the several experiments they did, this automatically discovered optimization strategy outperformed many well-known, hand-crafted, theoretically sound and wildly popular approaches such as SGD, Nesterov’s accelerated gradient, Adagrad, Adam, etc. Wow! Is this the beginning of the end of the design of optimization algorithms? The only thing left is for the LSTM to now spit out a theorem that shows better guarantees than what Nesterov has spent his life on!

Apart from that, the day was lackluster and short. The only other work worth mentioning was by Carlos Guestrin’s team on “explaining away any classifier”, with the goal of building trust in the learnt model. How do we know that the model’s accuracy is high for the right reasons? A naive approach would be to just use an easily interpretable model, such as a decision tree, but that would mean compromising on accuracy. Complicated models are harder to explain, but more accurate. I need to study the paper, but prima facie it appears that their strategy marries the best of both worlds: use a complicated model to infer globally and to make predictions, but explain each prediction using a simpler model in the local neighborhood of the test instance.
I have my reservations about this approach, but at least someone is thinking about this crucial direction for the success of ML in domains that have “human experts” who will never believe a black box unless it agrees with their understanding.
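The "complicated model globally, simple model locally" idea can be sketched in a few lines. This is a simplified illustration of the general strategy as described above, not the authors' method: it perturbs the test instance with Gaussian noise, weights the perturbed points by proximity, and fits a weighted linear model to the black box's predictions. The stand-in `black_box` function, the function name `explain_locally`, the Gaussian perturbations, the exponential kernel and all parameter values are assumptions.

```python
import numpy as np


def black_box(X):
    """Stand-in for a complicated model: nonlinear in the first feature."""
    return X[:, 0] ** 2 + 3.0 * X[:, 1]


def explain_locally(predict, instance, n_samples=500, scale=0.1,
                    kernel_width=0.25, seed=0):
    """Fit a proximity-weighted linear surrogate around one instance.

    Returns (slopes, intercept) of the local linear model.
    """
    rng = np.random.default_rng(seed)
    instance = np.asarray(instance, dtype=float)

    # Perturb the instance and ask the black box for predictions.
    X = instance + rng.normal(0.0, scale, size=(n_samples, instance.size))
    y = predict(X)

    # Points closer to the instance get a larger weight.
    dist = np.linalg.norm(X - instance, axis=1)
    weights = np.exp(-(dist ** 2) / kernel_width ** 2)

    # Weighted least squares: scale rows by sqrt(weight), then ordinary lstsq.
    A = np.hstack([X - instance, np.ones((n_samples, 1))])
    sw = np.sqrt(weights)
    coef, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
    return coef[:-1], coef[-1]


slopes, intercept = explain_locally(black_box, instance=[2.0, 0.0])
```

Around the instance (2, 0), the slopes should come out close to 4 and 3, the local sensitivities of the stand-in model, even though the model is not linear globally. Whether such a local explanation deserves a domain expert's trust is a separate question, which is where the reservations above come in.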