Friday, June 30, 2017


Week 6



The plan for this week was to extend the gradient boosting framework to support categorical data and to plug in a regression tree for it.

Days 1 and 2 (research days):
  • Studied the representation of categorical data in ECL.
  • Reformatted the gradient boosting interface to support categorical data.
  • Generated stub methods for gradient boosting on ordinal data.
Day 3:
  • Implemented the partitioning function for categorical data in regression (a sketch of the idea follows the plan).
Day 4:
  • Completed the implementation of the regression tree.
  • Plugged it into the gradient boosting framework.
Day 5:
  • Tested and verified the framework.
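
For the curious, here is a minimal Python sketch (not the ECL code; the function name is just illustrative) of the idea behind the categorical partitioning: order the category values by their mean target and score candidate binary partitions by weighted variance.

```python
# Illustrative sketch of a categorical split for a regression tree:
# categories are ordered by mean target, then prefix partitions are
# scored by the weighted variance of the two sides.
import numpy as np

def best_categorical_split(categories, y):
    cats = np.asarray(categories)
    y = np.asarray(y, dtype=float)
    # Order the category values by the mean of the target within each category.
    order = sorted(np.unique(cats), key=lambda c: y[cats == c].mean())
    best_set, best_score = None, np.inf
    for i in range(1, len(order)):
        left_set = set(order[:i])
        mask = np.isin(cats, list(left_set))
        left, right = y[mask], y[~mask]
        score = len(left) * left.var() + len(right) * right.var()
        if score < best_score:
            best_set, best_score = left_set, score
    return best_set, best_score

# Toy example: 'a' and 'b' carry low targets, 'c' carries high targets.
print(best_categorical_split(['a', 'b', 'a', 'c', 'c', 'b'],
                             [1.0, 1.2, 0.9, 5.0, 5.5, 1.1]))
```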

Friday, June 23, 2017


Week 5



The plan for this week was to develop a generic framework for gradient boosting and to integrate the regression and classification trees into it.

Days 1 and 2:
  • Developed stub methods and created generic modules for future implementations of decision tree regression.
Day 3:
  • Encountered a bug: gradient boosting worked for regression but not with regression trees.
  • The module worked perfectly when the linear regression learner was plugged in but failed when the regression tree was plugged in.

Day 4: 
  • Tracked down the bug I encountered yesterday.
  • Such a noob mistake: while setting up the inheritance, I was using the linear regression Predict function for the regression tree.
  • Fixed it and implemented a generic framework for gradient boosting regression.
Day 5:
  • Similarly implemented a generic framework for gradient boosting classification; a sketch of the pluggable design follows.
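
A minimal Python sketch of the pluggable design I was aiming for; the class and method names are illustrative stand-ins, not the ECL-ML interface. Any base learner that exposes fit/predict can be dropped in.

```python
# Sketch of a generic gradient boosting regressor with a pluggable base learner.
import numpy as np

class GradientBoostingRegressor:
    def __init__(self, make_learner, n_rounds=50, learning_rate=0.1):
        self.make_learner = make_learner      # factory for a fresh base learner
        self.n_rounds = n_rounds
        self.learning_rate = learning_rate
        self.learners = []

    def fit(self, X, y):
        y = np.asarray(y, dtype=float)
        self.base = float(y.mean())           # initial constant prediction
        pred = np.full(len(y), self.base)
        for _ in range(self.n_rounds):
            residual = y - pred               # negative gradient of squared loss
            learner = self.make_learner()     # any learner with fit/predict plugs in
            learner.fit(X, residual)
            pred += self.learning_rate * learner.predict(X)
            self.learners.append(learner)
        return self

    def predict(self, X):
        pred = np.full(len(X), self.base)
        for learner in self.learners:
            pred += self.learning_rate * learner.predict(X)
        return pred

class MeanLearner:
    """Trivial stand-in base learner: predicts the mean of its training targets."""
    def fit(self, X, y):
        self.mu = float(np.mean(y))
        return self
    def predict(self, X):
        return np.full(len(X), self.mu)

model = GradientBoostingRegressor(MeanLearner).fit([[0], [1], [2]], [1.0, 2.0, 3.0])
print(model.predict([[1]]))
```

Swapping MeanLearner for a regression tree (or, in the ECL version, whichever module implements the same interface) is the whole point of the generic framework.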

Friday, June 16, 2017


Week 4



The plan for this week was to develop a decision tree for regression.

Day 1:
  • I spent the whole day studying how the decision tree is implemented in ECL. It is a complex piece of code, and since I am new to ECL there was a lot to take in.
  • I realized that the core logic of the splitting criterion is a very small part; most of the code is data transformation.
  • Once that sank in, I knew what to do.

Day 2:
  • Choosing the splitting criterion was the next challenge.
  • The simplest decision tree to implement, I think, is the standard-deviation-based ID3 algorithm, so I spent the day reading about it.
  • I also implemented the stub methods for the decision tree.

Day 3:
  • The day was spent implementing the splitting criterion, and I completed it (a rough sketch follows).
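
Roughly, the criterion picks the threshold that yields the largest drop in the standard deviation of the target. A small Python sketch of the idea (not the ECL implementation):

```python
import numpy as np

def std_reduction_split(x, y):
    """Pick the threshold on one numeric feature that maximizes the
    reduction in weighted standard deviation of the target y."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    base = y.std()
    best_thr, best_gain = None, 0.0
    for thr in np.unique(x)[:-1]:            # candidates; keeps both sides non-empty
        left, right = y[x <= thr], y[x > thr]
        weighted = (len(left) * left.std() + len(right) * right.std()) / len(y)
        gain = base - weighted               # standard deviation reduction
        if gain > best_gain:
            best_thr, best_gain = thr, gain
    return best_thr, best_gain

print(std_reduction_split([1, 2, 3, 10, 11], [1.0, 1.1, 0.9, 5.0, 5.2]))
```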

Day 4: 
  • The first half of the day was spent attending a talk by Jamie Buckley, the Chief Product Officer.
  • He gave a nice talk on his journey in the tech industry and all the cutting-edge technology at LexisNexis.
  • I was really impressed by the work in Machine Learning especially the Socrates project.
  • I spent the rest of the day completing the learning and prediction functions.

Day 5:
  • There was a small bug in the predict function and I spent some time fixing it.
  • I implemented a small test to verify the decision tree implementation and compared it against linear regression (a sanity-check sketch follows).
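
To illustrate the kind of sanity check, here is a rough sketch using scikit-learn as a stand-in for the ECL implementations: on a clearly non-linear target, the tree should beat the straight line.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)   # non-linear relation

tree = DecisionTreeRegressor(max_depth=4).fit(X, y)
line = LinearRegression().fit(X, y)
print("tree MSE:  ", mean_squared_error(y, tree.predict(X)))
print("linear MSE:", mean_squared_error(y, line.predict(X)))
```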

Friday, June 9, 2017


Week 3



The plan for this week was to develop gradient boosting for classification.

Day 1:
Day 2:
  • A rather simple day. I set up the basic skeleton and stubs for classification in gradient boosting.
  • I also had a call with Satya, who gave me access to a large legal dataset to run my text mining experiments on.
Day 3:
  • I ended up with a strange bug. In ECL, a definition cannot have the same name as a record attribute; I was getting unexpected results while filtering, and this turned out to be the reason. Vivek again helped me debug it.
  • I completed the implementation of gradient boosting for classification.
Day 4: 
  • I wrote test cases for classification.
  • I went through the data given by Satya and tried to make sense of it.
Day 5:
  • I set up the environment to perform some text mining.
  • Downloaded the dataset and saved it in MongoDB (a rough loading sketch follows).
  • It has over 1.5 million documents.
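
For reference, a rough sketch of how the documents could be loaded into MongoDB with pymongo; the database/collection names and the input file are placeholders, not the actual setup.

```python
# Hypothetical loader: stream a JSON-lines dump into MongoDB in batches.
import json
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
collection = client["legal_corpus"]["documents"]   # placeholder names

with open("documents.jsonl") as fh:                # one JSON document per line
    batch = []
    for line in fh:
        batch.append(json.loads(line))
        if len(batch) == 10000:
            collection.insert_many(batch)
            batch = []
    if batch:
        collection.insert_many(batch)

print(collection.count_documents({}))
```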

Friday, June 2, 2017


Week 2 (Getting the feet wet)



Back to work on Tuesday. My conference went well, and I made some great research contacts.

Day 1:
  • My desktop finally arrived, so I had to set it up. A big task: I was so used to the UNIX ecosystem that I had actually forgotten how Windows feels.
  • Nevertheless, I spent the day setting up the environment and finishing all the training tutorials.

Day 2:
  • Started studying the effect of different loss functions on computing the gradient.
  • Came across a loss function, new to me, that handles outliers gracefully: the Huber loss. Would recommend it to all (a sketch follows).
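
A small Python sketch of the Huber loss and its negative gradient (the pseudo-residual a boosting round would fit); plain numpy, just to show how an outlier's pull is capped at delta.

```python
import numpy as np

def huber_loss(y, pred, delta=1.0):
    """Quadratic near zero, linear for large errors -> robust to outliers."""
    err = y - pred
    return np.where(np.abs(err) <= delta,
                    0.5 * err ** 2,
                    delta * (np.abs(err) - 0.5 * delta))

def huber_negative_gradient(y, pred, delta=1.0):
    """Pseudo-residuals for boosting with the Huber loss."""
    err = y - pred
    return np.where(np.abs(err) <= delta, err, delta * np.sign(err))

y = np.array([1.0, 2.0, 50.0])            # last point is an outlier
pred = np.array([1.2, 1.8, 3.0])
print(huber_negative_gradient(y, pred))   # the outlier's gradient is capped at delta
```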

Day 3:
  • Realized that regression is required for both classification- and regression-based gradient boosting. For classification, the gradient is computed on the predicted probability of a class, so we need regression methods to power the classification methods (see the sketch after this list).
  • The regression tree is not implemented in ECL. I might have to implement it myself, and I am excited about the prospect.
  • Thought of building a proof of concept using a classifier built on linear regression.
  • Set up stub methods for gradient boosting classifiers.
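
To make that concrete, here is a tiny Python sketch (illustrative only, not the ECL code) of one boosting step for binary classification with the log loss: the pseudo-residuals y - p are real-valued, which is exactly what a regression learner can fit.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

y = np.array([0.0, 1.0, 1.0, 0.0])   # class labels
score = np.zeros_like(y)             # current log-odds F(x), starts at 0
p = sigmoid(score)                   # predicted class probabilities
pseudo_residuals = y - p             # negative gradient of the log loss
print(pseudo_residuals)              # real-valued -> fit a regression learner to these
```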

Day 4: 
  • Completed the stub methods as planned.
  • Committed the source code to GitHub.

Thursday, June 1, 2017


Week 1

It's really strange that it was my first week and I was only going to be there for the first day. I had a paper to present at the International Conference on Software Engineering (ICSE 2017) and had to be out of town, so I was working remotely for the first week.

Week 0:

Since I was away for the first week, I decided to spend some time doing the things you generally do in your first week.
  • I went through the manual for ECL and ECL-ML.
  • Set up the environment on my local machine.
    • This was a hard one. Initially, I thought of setting it up on my Mac, but ECL IDE support on a Mac is not great.
    • Alternatively, I could code in a browser or at the command line running inside a VM.
    • The best option is to do things the way they were built to be done: ECL development is typically done on a Windows machine.
    • So I installed Windows on my machine using Boot Camp and followed the setup instructions on the HPCC Systems website. It worked seamlessly.
  • I played around with the ECL environment, finished the tutorials, and devised my plan for approaching the project.

Week 1:

The first day is always great. There was a great induction program with a bunch of other interns: swag and free food, a small tour of the building, some paperwork, and I was done for the day (for the week :P). My desk space was really great. I appreciate the initiative of my mentors and the support of HR in both Raleigh and Alpharetta in creating a great work environment for me.

Data is required for any machine learning task. During my first week I was writing my first real program in ECL: I prepared a continuous and a discrete dataset from the UCI Machine Learning Repository to test my algorithms. I thought this would be easy, but like most languages ECL has a steep learning curve.




Applying & Joining


I was pointed to the internship by Trish McCall.
About the Program:
  • HPCC Systems is a state-of-the-art solution for high-performance computing, and they have numerous open summer projects to choose from (https://goo.gl/ePbAcR).
  • It is a 10-12 week program, depending on your availability.
  • I was interested in the ECL projects, particularly ECL-ML, since machine learning is part of my research and, more importantly, I would get to implement a state-of-the-art machine learning algorithm for an industrial application.
  • My attention was drawn to the Gradient Boosting Algorithms project. The prime requirement was to implement Gradient Boosting Trees for Classification and Regression.
  • I wrote a detailed three-page proposal and sent it to the project manager (Dr. John Holt) and the intern program manager (Lorraine Chapman).
  • Luckily, they liked my proposal and offered me a position over the summer.
  • The program is really flexible, and LN offers you a choice of working either from their Alpharetta office or from home. I chose to work from home, as there was a research paper I was working on over the summer and I needed access to my lab. Lorraine was very supportive of this.
  • As part of my research I also needed to access some data from LN servers, which required me to be physically present in an LN office. Lorraine, Dr. Villanustre, and Dr. Holt were very supportive and arranged desk space for me in their Raleigh office.
Thus, I was all set to start my work @ LexisNexis for Summer 2017.