Friday, August 11, 2017

Last Week

My last week at LexisNexis for my Summer 2017 internship, and a glorious run culminates.

The plan for this week was to give the final intern presentation in Raleigh, make some final code changes, push code for a pull request, and sign off.

Day 1:
  • Ran benchmarking tests to check and improve the performance of classification.
  • Zeroed in on the parts of the code that can be parallelized.
Day 2:
  • The HR team at LexisNexis Raleigh arranged for a session to practice and calibrate the talk before the final presentation.
  • Made a few changes to the slides and updated my resume to add the work done at LN Raleigh. Here it is
Day 3:
  • Final presentation at LN Raleigh. The talk went well and garnered some attention from folks in Raleigh. Here are the slides.
  • Met and greeted other interns and senior staff at LexisNexis Raleigh.

Day 4:
  • Final changes to the parallelization based on the regions zeroed in on Monday
  • Documentation for gradient boosting

Day 5:
  • Pushed code for a commit based on feedback from Dr. Holt.
  • Final acknowledgments for everyone.

Thanks to John Holt, Lorraine Chapman, and the HR teams at Alpharetta and Raleigh for giving me the opportunity and infrastructure to learn and showcase my talent. Cheers!!



Friday, August 4, 2017

Week 11

The plan for this week was to present at "The Download" tech talk and to optimize classification. The plan was also to model data from Scopus.

Day 1:
  • Model data from Scopus using the updated doc2vec technique.
  • Results now give a common platform to query both legislative and Scopus documents.
Day 2:
  • Practice Talk for the tech talk
  • Present at the tech talk
Day 3 and 4:
  • Probe classification to identify regions of parallelization
  • Parallelize independent regressions in classification.
Day 5:
  • Finish a few training assignments.
  • Prepare slides for final intern presentation.
  • Work on standardization.

Friday, July 28, 2017

Week 10

The plan for this week was to prepare slides for "The Download" tech talk, unify the categorical and continuous trees into the original push, and parallelize classification. The plan was also to fetch data from Scopus.

Day 1:
  • Work on slides for the Tech Talk.
Day 2:
  • Practice Talk for the tech talk
  • Combine Categorical and Continuous Trees with original Gradient Boosting Trees framework
Day 3 and 4:
  • Scraper to fetch data from Scopus.
  • Reading on projections and clustering vectors.
Day 5:
  • A few optimizations to the code and fixes for code-style violations.
  • Finished a few touch-ups to the slides for the tech talk.

Friday, July 21, 2017

Week 9

The plan for this week was to test continuous and categorical data for gradient boosting and to exercise the integrated regression and classification trees on larger datasets.

Day 1:
  • Did a study on the dataset provided by Roger Dev. It had 5000 records, 52 attributes, and 7 classes.
  • Developed a script to read this data into ECL and verify the algorithm (a sketch of such a script follows after this list).
Day 2:
  • On a parallel note, I spent this day studying Scopus data as part of my side project connecting legislative and research documents.
Day 3:
  • The dataset took a long time to run.
  • Spent the rest of the day debugging the problem.
  • Fixed the issue.
Day 4: 
  • Wrote a naive scraper to parse data from Scopus.
  • Need to fetch a larger volume of data; planning that over the weekend.
Day 5:
  • Started working on slides for the Tech Talk on Aug 1st.
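
Here is a minimal sketch of the kind of ECL script from Day 1 that reads the sprayed dataset back in and sanity-checks it. The logical file name, the field names, and the three-attribute layout are hypothetical placeholders; the real file has 52 attributes and a 7-class label.

    // Hypothetical sketch: read a sprayed CSV into ECL and sanity-check it.
    Layout := RECORD
        REAL8 attr1;
        REAL8 attr2;
        REAL8 attr3;        // the real layout has 52 attribute columns
        UNSIGNED1 category; // the class label (7 distinct values)
    END;

    raw := DATASET('~intern::gbt::sample_dataset', Layout, CSV(HEADING(1)));

    OUTPUT(COUNT(raw));                                          // expect 5000 records
    OUTPUT(TABLE(raw, {category, n := COUNT(GROUP)}, category)); // records per class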

Friday, July 14, 2017

Week 8

The plan for this week was to develop a decision tree that handles both continuous and categorical data for gradient boosting and to integrate the regression and classification trees.

Day 1:
  • Addressed feedback from my mentor on commits for Regression and Decision Trees
Day 2:
  • Developed the stub methods and created generic modules for future implementations of mixed decision-tree regression.
Day 3:
  • Combined splitting techniques for regression and classification
Day 4: 
  • Implement mixed trees
  • Plugin mixed trees to gradient boosting
  • Test for Gradient Boosting using mixed trees for classification and regression
Day 5:
  • Field Type generator to easily use default field types
  • Community service at a food bank for three-quarters of the day.

Friday, July 7, 2017

Week 7

A relatively short week due to the 4th of July break.

Day 1:
  • Implement tests for all the algorithms implemented over the last six weeks
Day 2: 
  • Fix a few leftover tests
  • Send code for review to Dr. Holt
Day 5:
  • Fix the recommended changes and send the code back again.

Friday, June 30, 2017

Week 6

The plan for this week was to develop a decision tree and a generic framework for gradient boosting, and to integrate the regression and classification trees.

Day 1 and 2(Research Days):
  • Studying the representation of categorical data in ECL
  • Reformatting the Gradient Boosting interface to support categorical data
  • Generated stub methods for Gradient Boosting for ordinal data.
Day 3:
  •  Implement partitioning function for categorical data for regression
Day 4: 
  • Complete implementation of the Regression Tree.
  • Plugin to Gradient Boosting Framework
Day 5:
  • Testing and verifying the framework.

Friday, June 23, 2017

Week 5

The plan for this week was to develop a decision tree and a generic framework for gradient boosting, and to integrate the regression and classification trees.

Day 1 and 2:
  • Developed the stub methods and created generic modules for future implementations of decision tree regression.
Day 3:
  • Encountered a bug where Gradient Boosting worked for linear regression but did not work for regression trees.
  • The module was working perfectly well when the Linear regression framework was plugged in but failed when the regression tree was plugged in.

Day 4: 
  • Debugged the bug I encountered yesterday.
  • Such a noob mistake: while incorporating inheritance, I was still using the linear regression Predict function for the regression tree.
  • Fixed it and implemented a generic framework for gradient boosting regression.
Day 5:
  • Similarly implemented a generic framework for gradient boosting classification.

Friday, June 16, 2017

Week 4

The plan for this week was to develop a decision tree for regression.

Day 1:
  • I spent the whole day studying how the decision tree is implemented in ECL. It is complex code, and since I am new to ECL it was a lot to take in.
  • I realized that the core logic of the splitting criteria is a very small part and most of the code is data transformation.
  • Once that sank in, I knew what to do.

Day 2:
  • Choosing the splitting criterion was the next challenge.
  • The simplest decision tree to implement, I think, is the standard-deviation-based ID3 algorithm, and I spent the day reading about it (a sketch of the idea is below).
  • I also implemented the stub methods for the decision tree.
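
To make the splitting idea concrete, here is a tiny hypothetical ECL sketch of the standard-deviation criterion: a candidate split is scored by how much it reduces the standard deviation of the target in the children relative to the parent. The data, the split point, and all names are made up; this is the textbook idea, not the project code.

    // Hypothetical sketch of standard-deviation-reduction splitting for a regression tree.
    Pt := RECORD
        REAL8 x;   // single predictor
        REAL8 y;   // continuous target
    END;
    pts := DATASET([{1.0, 5.0}, {2.0, 5.5}, {3.0, 20.0}, {4.0, 21.0}], Pt);

    sd(DATASET(Pt) d) := SQRT(VARIANCE(d, d.y));  // standard deviation of the target

    splitPoint := 2.5;                 // candidate split on x
    lhs := pts(x <= splitPoint);
    rhs := pts(x > splitPoint);

    // Weighted reduction in standard deviation; the best split maximizes this.
    reduction := sd(pts) - (COUNT(lhs) * sd(lhs) + COUNT(rhs) * sd(rhs)) / COUNT(pts);
    OUTPUT(reduction);

The real code evaluates this score over every candidate split and keeps the best one; the sketch only scores a single split.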

Day 3:
  • The day was spent implementing the splitting criterion, and I completed it.

Day 4: 
  • The first half of the day was spent attending a talk by Jamie Buckley, the Chief Product Officer.
  • He gave a nice talk on his journey in the tech industry and all the cutting-edge technology at LexisNexis.
  • I was really impressed by the work in Machine Learning, especially the Socrates project.
  • I spent the rest of the day completing the learning and prediction functions.

Day 5:
  • There was a small bug in the predict function and I spent some time fixing it.
  • I implemented a small test for verifying the implementation of Decision Tree and compared it against Linear Regression.

Friday, June 9, 2017

Week 3

The plan for this week was to develop gradient boosting for classification.

Day 1:
Day 2:
  • Rather simple day. I set up the basic skeleton and stubs for classification in gradient boosting.
  • I also had a call with Satya, who gave me access to a large legal dataset to run my text mining experiments on.
Day 3:
  • I ended up with a strange bug. In ECL, a definition cannot have the same name as a field of the dataset you are working with; I was getting unexpected results while filtering, and this was the reason. Vivek again helped me debug it (a small sketch of the gotcha follows after this list).
  • I completed the implementation of gradient boosting for classification.
Day 4: 
  • I wrote test cases for classification.
  • I went through the data given by Satya and tried to make sense of it.
Day 5:
  • I set up the environment to perform some text mining.
  • Downloaded the dataset and saved it in MongoDB.
  • It has over 1.5 million documents.
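
For the record, here is a minimal, hypothetical reconstruction of the kind of clash from Day 3: when a definition shares its name with a field of the dataset being filtered, the field name wins inside the filter, so the comparison is not the one intended. All names and values here are invented for illustration.

    // Hypothetical sketch of the naming clash from Day 3.
    Rec := RECORD
        UNSIGNED4 id;
        REAL8 score;
    END;
    ds := DATASET([{1, 0.5}, {2, 1.5}, {3, 2.5}], Rec);

    score := 2.0;                         // definition named like the field above
    surprising := ds(ds.score > score);   // inside the filter, 'score' binds to the field,
                                          // so each row is compared against itself
    OUTPUT(surprising);                   // 0 rows instead of the 1 row intended

    cutoff := 2.0;                        // a distinct name removes the ambiguity
    expected := ds(ds.score > cutoff);
    OUTPUT(expected);                     // the record with score 2.5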

Friday, June 2, 2017

Week 2 (Getting the feet wet)

Back to work on Tuesday. My conference went well, and I made some great research contacts.

Day 1:
  • My desktop finally arrived, so I had to set it up. Big task: I was so used to the UNIX ecosystem that I had actually forgotten how Windows feels.
  • Nevertheless, I spent the day setting up the environment and finishing all the training tutorials.

Day 2:
  • Started studying the effect of different loss functions on computing the gradient.
  • Came across a new loss function that accounts for outliers, the Huber loss. Would recommend it to all (a sketch of its gradient is below).
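
Here is a tiny hypothetical ECL sketch of the Huber pseudo-residual: small errors keep their squared-loss gradient, while large errors are clipped at delta, which is what keeps outliers from dominating the fit. This is the textbook form, not the code that ended up in the project.

    // Hypothetical sketch: negative gradient (pseudo-residual) of the Huber loss.
    REAL8 huberGrad(REAL8 residual, REAL8 delta) :=
        IF(ABS(residual) <= delta,
           residual,                               // quadratic region: pass the error through
           delta * IF(residual > 0, 1.0, -1.0));   // linear region: clip at +/- delta

    OUTPUT(huberGrad(0.3, 1.0));   // small error -> 0.3
    OUTPUT(huberGrad(7.5, 1.0));   // outlier     -> clipped to 1.0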

Day 3:
  • Realized that regression is required for both classification- and regression-based gradient boosting. In the case of classification methods, the gradient is computed on the probability of a class. Thus, we need regression methods to power classification methods (a short sketch of this follows after this list).
  • The Regression Tree is not implemented in ECL. I might have to implement it, and I am excited about this prospect.
  • Thought of building a proof of concept using a linear regression classifier.
  • Set up stub methods for Gradient Boosting Classifiers
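
To spell out the reasoning in the first bullet: with a logistic link, the negative gradient of the log loss is the residual y - p, a continuous quantity, so every boosting round ends up fitting a regression model to it. Below is a tiny hypothetical ECL sketch of that computation; the record layout and the numbers are made up.

    // Hypothetical sketch: pseudo-residuals for gradient boosting classification.
    Obs := RECORD
        UNSIGNED4 id;
        REAL8 y;   // true label, 0 or 1
        REAL8 f;   // current ensemble score (log-odds)
    END;
    obs := DATASET([{1, 1, 0.2}, {2, 0, -0.5}, {3, 1, 1.3}], Obs);

    Resid := RECORD
        UNSIGNED4 id;
        REAL8 residual;
    END;
    // p = 1 / (1 + EXP(-f)) is the predicted probability; y - p is the
    // negative gradient of the log loss, which the next regressor is fit to.
    residuals := PROJECT(obs,
        TRANSFORM(Resid,
                  SELF.id := LEFT.id,
                  SELF.residual := LEFT.y - 1.0 / (1.0 + EXP(-LEFT.f))));
    OUTPUT(residuals);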

Day 4: 
  • Completed the stub methods as planned.
  • Committed the source code to GitHub.

Thursday, June 1, 2017

Week 1

It's really strange that it was my first week and yet I was going to be there for only the first day. I had a paper to present at the International Conference on Software Engineering (ICSE 2017) and had to be out of town, so I was working remotely for the first week.

Week 0:

Since I was off for the first week, I decided to spend some time doing the stuff that you generally do in your first week.
  • I went through the manual for ECL and ECL-ML.
  • Setup the environment on my local machine.
    • This was a hard one. Initially, I thought of setting up on my Mac, but the support for the ECL IDE on a Mac is not great.
    • Alternatively, I could code in a browser or on the command line against a platform running in a VM.
    • The best option is to do things the way they were built to be done: ECL development is typically done on a Windows machine.
    • So I installed Windows on my machine using Boot Camp and followed the setup instructions on the HPCC Systems website. It worked seamlessly.
  • I played around in the ECL environment, finished the tutorials, and devised my plan for approaching the project.

Week 1:

The first day is always great. There was a great induction program with a bunch of other interns. Swag and free food. A small tour of the building, some paperwork, and I was done for the day (for the week :P). My desk space was really great. I appreciate the initiative of my mentors and the support of HR in both Raleigh and Alpharetta in creating a great work environment for me.

Data is required for any machine learning task. During my first week I was coding my first real program in ECL, and I developed a continuous and a discrete dataset from the UCI Machine Learning Repository to test my algorithms. I thought this would be easy, but like most languages ECL has a steep learning curve. A tiny sketch of what such a test dataset looks like in ECL is below.
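
For a flavour of what that looked like, here is a hypothetical miniature of such a test dataset in ECL, with one continuous and one discrete field; the real datasets came from the UCI repository and were much larger.

    // Hypothetical sketch of a tiny test dataset with a continuous and a discrete field.
    Sample := RECORD
        UNSIGNED4 id;
        REAL8     measurement;  // continuous attribute
        UNSIGNED1 bucket;       // discrete attribute
    END;
    testData := DATASET([{1, 3.7, 0},
                         {2, 1.2, 1},
                         {3, 4.9, 1},
                         {4, 0.8, 0}], Sample);
    OUTPUT(testData);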



Applying & Joining

I was pointed to the internship by Trish McCall.
About the Program:
  • HPCC Systems is a state-of-the-art solution for high-performance computing, and they have numerous open summer projects to choose from (https://goo.gl/ePbAcR).
  • It's a 10-12 week program, based on your availability.
  • I was interested in the ECL projects, particularly ECL-ML, since machine learning is part of my research and, more importantly, I would get to implement a state-of-the-art machine learning algorithm for an industrial application.
  • My attention was drawn to the Gradient Boosting Algorithms project. The prime requirement was to implement Gradient Boosting Trees for Classification and Regression.
  • I wrote a detailed 3-page proposal and sent it to the Project Manager (Dr. John Holt) and the Intern Program Manager (Lorraine Chapman).
  • Luckily, they liked my proposal and offered me a position over the summer.
  • The program is really flexible, and LN offers you a choice of working either from their Alpharetta office or from home. I chose to work from home, as there was a research paper I was working on over the summer and I needed access to my lab. Lorraine was very supportive of this.
  • As part of my research I also needed to access some data from LN servers, which required me to be physically present in an LN office. Lorraine, Dr. Villanustre, and Dr. Holt were very supportive, and they arranged a desk space for me in their Raleigh office.
Thus, I was all set to start my work @ LexisNexis for Summer 2017.