Friday, June 30, 2017


Week 6



The plan for this week was to extend the gradient boosting framework to support categorical data and to plug in a regression tree for it.

Days 1 and 2 (research days):
  • Studied the representation of categorical data in ECL.
  • Reformatted the gradient boosting interface to support categorical data.
  • Generated stub methods for gradient boosting on ordinal data.
Day 3:
  • Implemented the partitioning function for categorical data in regression (a sketch of the idea follows the plan).
Day 4:
  • Completed the implementation of the regression tree.
  • Plugged it into the gradient boosting framework.
Day 5:
  • Tested and verified the framework.
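
For the curious, here is a minimal Python sketch (not the ECL code; the function name is just illustrative) of the idea behind the categorical partitioning: order the category values by their mean target and score candidate binary partitions by weighted variance.

```python
# Illustrative sketch of a categorical split for a regression tree:
# categories are ordered by mean target, then prefix partitions are
# scored by the weighted variance of the two sides.
import numpy as np

def best_categorical_split(categories, y):
    cats = np.asarray(categories)
    y = np.asarray(y, dtype=float)
    # Order the category values by the mean of the target within each category.
    order = sorted(np.unique(cats), key=lambda c: y[cats == c].mean())
    best_set, best_score = None, np.inf
    for i in range(1, len(order)):
        left_set = set(order[:i])
        mask = np.isin(cats, list(left_set))
        left, right = y[mask], y[~mask]
        score = len(left) * left.var() + len(right) * right.var()
        if score < best_score:
            best_set, best_score = left_set, score
    return best_set, best_score

# Toy example: 'a' and 'b' carry low targets, 'c' carries high targets.
print(best_categorical_split(['a', 'b', 'a', 'c', 'c', 'b'],
                             [1.0, 1.2, 0.9, 5.0, 5.5, 1.1]))
```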

Friday, June 23, 2017


Week 5



The plan for this week was to develop a generic framework for gradient boosting and to integrate the regression and classification trees into it.

Days 1 and 2:
  • Developed stub methods and created generic modules for future implementations of decision tree regression.
Day 3:
  • Encountered a bug: gradient boosting worked for regression but not with regression trees.
  • The module worked perfectly when the linear regression learner was plugged in but failed when the regression tree was plugged in.

Day 4: 
  • Tracked down the bug I encountered yesterday.
  • Such a noob mistake: while setting up the inheritance, I was using the linear regression Predict function for the regression tree.
  • Fixed it and implemented a generic framework for gradient boosting regression.
Day 5:
  • Similarly implemented a generic framework for gradient boosting classification; a sketch of the pluggable design follows.
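
A minimal Python sketch of the pluggable design I was aiming for; the class and method names are illustrative stand-ins, not the ECL-ML interface. Any base learner that exposes fit/predict can be dropped in.

```python
# Sketch of a generic gradient boosting regressor with a pluggable base learner.
import numpy as np

class GradientBoostingRegressor:
    def __init__(self, make_learner, n_rounds=50, learning_rate=0.1):
        self.make_learner = make_learner      # factory for a fresh base learner
        self.n_rounds = n_rounds
        self.learning_rate = learning_rate
        self.learners = []

    def fit(self, X, y):
        y = np.asarray(y, dtype=float)
        self.base = float(y.mean())           # initial constant prediction
        pred = np.full(len(y), self.base)
        for _ in range(self.n_rounds):
            residual = y - pred               # negative gradient of squared loss
            learner = self.make_learner()     # any learner with fit/predict plugs in
            learner.fit(X, residual)
            pred += self.learning_rate * learner.predict(X)
            self.learners.append(learner)
        return self

    def predict(self, X):
        pred = np.full(len(X), self.base)
        for learner in self.learners:
            pred += self.learning_rate * learner.predict(X)
        return pred

class MeanLearner:
    """Trivial stand-in base learner: predicts the mean of its training targets."""
    def fit(self, X, y):
        self.mu = float(np.mean(y))
        return self
    def predict(self, X):
        return np.full(len(X), self.mu)

model = GradientBoostingRegressor(MeanLearner).fit([[0], [1], [2]], [1.0, 2.0, 3.0])
print(model.predict([[1]]))
```

Swapping MeanLearner for a regression tree (or, in the ECL version, whichever module implements the same interface) is the whole point of the generic framework.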

Friday, June 16, 2017


Week 4



The plan for this week was to develop a decision tree for regression.

Day 1:
  • I spent the whole day studying how the decision tree is implemented in ECL. It is a complex piece of code, and since I am new to ECL there was a lot to take in.
  • I realized that the core logic of the splitting criterion is a very small part; most of the code is data transformation.
  • Once that sank in, I knew what to do.

Day 2:
  • Choosing the splitting criterion was the next challenge.
  • The simplest decision tree to implement, I think, is the standard-deviation-based ID3 algorithm, so I spent the day reading about it.
  • I also implemented the stub methods for the decision tree.

Day 3:
  • The day was spent implementing the splitting criterion, and I completed it (a rough sketch follows).
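
Roughly, the criterion picks the threshold that yields the largest drop in the standard deviation of the target. A small Python sketch of the idea (not the ECL implementation):

```python
import numpy as np

def std_reduction_split(x, y):
    """Pick the threshold on one numeric feature that maximizes the
    reduction in weighted standard deviation of the target y."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    base = y.std()
    best_thr, best_gain = None, 0.0
    for thr in np.unique(x)[:-1]:            # candidates; keeps both sides non-empty
        left, right = y[x <= thr], y[x > thr]
        weighted = (len(left) * left.std() + len(right) * right.std()) / len(y)
        gain = base - weighted               # standard deviation reduction
        if gain > best_gain:
            best_thr, best_gain = thr, gain
    return best_thr, best_gain

print(std_reduction_split([1, 2, 3, 10, 11], [1.0, 1.1, 0.9, 5.0, 5.2]))
```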

Day 4: 
  • The first half of the day was spent attending a talk by Jamie Buckley, the Chief Product Officer.
  • He gave a nice talk on his journey in the tech industry and all the cutting-edge technology at LexisNexis.
  • I was really impressed by the work in Machine Learning especially the Socrates project.
  • I spent the rest of the day completing the learning and prediction functions.

Day 5:
  • There was a small bug in the predict function and I spent some time fixing it.
  • I implemented a small test to verify the decision tree implementation and compared it against linear regression (a sanity-check sketch follows).
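
To illustrate the kind of sanity check, here is a rough sketch using scikit-learn as a stand-in for the ECL implementations: on a clearly non-linear target, the tree should beat the straight line.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)   # non-linear relation

tree = DecisionTreeRegressor(max_depth=4).fit(X, y)
line = LinearRegression().fit(X, y)
print("tree MSE:  ", mean_squared_error(y, tree.predict(X)))
print("linear MSE:", mean_squared_error(y, line.predict(X)))
```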

Friday, June 9, 2017


Week 3



The plan for this week was to develop gradient boosting for classification.

Day 1:
Day 2:
  • A rather simple day. I set up the basic skeleton and stubs for classification in gradient boosting.
  • I also had a call with Satya, who gave me access to a large legal dataset to run my text mining experiments on.
Day 3:
  • I ended up with a strange bug. In ECL, a definition cannot have the same name as a record attribute; I was getting unexpected results while filtering, and this turned out to be the reason. Vivek again helped me debug it.
  • I completed the implementation of gradient boosting for classification.
Day 4: 
  • I wrote test cases for classification.
  • I went through the data given by Satya and tried to make sense of it.
Day 5:
  • I set up the environment to perform some text mining.
  • Downloaded the dataset and saved it in MongoDB (a rough loading sketch follows).
  • It has over 1.5 million documents.
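
For reference, a rough sketch of how the documents could be loaded into MongoDB with pymongo; the database/collection names and the input file are placeholders, not the actual setup.

```python
# Hypothetical loader: stream a JSON-lines dump into MongoDB in batches.
import json
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
collection = client["legal_corpus"]["documents"]   # placeholder names

with open("documents.jsonl") as fh:                # one JSON document per line
    batch = []
    for line in fh:
        batch.append(json.loads(line))
        if len(batch) == 10000:
            collection.insert_many(batch)
            batch = []
    if batch:
        collection.insert_many(batch)

print(collection.count_documents({}))
```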

Friday, June 2, 2017


Week 2 (Getting the feet wet)



Back to work on Tuesday. My conference went well, and I made some great research contacts.

Day 1:
  • My desktop finally arrived, so I had to set it up. A big task: I was so used to the UNIX ecosystem that I had actually forgotten how Windows feels.
  • Nevertheless, I spent the day setting up the environment and finishing all the training tutorials.

Day 2:
  • Started studying the effect of different loss functions on computing the gradient.
  • Came across a loss function, new to me, that handles outliers gracefully: the Huber loss. Would recommend it to all (a sketch follows).
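
A small Python sketch of the Huber loss and its negative gradient (the pseudo-residual a boosting round would fit); plain numpy, just to show how an outlier's pull is capped at delta.

```python
import numpy as np

def huber_loss(y, pred, delta=1.0):
    """Quadratic near zero, linear for large errors -> robust to outliers."""
    err = y - pred
    return np.where(np.abs(err) <= delta,
                    0.5 * err ** 2,
                    delta * (np.abs(err) - 0.5 * delta))

def huber_negative_gradient(y, pred, delta=1.0):
    """Pseudo-residuals for boosting with the Huber loss."""
    err = y - pred
    return np.where(np.abs(err) <= delta, err, delta * np.sign(err))

y = np.array([1.0, 2.0, 50.0])            # last point is an outlier
pred = np.array([1.2, 1.8, 3.0])
print(huber_negative_gradient(y, pred))   # the outlier's gradient is capped at delta
```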

Day 3:
  • Realized that regression is required for both classification- and regression-based gradient boosting. For classification, the gradient is computed on the predicted probability of a class, so we need regression methods to power the classification methods (see the sketch after this list).
  • The regression tree is not implemented in ECL. I might have to implement it myself, and I am excited about the prospect.
  • Thought of building a proof of concept using a classifier built on linear regression.
  • Set up stub methods for gradient boosting classifiers.
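
To make that concrete, here is a tiny Python sketch (illustrative only, not the ECL code) of one boosting step for binary classification with the log loss: the pseudo-residuals y - p are real-valued, which is exactly what a regression learner can fit.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

y = np.array([0.0, 1.0, 1.0, 0.0])   # class labels
score = np.zeros_like(y)             # current log-odds F(x), starts at 0
p = sigmoid(score)                   # predicted class probabilities
pseudo_residuals = y - p             # negative gradient of the log loss
print(pseudo_residuals)              # real-valued -> fit a regression learner to these
```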

Day 4: 
  • Completed the stub methods as planned.
  • Committed the source code to GitHub.

Thursday, June 1, 2017


Week 1

It's really strange that it was my first week and I was only going to be there for the first day. I had a paper to present at the International Conference on Software Engineering (ICSE 2017) and had to be out of town, so I was working remotely for the first week.

Week 0:

Since I was away for the first week, I decided to spend some time doing the things you generally do in your first week.
  • I went through the manual for ECL and ECL-ML.
  • Set up the environment on my local machine.
    • This was a hard one. Initially, I thought of setting it up on my Mac, but ECL IDE support on a Mac is not great.
    • Alternatively, I could code in a browser or at the command line running inside a VM.
    • The best option is to do things the way they were built to be done: ECL development is typically done on a Windows machine.
    • So I installed Windows on my machine using Boot Camp and followed the setup instructions on the HPCC Systems website. It worked seamlessly.
  • I played around with the ECL environment, finished the tutorials, and devised my plan for approaching the project.

Week 1:

The first day is always great. There was a great induction program with a bunch of other interns: swag and free food, a small tour of the building, some paperwork, and I was done for the day (for the week :P). My desk space was really great. I appreciate the initiative of my mentors and the support of HR in both Raleigh and Alpharetta in creating a great work environment for me.

Data is required for any machine learning task. During my first week I was writing my first real program in ECL: I prepared a continuous and a discrete dataset from the UCI Machine Learning Repository to test my algorithms. I thought this would be easy, but like most languages ECL has a steep learning curve.




Applying & Joining


I was pointed to the internship by Trish McCall.
About the Program:
  • HPCC Systems is a state-of-the-art solution for high-performance computing, and they have numerous open summer projects to choose from (https://goo.gl/ePbAcR).
  • It is a 10-12 week program, depending on your availability.
  • I was interested in the ECL projects, particularly ECL-ML, since machine learning is part of my research and, more importantly, I would get to implement a state-of-the-art machine learning algorithm for an industrial application.
  • My attention was drawn to the Gradient Boosting Algorithms project. The prime requirement was to implement Gradient Boosting Trees for Classification and Regression.
  • I wrote a detailed three-page proposal and sent it to the project manager (Dr. John Holt) and the intern program manager (Lorraine Chapman).
  • Luckily, they liked my proposal and offered me a position over the summer.
  • The program is really flexible, and LN offers you a choice of working either from their Alpharetta office or from home. I chose to work from home, as there was a research paper I was working on over the summer and I needed access to my lab. Lorraine was very supportive of this.
  • As part of my research I also needed to access some data from LN servers, which required me to be physically present in an LN office. Lorraine, Dr. Villanustre, and Dr. Holt were very supportive and arranged desk space for me in their Raleigh office.
Thus, I was all set to start my work @ LexisNexis for Summer 2017.