Monday, May 22, 2017

Load Data into Azure DW using C# in an SSIS script task

There are many reasons why SSIS may still be needed for loading data into the Azure DW platform. Even though PolyBase and Azure Data Factory are the primary options, here are the templates for the SSIS script task that were used, instead of the data flow task, to load data (full and incremental) into Azure DW for a specific customer:
SSIS Full Load script:
SSIS Incremental Load script:

There are a few reasons for this approach. One is that the existing package used a similar structure that was not to be deviated from. The other is that some key logging aspects needed to be handled in a legacy platform that could not be decommissioned at that time.

Tuesday, May 09, 2017

Why Azure might overtake AWS in its data services offerings

  1. Compliance

Microsoft enterprise cloud services are independently validated through certifications and attestations, as well as third-party audits. In-scope services within the Microsoft Cloud meet key international and industry-specific compliance standards, such as ISO/IEC 27001 and ISO/IEC 27018, FedRAMP, and SOC 1 and SOC 2. They also meet regional and country-specific standards and contractual commitments, including the EU Model Clauses, UK G-Cloud, Singapore MTCS, and Australia CCSL (IRAP). In addition, rigorous third-party audits, such as by the British Standards Institution and Deloitte, validate the adherence of Microsoft's cloud services to the strict requirements these standards mandate.

  2. Security

With security tightly integrated into its Active Directory (AD) offerings, Microsoft continues to evolve security and data integrity in Azure. The core security advantages of Azure are as follows:

  1. Tightly integrated with Windows Active Directory
  2. Simplified cloud access using Single Sign On
  3. Single and Multi Factor authentication support
  4. Rich protocol support, e.g. federated authentication (WS-Federation), SAML, OAuth 2.0 (version 1.0 still supported), OpenID Connect, Graph API, Web API 2.0 (in conjunction with Authentication_JSON_AppService.axd and the Authorize attribute)

  3. Data/BI Offerings in the Cloud

Microsoft Azure has most of the components required to support data-related and business intelligence workloads in various formats.

The core database and data collection/integration services in Azure are as follows:

  • Data Factory

Data Factory is a cloud-based data integration service that orchestrates and automates the movement and transformation of data. You can create data integration solutions using the Data Factory service that can ingest data from various data stores, transform/process the data, and publish the result data to the data stores.

The Data Factory service allows you to create data pipelines that move and transform data, and then run the pipelines on a specified schedule (hourly, daily, weekly, etc.). It also provides rich visualizations to display the lineage and dependencies between your data pipelines, and lets you monitor all your data pipelines from a single unified view to easily pinpoint issues and set up monitoring alerts.

  • SQL Server Integration Services (SSIS)

SSIS, a core tool already used for development across various teams, can move data between Azure and on-premises or other cloud environments as needed. SSIS can integrate databases and data warehouses in the Azure cloud and also enables template-based development with ease.

Typically compared with AWS data pipeline:
AWS Data Pipeline is a web service that helps you reliably process and move data between different AWS compute and storage services, as well as on-premise data sources, at specified intervals. With AWS Data Pipeline, you can regularly access your data where it’s stored, transform and process it at scale, and efficiently transfer the results to AWS services such as Amazon S3, Amazon RDS, Amazon DynamoDB, and Amazon EMR.
A pipeline schedules and runs tasks. You upload your pipeline definition to the pipeline, and then activate the pipeline. You can edit the pipeline definition for a running pipeline and activate the pipeline again for it to take effect. You can deactivate the pipeline, modify a data source, and then activate the pipeline again. When you are finished with your pipeline, you can delete it. Task Runner polls for tasks and then performs those tasks. For example, Task Runner could copy log files to Amazon S3 and launch Amazon EMR clusters. Task Runner is installed and runs automatically on resources created by your pipeline definitions. You can write a custom task runner application, or you can use the Task Runner application that is provided by AWS Data Pipeline.

  • Azure SQL Data Warehouse

Azure SQL Data Warehouse keeps its data in Azure storage (files and blobs), which makes it easy to bring dimensional data and measures together within the platform. Core documentation can be found here:

Typically Compared with AWS Redshift:
Redshift supports two kinds of sort keys: compound and interleaved. A compound sort key is a combination of multiple columns: one primary column and one or more secondary columns. A compound sort key helps with joins and WHERE conditions; however, performance drops when the query filters only on secondary columns without the primary column. A compound sort key is the default sort type. In an interleaved sort, each column is given equal weight. Both compound and interleaved sort keys require a re-index to keep query performance high. The architectures of these systems differ, but the end goal is the storage and processing of vast amounts of data, with results generated in seconds or milliseconds. Focusing on this aspect, I am going to give more detailed insight into Redshift, which is a node-based, petabyte-scale database, as well as a high-level overview of what I recently implemented.
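To make the compound sort key trade-off concrete, here is a minimal Python sketch. It is a toy model that keeps rows in a sorted list; it is not how Redshift stores or scans blocks, and the column names (region, order_day) and sample rows are hypothetical. It only illustrates why a filter on the leading column can skip most rows while a filter on a secondary column alone cannot.

```python
from bisect import bisect_left, bisect_right

# Hypothetical table sorted by a compound key: (region, order_day).
ROWS = sorted(
    (region, day)
    for region in ("APAC", "EMEA", "US")
    for day in range(1, 6)
)

def scan_by_primary(sorted_rows, region):
    """Filter on the leading column: binary search finds the matching range,
    so only the rows in that range are read."""
    lo = bisect_left(sorted_rows, (region,))
    hi = bisect_right(sorted_rows, (region, float("inf")))
    return sorted_rows[lo:hi], hi - lo

def scan_by_secondary(sorted_rows, day):
    """Filter on the secondary column only: the sort order does not help,
    so every row has to be inspected."""
    matches = [row for row in sorted_rows if row[1] == day]
    return matches, len(sorted_rows)

if __name__ == "__main__":
    _, rows_read = scan_by_primary(ROWS, "EMEA")
    print("rows read when filtering on region:", rows_read)
    _, rows_read = scan_by_secondary(ROWS, 3)
    print("rows read when filtering on order_day only:", rows_read)
```

In this toy model the second query still returns the right rows; it just has to look at the whole table to find them, which is the performance drop described above.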
Note: The above diagram is from the Redshift Warehousing article (

If you pay close attention to the diagram above, the compute nodes are responsible for all the data processing and query transactions on the database, and the data nodes contain the actual node slices. The Leader Node is the service that monitors the connections to the Redshift cluster and is also responsible for query processing (almost like a syntax checker for the query and the functions used). It then hands the query to the Compute Nodes, whose main responsibility is to find the data slices on the data nodes and communicate with one another to determine how the transaction should be executed. It is similar to a job broker, except that it operates in real time.

It is similar to the analogy of using a bucket. Consider this:

You take a bucket and keep filling it with water; eventually the bucket gets full. But what happens when there is an enormous amount of water to contain? Either grab a massive bucket or use multiple buckets to store the water (the second option is what the Redshift architecture does).

The concept of scaling up implies not only adding a bucket for storage but also a mechanism to ensure that the flow goes to an empty bucket, which is nothing but our compute node. But there is the price of the bucket, and the cost of the mechanism required to fill it as well.
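The "multiple buckets" idea can be sketched in a few lines of Python. This is only an illustration of hash-based distribution in general, under the assumption that each row has a key that decides which bucket (slice) it lands in; the function names and the CRC32 hash are my own choices, not Redshift internals.

```python
import zlib
from collections import defaultdict

def assign_slice(key: str, num_slices: int) -> int:
    """Map a row key to one of num_slices buckets using a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % num_slices

def distribute(keys, num_slices):
    """Spread keys across buckets; returns {slice_index: [keys]}."""
    slices = defaultdict(list)
    for key in keys:
        slices[assign_slice(key, num_slices)].append(key)
    return slices

if __name__ == "__main__":
    keys = [f"customer-{i}" for i in range(1000)]
    for n in (2, 4):
        sizes = {i: len(rows) for i, rows in sorted(distribute(keys, n).items())}
        print(f"{n} buckets ->", sizes)
```

Adding buckets lowers the amount each one has to hold, but every extra bucket and the routing step in front of it both come at a cost, which is the trade-off described above.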

The cons of using Redshift are as follows:
  • You need an administrator to monitor the Redshift cluster at all times
  • The power of Redshift lies in how the database is queried, so a SQL developer/DBA who understands the Redshift internals is definitely a must
  • Upfront cost is slightly on the higher side, but over time the cost is justified as more nodes are added to the cluster

Redshift's major advantage is that it supports both JDBC and ODBC drivers, so you can query the database from most standard SQL clients. The Postgres driver can be found as follows:

  • Azure Redis Cache
Azure Redis Cache is based on the popular open-source Redis cache. It gives you access to a secure, dedicated Redis cache, managed by Microsoft and accessible from any application within Azure.
Azure Redis Cache is available in the following tiers:
Basic—Single node, multiple sizes, ideal for development/test and non-critical workloads. The basic tier has no SLA.
Standard—A replicated cache in a two node Primary/Secondary configuration managed by Microsoft, with a high availability SLA.
Premium—The new Premium tier includes a high availability SLA and all the Standard-tier features and more, such as better performance over Basic or Standard-tier Caches, bigger workloads, disaster recovery, and enhanced security. Additional features include:
Redis persistence allows you to persist data stored in Redis cache. You can also take snapshots and back up the data which you can load in case of a failure.
Redis cluster automatically shards data across multiple Redis nodes, so you can create workloads of bigger memory sizes (greater than 53 GB) and get better performance.
Azure Virtual Network (VNET) deployment provides enhanced security and isolation for your Azure Redis Cache, as well as subnets, access control policies, and other features to further restrict access.
Basic and Standard caches are available in sizes up to 53 GB, and Premium caches are available in sizes up to 530 GB with more on request.
Azure Redis Cache helps your application become more responsive even as user load increases. It leverages the low-latency, high-throughput capabilities of the Redis engine. This separate, distributed cache layer allows your data tier to scale independently for more efficient use of compute resources in your application layer.
Unlike traditional caches which deal only with key-value pairs, Redis is popular for its highly performant data types. Redis also supports running atomic operations on these types, like appending to a string; incrementing the value in a hash; pushing to a list; computing set intersection, union and difference; or getting the member with highest ranking in a sorted set. Other features include support for transactions, pub/sub, Lua scripting, keys with a limited time-to-live, and configuration settings to make Redis behave more like a traditional cache.
Another key aspect to Redis success is the healthy, vibrant open source ecosystem built around it. This is reflected in the diverse set of Redis clients available across multiple languages. This allows it to be used by nearly any workload you would build inside of Azure.
Advantages -
ElastiCache Redis:
Simple query language - no complex features
Is (out-of-the-box) not reachable from other regions.
You're always limited to the amount of memory
Sharding over multiple instances is only possible within your application - Redis doesn't do anything here
You pay per instance no matter what the load or the number of requests is.
If you want redundancy of the data you need to set up replication (not possible between different regions)
You need to handle high availability yourself
*Redis/memcached are in-memory stores and should generally be faster than DynamoDB
*Exposed as an API

Typically compared with AWS DynamoDB
In DynamoDB, data is partitioned automatically by its hash key. That’s why you will need to choose a hash key if you’re implementing a GSI. The partitioning logic depends upon two things: table size and throughput.
DynamoDB supports the following data types:
Scalar – Number, String, Binary, Boolean, and Null.
Multi-valued – String Set, Number Set, and Binary Set.
Document – List and Map.
Has a query language which is able to do more complex things (greater than, between etc.)
Is reachable via an external internet-facing API (different regions are reachable without any changes or own infrastructure)
Permissions based on tables or even rows are possible
Scales to effectively unlimited data size
You pay per request -> low request numbers mean a smaller bill, high request numbers mean a higher bill
Reads and writes are priced differently
Data is stored redundantly by AWS in multiple facilities
DynamoDB is highly available out-of-the-box
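Since the partitioning logic depends on table size and throughput, a rough estimate can be sketched as below. The constants (about 3,000 read units, 1,000 write units and 10 GB per partition) are rule-of-thumb figures that AWS documentation of that period described; treat them, and the function name, as assumptions rather than current service behaviour.

```python
import math

# Assumed per-partition limits (historical rule of thumb, not a guarantee).
READ_UNITS_PER_PARTITION = 3000
WRITE_UNITS_PER_PARTITION = 1000
GB_PER_PARTITION = 10

def estimate_partitions(table_size_gb, read_units, write_units):
    """Estimate partition count as the larger of the throughput-driven
    and the size-driven requirement."""
    by_throughput = math.ceil(
        read_units / READ_UNITS_PER_PARTITION
        + write_units / WRITE_UNITS_PER_PARTITION
    )
    by_size = math.ceil(table_size_gb / GB_PER_PARTITION)
    return max(by_throughput, by_size, 1)

if __name__ == "__main__":
    print(estimate_partitions(table_size_gb=50, read_units=3000, write_units=1000))
```

The practical consequence is that provisioned throughput is divided across partitions, so a hash key that concentrates traffic on a few values can throttle even when the table as a whole has spare capacity.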
  • Azure Document DB
Planet-scale, highly-available NoSQL database service on Microsoft Azure. Azure DocumentDB is Microsoft’s multi-tenant distributed database service for managing JSON documents at Internet scale. DocumentDB is now generally available to Azure developers. In this paper, we describe the DocumentDB indexing subsystem. DocumentDB indexing enables automatic indexing of documents without requiring a schema or secondary indices. Uniquely, DocumentDB provides real-time consistent queries in the face of very high rates of document updates. As a multi-tenant service, DocumentDB is designed to operate within extremely frugal resource budgets while providing predictable performance and robust resource isolation to its tenants. This paper describes the DocumentDB capabilities, including document representation, query language, document indexing approach, core index support, and early production experiences
Azure DocumentDB guarantees less than 10 ms latencies on reads and less than 15 ms latencies on writes for at least 99% of requests. DocumentDB leverages a write-optimized, latch-free database engine designed for high-performance solid-state drives to run in the cloud at global scale. Read and write requests are always served from your local region while data can be distributed to other regions across the globe. Scale data throughput and storage independently and elastically—not just within one region, but across many geographically distributed regions. Add capacity to serve millions of requests per second at a fraction of the cost of other popular NoSQL databases.
Easily build apps at planet scale without the hassle of complex, multiple data center configurations. Designed as a globally distributed database system, DocumentDB automatically replicates all of your data to any number of regions worldwide. Your apps can serve data from a region closest to your users for fast, uninterrupted access.
Query using familiar SQL and JavaScript syntax over document and key-value data without dealing with schema or secondary indices, ever. Azure DocumentDB is a truly schema-agnostic database capable of automatically indexing JSON documents. Define your business logic as stored procedures, triggers, and user-defined functions entirely in JavaScript, and have them executed directly inside the database engine. Standardized SLAs for infrastructure throughput.
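The latency guarantee quoted above (under 10 ms for reads and under 15 ms for writes, for at least 99% of requests) is a percentile-style target. A minimal sketch of how one might check measured latencies against such a target follows; the function name and the sample numbers are hypothetical, and this is not part of any DocumentDB SDK.

```python
def meets_latency_target(latencies_ms, threshold_ms, required_fraction=0.99):
    """Return True if at least required_fraction of the samples
    completed in under threshold_ms."""
    if not latencies_ms:
        raise ValueError("no latency samples provided")
    within = sum(1 for value in latencies_ms if value < threshold_ms)
    return within / len(latencies_ms) >= required_fraction

if __name__ == "__main__":
    # Hypothetical read latencies: 99 fast requests and one slow outlier.
    reads = [4.0] * 99 + [42.0]
    print("reads within target:", meets_latency_target(reads, threshold_ms=10))
```

Note that a single slow request out of a hundred still satisfies a 99% target, which is why such guarantees say little about worst-case latency.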

Typically compared with AWS DynamoDB

  • Azure HDInsight

Easily spin up enterprise-grade, open source cluster types, guaranteed with the industry’s best 99.9% SLA and 24/7 support. Our SLA covers your entire Azure big data solution, not just the virtual machine instances. HDInsight is architected for full redundancy and high availability, including head node replication, data geo-replication, and built-in standby NameNode, making HDInsight resilient to critical failures not addressed in standard Hadoop implementations. Azure also offers cluster monitoring and 24/7 enterprise support backed by Microsoft and Hortonworks with 37 combined committers for Hadoop core (more than all other managed cloud providers combined), ready to support your deployment with the ability to fix and commit code back to Hadoop. Use rich productivity suites for Hadoop and Spark with your preferred development environment such as Visual Studio, Eclipse, and IntelliJ for Scala, Python, R, Java, and .NET support. Data scientists gain the ability to combine code, statistical equations, and visualizations to tell a story about their data through integration with the two most popular notebooks, Jupyter and Zeppelin. HDInsight is also the only managed cloud Hadoop solution with integration to Microsoft R Server. Multi-threaded math libraries and transparent parallelization in R Server handle up to 1000x more data and up to 50x faster speeds than open source R, helping you train more accurate models for better predictions than previously possible.

A thriving market of independent software vendors (ISVs) provides value-added solutions across the broader ecosystem of Hadoop. Because every cluster is extended with edge nodes and script action, HDInsight lets you spin up Hadoop and Spark clusters that are pre-integrated and pre-tuned with any ISV application out-of-the-box, including Datameer, Cask, AtScale, and StreamSets.

Integration with the Azure Load Balancer:

Azure Load Balancer delivers high availability and network performance to your applications. It is a Layer 4 (TCP, UDP) load balancer that distributes incoming traffic among healthy instances of services defined in a load-balanced set.

Azure Load Balancer can be configured to:

  • Load balance incoming Internet traffic to virtual machines. This configuration is known as Internet-facing load balancing.
  • Load balance traffic between virtual machines in a virtual network, between virtual machines in cloud services, or between on-premises computers and virtual machines in a cross-premises virtual network. This configuration is known as internal load balancing.
  • Forward external traffic to a specific virtual machine.

All resources in the cloud need a public IP address to be reachable from the Internet. The cloud infrastructure in Azure uses non-routable IP addresses for its resources. Azure uses network address translation (NAT) with public IP addresses to communicate to the Internet.
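As an illustration of what "distributing incoming traffic among healthy instances" at Layer 4 can look like, here is a small Python sketch of hash-based distribution. It assumes the flow is identified by source IP, source port, destination IP, destination port and protocol; the backend names, the SHA-256 hash and the data layout are hypothetical choices for the sketch, not Azure's implementation.

```python
import hashlib

def pick_backend(src_ip, src_port, dst_ip, dst_port, protocol, backends):
    """Choose a healthy backend for a flow by hashing its connection tuple.
    Packets of the same flow map to the same backend while the healthy
    set stays unchanged."""
    healthy = [b for b in backends if b["healthy"]]
    if not healthy:
        raise RuntimeError("no healthy backends available")
    flow = f"{src_ip}:{src_port}-{dst_ip}:{dst_port}-{protocol}".encode("utf-8")
    digest = hashlib.sha256(flow).digest()
    index = int.from_bytes(digest[:8], "big") % len(healthy)
    return healthy[index]["name"]

if __name__ == "__main__":
    # Hypothetical load-balanced set with one instance failing its probe.
    pool = [
        {"name": "vm-a", "healthy": True},
        {"name": "vm-b", "healthy": False},
        {"name": "vm-c", "healthy": True},
    ]
    print(pick_backend("203.0.113.7", 50123, "10.0.0.4", 443, "TCP", pool))
```

Because the unhealthy instance is filtered out before hashing, it never receives traffic until its health state changes.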

*not similar to the elastic emr load balancer

Shows live changes in analytics.

Azure in itself is very user-friendly, HDInsight is a great addition.

Works in tandem with other Microsoft technologies like Power BI and Excel seamlessly as end clients

When loading large volumes of data, issues in the data might stop the load midway, due to either data corruption errors or some latency issue. The entire load then needs to be restarted from the beginning.

  • Azure HDInsight is a service that provisions Apache Hadoop in the Azure cloud, providing a software framework designed to manage, analyze and report on big data.
  • HDInsight clusters are configured to store data directly in Azure Blob storage, which provides low latency and increased elasticity in performance and cost choices.
  • Unlike the first edition of HDInsight, it is now delivered on Linux, as Hadoop should be, which means access to HDP features. The cluster can be accessed via Ambari in the web browser, or directly via SSH.
  • HDInsight has always been an elastic platform for data processing. In today’s platform, it’s even more scalable. Not only can nodes be added and removed from a running cluster, but individual node size can be controlled which means the cluster can be highly optimized to run the specific jobs that are scheduled.
  • In its initial form, there were many options for developing HDInsight processing jobs. Today, however, there are really great options available that enable developers to build data processing applications in whatever environment they prefer. For Windows developers, HDInsight has a rich plugin for Visual Studio that supports the creation of Hive, Pig, and Storm applications. For Linux or Windows developers, HDInsight has plugins for both IntelliJ IDEA and Eclipse, two very popular open-source Java IDE platforms. HDInsight also supports PowerShell, Bash, and Windows command inputs to allow for scripting of job workflows.

Typically Compared with AWS EMR

Amazon EMR provides a managed Hadoop framework that makes it easy, fast, and cost-effective to process vast amounts of data across dynamically scalable Amazon EC2 instances. You can also run other popular distributed frameworks such as Apache Spark, HBase, Presto, and Flink in Amazon EMR, and interact with data in other AWS data stores such as Amazon S3 and Amazon DynamoDB.

Amazon EMR securely and reliably handles a broad set of big data use cases, including log analysis, web indexing, data transformations (ETL), machine learning, financial analysis, scientific simulation, and bioinformatics.

AWS EMR is more mature than HDInsight; however, HDInsight development continues to progress at a rapid pace compared to the phased approach of AWS EMR.

AWS EMR pricing is also higher than Azure HDInsight in terms of storage usage.

EMR release velocity is better than Azure HDInsight.

  • Amazon EMR provides a managed Hadoop framework that simplifies big data processing.
  • Other popular distributed frameworks such as Apache Spark and Presto can also be run in Amazon EMR.
  • Pricing of Amazon EMR is simple and predictable: you pay an hourly rate. A 10-node Hadoop cluster can be launched for as little as $0.15 per hour. Because Amazon EMR has native support for Amazon EC2 Spot and Reserved Instances, you can also save 50-80% on the cost of the underlying instances.
  • It is also popular because it is easy to use. When a cluster is launched on Amazon EMR, the web service allocates the virtual server instances and configures them with the needed software for you. Within minutes you can have a cluster configured and ready to run your Hadoop application.
  • It is resizable: the number of virtual servers in a cluster can easily be expanded or contracted depending on processing needs.
  • Amazon EMR integrates with popular business intelligence (BI) tools such as Tableau, MicroStrategy, and Datameer. For more information, see Use Business Intelligence Tools with Amazon EMR.
  • You can run Amazon EMR in an Amazon VPC in which you configure networking and security rules. Amazon EMR also supports IAM users and roles which you can use to control access to your cluster and permissions that restrict what others can do on the cluster. For more information, see Configure Access to the Cluster.
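The pricing bullet above can be turned into a quick back-of-the-envelope calculation. In this sketch the $0.15 per hour figure and the 50-80% savings range come from the text; the underlying instance cost is a made-up number, and the function is only a rough model that assumes the discount applies to the instance cost and not to the EMR fee.

```python
def estimate_cluster_cost(hours, emr_fee_per_hour, instance_cost_per_hour,
                          instance_discount=0.0):
    """Rough cost model: EMR fee plus (optionally discounted) instance cost."""
    if not 0.0 <= instance_discount < 1.0:
        raise ValueError("instance_discount must be in the range [0, 1)")
    discounted_instances = instance_cost_per_hour * (1.0 - instance_discount)
    return hours * (emr_fee_per_hour + discounted_instances)

if __name__ == "__main__":
    hours = 100
    emr_fee = 0.15        # per hour for the 10-node cluster, from the text
    instance_cost = 1.00  # hypothetical on-demand cost per hour for all nodes
    for discount in (0.0, 0.5, 0.8):
        total = estimate_cluster_cost(hours, emr_fee, instance_cost, discount)
        print(f"discount {discount:.0%}: ${total:.2f}")
```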

Wednesday, November 30, 2016

Modern Day Messaging Patterns

It's been a few days now that I have been focused on understanding modern day messaging patterns for a problem I am trying to solve. I do know that there are existing server-side tools like Active MQ, Rabbit MQ and even WMS that can do the trick and already have pre-defined patterns tested and validated for performance and security. In this case, though, I am not trying to reinvent the wheel by creating a new pattern or a new server-side product; I am trying to understand how these products were built and whether I can apply some of their principles in a server-side application I am writing. For example, in modern web application development, if .NET based, you have patterns like the ones defined here: Microsoft SOA patterns that do the neat tricks you would need. Man, at times I feel I am going at 300 miles an hour without any crash guards: code reviews, custom product development, customer engagements, team management, generating pipelines for my practices, reviewing and helping the team analyze problems, Big Data workloads, Machine Learning, my data science degree, etc. have consumed the majority of my life for the past 2.5 years. Thank goodness I am going to be taking an extended break early next year for my brother's wedding. Here is the gist of the things I am trying to solve, though:

1. Dynamic reformatting of my messages
2. Integration with Machine Learning to collaborate with specific scientific models (pre-created)
3. Templated messaging with dynamic parameterization
4. Distribution channel modification
5. AI Hub and Spoke messaging relay

Wednesday, October 19, 2016

Power BI To Embed Or Not To Embed

It is critical for organizations to work and play with data. Power BI, the reporting solution from Microsoft, is scorching the market with its rapid growth in usage. A quick note on interacting with a Power BI report like the following -->

This report is accessible by the public. To create a more personalized reporting structure with advanced security, Power BI Embedded is the way to go. Create a workspace collection in Azure and then generate the required API keys (two by default: primary and secondary). These API keys will be used by your web application. Once this is done, create the pbix solution file in the desktop tool and publish or import it into the Azure workspace using PowerShell, C#, Ruby, Java, etc. To interact with the pbix file in your application, you need to use the Power BI Embed APIs. However, there is another approach that uses the Power BI APIs instead of the Embed APIs. The Embed APIs are a pay-as-you-go service in which page views determine the pricing. The difference between an embedded Power BI report and a straightforward Power BI report is that with an embedded report you register the Power BI workspace.
But with a straightforward Power BI report, you publish to Azure the web application in which the report is consumed through the Power BI APIs. The core difference is that this requires more development than the previous method.

Wednesday, June 15, 2016

Microsoft acquires LinkedIn

On Monday 6/13/2016, Microsoft announced its acquisition of LinkedIn. This is a major game changer in the world of IT. But before we get to some of the advantages of this acquisition, it is worth noting that Microsoft was actually working on a LinkedIn killer on its Dynamics CRM platform. The idea was to generate a bigger footprint for its CRM solution as well as create something unique with it. This started in early 2012, well before the actual acquisition of LinkedIn. Here are my thoughts on where this acquisition will lead Microsoft and LinkedIn:

  • Microsoft gains a huge database of professionals and organizations in various streams: This alone is the most massive gain for Microsoft. It could start targeting professionals/organizations to either move onto or join the Microsoft platform, which can bolster its sales by a huge margin.
  • Microsoft integration of LinkedIn ads with Bing: Just imagine an organization trying to establish a marketing campaign. Now with LinkedIn ads and Bing ads integrated, an organization will have more opportunities to get page views or clicks and the potential to accelerate CTR's to conversions. This could come with a potential increase in cost of a campaign but it might make a really profitable decision.
  • Changes in technology trends at Microsoft will accelerate: Now since the foundation of LinkedIn is based on Cloud computing and Big Data, LinkedIn would have significantly made a lot of strides in terms of architectural and open source technology aspects. Now if these can be converted to products on the Azure suite, Microsoft might make a sizable profit on it. Also this will increase the trend of embracing open source vs closed technology stacks. (Azkaban, Voldemort, Increased usage of Kafka and Rabbit MQ etc...)
  • Added stream of Revenue: During an interview with Satya and Jeff, Satya did mention the integration with O365 and Azure as the major driver and Jeff mentioned that it made sense for LinkedIn to sell at this point. But I think the potential driver for this deal was the significant monetary gain that Microsoft will have in terms of ads and revenue generated from LinkedIn subscribers. LinkedIn had major competition coming its way in the form of more local or regional corporate social websites, but this deal gives it a major edge over its competitors due to Microsoft's global reach. Azure and O365 could potentially be outlets for Microsoft to create LinkedIn-based apps similar to Yammer.
  • Allow organizations to create internal social platforms using the technology gained by LinkedIn: Microsoft could potentially make a configurable LinkedIn app for all its devices inclusive of the XBOX that organizations can tap into and create internal social networking platforms.

It would be fun to predict where this acquisition will lead Microsoft. Probably a story to tell another day.

Tuesday, May 24, 2016

R - Notes

The following are basically my notes while studying R and are meant as a reference point for myself
Just a few pointers to anyone preparing for R or studying R:
  • Take a quick look at your statistical math basics before proceeding
  • Before applying any formula on your base data, try to understand what the formula is and how it was derived (this will make it easier for one to understand)
  • Use it in tandem with the Data Analysis in Excel
  • Refer to the cheat sheets available on
  • Segregate the workbench for each module
  • There are best practices that can be incorporated while programming in R
  • Try and jot notes when and where one can... 
  • Refer to existing data-sets embedded in R before jumping into a file
  • Refer to R programs written already in Azure ML

rnorm() by default has mean 0 and variance 1
head() has its own built in precision
*default settings in R can be modified by the options() function
options(digits = 15)
#will display 15 digits (Max digit for option display --> 22 and min digit --> 0): Error if > 22 --> Error in options(digits = 30) :
#invalid 'digits' parameter, allowed 0...22

#Infinity Operations
Inf/0 --> Inf
Inf * 0 --> NaN
Inf + 0 + (0/0) --> NaN
Inf + 0  --> Inf

*The ls() function lists all the variables stored in R memory at a given point in time
*rm() will remove contents from the list

*To figure out the commands in R use the following command ? followed by the function that needs to be leveraged:

*Functions and Datastructures

*Again single valued functions and multi valued functions

*A special vector is called a factor
gl() --> generate levels

*creating a function in R
test <- function(x) {
  return (x*x+(x^2))
}

*for loop in R

*lapply() vs sapply()

*Binding elements
rbind() --> bind elements in a matrix in a row manner
cbind() --> bind elements in a matrix in a columnar manner

*Every vector/matrix has a data mode....

*Can be found using mode()

*dimensions in matrices
=defines the number of rows and columns in a matrix

*can be used with dimnames(),rownames(),colnames()

*Navigating through R package libraries really bad....

*HMISC --> Harrell misc... Contains many functions useful for data analysis, high-level graphics, utility operations, functions for computing sample size and power, importing and annotating datasets, imputing missing values, advanced table making, variable clustering, character string manipulation, conversion of R objects to LaTeX code, and recoding variables.

*search() shows the R search path (attached packages); the working directory is separate

getwd() --> get working directory

*to read in a table format:
testfile <- read.table(filename)
read.fwf (fixed width file)

scan()--> reads a content of a file into a list or vector

*file() connections can create connections to files for read/write purposes
f1 <- file(...)
close(f1)--> close the file connection

base::sink                Send R Output to a File
dput() --> save complicated R objects (in ASCII format)
dget() --> inverse of dput()

*file in conjunction with open="w" option
R has its own internal binary object
use save() & load() for binary format

*RODBC Package
Common Functions

*specify the version of the driver TDS_Version=8.0 and which port to use default:1433.
query <- "select t1, t2 from test"
check dimensions of a table using dim()

*summary() -> gives a range of stats on the underlying vector,list,matrix

Use str() to display the structure of an R object

Log(dataframe) to investigate the data

Calculate Groups


Convert to frequency using prop.table()

Simulations in R
MCMC (Markov Chain Monte Carlo)
Performance Testing
Drawback --> Uncertainty

Pseudo Random Number Generator - The Mersenne Twister
Mersenne Prime


Uniform distribution - runif(5,min=1,max=2)
Normal distribution - rnorm(5,mean=2,sd=1)
Gamma distribution - rgamma(5,shape=2,rate=1)
Binomial distribution -rbinom(5,size=100,prob=.3)

Multinomial Distribution - rmultinom(5,size=100,prob=c(.2,.4,.7))

eruption.lm = lm(eruptions ~ waiting, data=faithful)
coeffs = coefficients(eruption.lm)
waiting = 80           # the waiting time 
duration = coeffs[1] + coeffs[2]*waiting 
duration --> Predicted value

load ggplot2 using library(ggplot2)

Compare models using ANOVA
X1 <- lm(y ~ x1 + x2 + x3 + x4, data=mydata)
Y1 <- lm(y ~ x1 + x2, data=mydata)
anova(X1, Y1)

Saturday, February 13, 2016

Hadoop Installation on Win 10 OS

Setting the Hadoop files prior to Spark installation on Win 10:
1. Ensure that your JAVA_HOME is properly set. A recommended approach here is to navigate to the installed Java folder in Program Files and copy the contents into a new folder you can locate easily, e.g. C:\Projects\Java.
2. Create a user variable called JAVA_HOME and enter "C:\Projects\Java"
3. Add to the path system variable the following entry: "C:\Projects\Java\Bin;"
4. Create a HADOOP_HOME variable and specify the root path that contains all the Hadoop files for eg:- "C:\Projects\Hadoop"
5. Add to the path variable the bin location for your Hadoop repository: "C:\Projects\Hadoop\bin" <Keep track of your Hadoop installs like C:\Projects\Hadoop\2_5_0\bin>
6. Once these variables are set, open command prompt as an administrator and run the following commands to ensure that everything is set correctly:
A] java
B] javac
C] hadoop
D] hadoop version
7. Also ensure your winutils.exe is in the Hadoop bin location.
< Download the same from ->
8. An error related to the configuration location might also occur. Add the following to the hadoop-env.cmd file to rectify the issue:
set HADOOP_PREFIX=C:\Projects\Hadoop

9. Another issue that I did face while leveraging Hadoop 2.6.0 install was the issue with the hadoop.dll. I had to recompile the source using MS VS to generate the hadoop.dll and pdb files and replaced the hadoop.dll which came along with the install.
10. Another error that I faced was "The system cannot find the batch label specified - nodemanager". Replace all the "\n" line endings in the Yarn.cmd file with "\r\n".
11. Also replace the "\n" line endings in the Hadoop.cmd file with "\r\n".

12. Yarn-site.xml change is as shown in the screenshot below:

13. Make changes to the core-site.xml as shown in the screenshot below:

14. Make the configuration changes as per the answer here :
15. Download Eclipse Helios for your Win OS to generate the JARs required for your map reduce applications. Use jdk1.7.0_71 and not the 1.8+ versions to compile your Hadoop MapReduce programs.
16. Kickstart your Hadoop DFS and YARN, add data from any of your data sources, and get ready to map reduce the heck out of it. <A quick note: after formatting your NameNode, the data defaults to a tmp folder that includes your machine name; in my case it is C:\tmp\hadoop-myPC\dfs\data>
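Two of the steps above lend themselves to a small helper script: checking the environment variables and winutils.exe (steps 1 to 7), and converting "\n" line endings to "\r\n" in the .cmd files (steps 10 and 11). The following Python sketch is one possible way to do that; the function names are my own, and it assumes the variables and paths described in the steps.

```python
import ntpath
import os

def check_hadoop_env(env=os.environ, exists=os.path.exists):
    """Return a list of problems found with the variables from the steps above."""
    problems = []
    for var in ("JAVA_HOME", "HADOOP_HOME"):
        if not env.get(var):
            problems.append(f"{var} is not set")
    hadoop_home = env.get("HADOOP_HOME")
    if hadoop_home:
        # ntpath builds Windows-style paths regardless of the host OS.
        winutils = ntpath.join(hadoop_home, "bin", "winutils.exe")
        if not exists(winutils):
            problems.append(f"winutils.exe not found at {winutils}")
    return problems

def to_crlf(text):
    """Convert line endings to CRLF without doubling existing CRLF pairs."""
    return text.replace("\r\n", "\n").replace("\n", "\r\n")

def convert_file_to_crlf(path):
    """Rewrite a file (for example yarn.cmd or hadoop.cmd) with CRLF endings."""
    with open(path, "rb") as handle:
        data = handle.read()
    fixed = data.replace(b"\r\n", b"\n").replace(b"\n", b"\r\n")
    with open(path, "wb") as handle:
        handle.write(fixed)

if __name__ == "__main__":
    for problem in check_hadoop_env():
        print("WARNING:", problem)
```

Back up the .cmd files before rewriting them, since convert_file_to_crlf modifies the file in place.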