Tuesday, September 28, 2010

probability: problem-set#3

My solutions for problem-set#3 from the MIT introductory probability course.

Monday, September 27, 2010

probability: problem-set#2

My solutions for problem-set#2 from the MIT introductory probability course.

Wednesday, September 22, 2010

mistakes made in a DDL script and lessons learned

Last year, I got the opportunity to create a java web application from scratch. Here, I want to talk about what I should have done differently when the first DDL script for the app was created.

I'm correcting all these things now, but this is the technical debt that could have been avoided.

note: some of the jargon might be Oracle specific

1. Create separate tablespaces for tables, indexes and LOBs.
This is standard practice and helps DBAs with many aspects of managing the db. The schema I created had USERS as its default tablespace, and none of the CREATE TABLE, CREATE INDEX etc. statements said anything about tablespaces, so everything ended up in the USERS tablespace.

2. Give upfront thought to version upgrades from the db perspective.
Once your application goes to production and the db gets loaded with a lot of data, it becomes rather difficult to upgrade the schema from the previous version of the application to the newer one. It is very important to have your migration strategy thought through and in place from the beginning.

3. Do not ignore the "ON DELETE" clause when creating foreign key constraints.
In Oracle, you get two options here - ON DELETE SET NULL and ON DELETE CASCADE.
You should choose one over the other based on how you want the data in the table to evolve.

4. Keep two users to manage your db.
One is the schema owner, who runs all the DDL, and the other is the schema user, who has just enough privileges to run the application. The application should connect only as the "schema user", to keep application code from getting any unwanted access to the db. Typically, the schema user will need the following privileges (a sketch covering points 1, 3 and 4 follows this list).

- CREATE SESSION
- CREATE SYNONYM
- SELECT/UPDATE/INSERT/DELETE privilege on all tables
- SELECT privilege on all SEQUENCEs
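
To make points 1, 3 and 4 concrete, here is a minimal sketch of what such a script could look like, run over plain JDBC as the schema owner. All the names (orders, customers, app_data, app_idx, app_user) are hypothetical; this is an illustration, not the original script:

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

public class SchemaSetup {
    // Hypothetical DDL; table, tablespace and user names are made up.
    private static final String[] DDL = {
        // point 1: put the table and its index into dedicated tablespaces
        "CREATE TABLE orders ("
            + " id NUMBER PRIMARY KEY,"
            + " customer_id NUMBER NOT NULL"
            + ") TABLESPACE app_data",
        "CREATE INDEX orders_cust_idx ON orders(customer_id) TABLESPACE app_idx",
        // point 3: state explicitly what happens to child rows on delete
        "ALTER TABLE orders ADD CONSTRAINT orders_cust_fk"
            + " FOREIGN KEY (customer_id) REFERENCES customers(id) ON DELETE CASCADE",
        // point 4: the application user gets DML privileges only, no DDL
        "GRANT SELECT, INSERT, UPDATE, DELETE ON orders TO app_user"
    };

    // Run the script as the schema owner over a plain JDBC connection.
    public static void run(Connection conn) throws SQLException {
        try (Statement stmt = conn.createStatement()) {
            for (String sql : DDL) {
                stmt.execute(sql);
            }
        }
    }
}
```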

Tuesday, September 21, 2010

Similarities Between Discrete and Continuous Random Variables

From the first three chapters of Introduction to Probability, it is pretty clear that many properties of both kinds of random variables turn out to be strikingly similar. This post is an attempt to consolidate them for easy reference.

Note: In the following, the $A_i$s are positive-probability events that form a partition of the sample space unless otherwise stated.

Basic Expectation Related
- E[X]: $\sum_x xp_X(x)$ (discrete), $\int xf_X(x)dx$ (continuous)
- E[g(X)]: $\sum_x g(x)p_X(x)$ (discrete), $\int g(x)f_X(x)dx$ (continuous)
- if Y = aX + b: E[Y] = aE[X] + b (both)
- if Y = aX + b: var(Y) = $a^2$var(X) (both)

Joint PMFs/PDFs
- Marginal PMF/PDF: $p_X(x)$ = $\sum_y p_{X,Y}(x,y)$ (discrete), $f_X(x)$ = $\int f_{X,Y}(x,y)dy$ (continuous)
- E[g(X,Y)]: $\sum_x \sum_y g(x,y)p_{X,Y}(x,y)$ (discrete), $\int \int g(x,y)f_{X,Y}(x,y)dxdy$ (continuous)
- if g(X,Y) = aX + bY + c: E[g(X,Y)] = aE[X] + bE[Y] + c (both)
- Conditional PMF/PDF: $p_{X,Y}(x,y)$ = $p_Y(y)p_{X|Y}(x|y)$ (discrete), $f_{X,Y}(x,y)$ = $f_Y(y)f_{X|Y}(x|y)$ (continuous)

Conditional Expectation
- E[X|A]: $\sum_x xp_{X|A}(x)$ (discrete), $\int xf_{X|A}(x)dx$ (continuous)
- E[g(X)|A]: $\sum_x g(x)p_{X|A}(x)$ (discrete), $\int g(x)f_{X|A}(x)dx$ (continuous)
- E[X|Y=y]: $\sum_x xp_{X|Y}(x|y)$ (discrete), $\int xf_{X|Y}(x|y)dx$ (continuous)
- E[g(X)|Y=y]: $\sum_x g(x)p_{X|Y}(x|y)$ (discrete), $\int g(x)f_{X|Y}(x|y)dx$ (continuous)
- E[g(X,Y)|Y=y]: $\sum_x g(x,y)p_{X|Y}(x|y)$ (discrete), $\int g(x,y)f_{X|Y}(x|y)dx$ (continuous)

Total Probability Related
- PMF/PDF from a partition: $p_X(x)$ = $\sum_{i = 1}^n P(A_i)p_{X|A_i}(x)$ (discrete), $f_X(x)$ = $\sum_{i = 1}^n P(A_i)f_{X|A_i}(x)$ (continuous)
- P(A): $\sum_x P(A|X=x)p_X(x)$ (discrete), $\int P(A|X=x)f_X(x)dx$ (continuous)

Total Expectation Related
- E[X]: $\sum_y p_Y(y)E[X|Y=y]$ (discrete), $\int f_Y(y)E[X|Y=y]dy$ (continuous)
- E[g(X)]: $\sum_y p_Y(y)E[g(X)|Y=y]$ (discrete), $\int f_Y(y)E[g(X)|Y=y]dy$ (continuous)
- E[g(X,Y)]: $\sum_y p_Y(y)E[g(X,Y)|Y=y]$ (discrete), $\int f_Y(y)E[g(X,Y)|Y=y]dy$ (continuous)
- E[X] = $\sum_{i=1}^n P(A_i)E[X|A_i]$ (both)
- E[g(X)] = $\sum_{i=1}^n P(A_i)E[g(X)|A_i]$ (both)
- if the $A_i$s form a partition of an event B: E[X|B] = $\sum_{i=1}^n P(A_i|B)E[X|A_i\cap B]$ (both)

X and Y Independent
- $p_{X,Y}(x,y)$ = $p_X(x)p_Y(y)$ (discrete), $f_{X,Y}(x,y)$ = $f_X(x)f_Y(y)$ (continuous)
- E[XY] = E[X]E[Y] (both)
- E[g(X)h(Y)] = E[g(X)]E[h(Y)] (both)
- var(X + Y) = var(X) + var(Y) (both)



Some Continuous Random Variables

From Introduction to Probability.

Continuous Uniform Over [a,b]:

$f_X(x)$ = $\frac{1}{b-a}$, if $a \leq x \leq b$, and 0 otherwise

E[X] = $\frac{a+b}{2}$
var(X) = $\frac{(b-a)^2}{12}$
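
For instance, the mean follows directly from the definition of expectation:
E[X] = $\int_a^b \frac{x}{b-a}dx$ = $\frac{b^2-a^2}{2(b-a)}$ = $\frac{a+b}{2}$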

Exponential with Parameter $\lambda$:

$f_X(x)$ = $\lambda e^{-\lambda x}$, if $x \geq 0$, and 0 otherwise

$F_X(x)$ = $1 - e^{-\lambda x}$, if $x \geq 0$, and 0 otherwise

E[X] = $\frac{1}{\lambda}$
var(X) = $\frac{1}{{\lambda}^2}$
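
As a quick check, integration by parts gives the mean:
E[X] = $\int_0^{\infty} x\lambda e^{-\lambda x}dx$ = ${[-xe^{-\lambda x}]}_0^{\infty} + \int_0^{\infty} e^{-\lambda x}dx$ = $0 + \frac{1}{\lambda}$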

Normal with Parameters $\mu$ and ${\sigma}^2 > 0$:

$f_X(x)$ = $\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{{(x-\mu)}^2}{2{\sigma}^2}}$

E[X] = $\mu$
var(X) = ${\sigma}^2$

Introduction to probability, ch#3 (General Random Variables) notes - Part-2

-- Multiple Continuous Random Variables: --
Let X and Y be jointly continuous random variables with joint PDF $f_{X,Y}$

The joint, marginal and conditional PDFs are related as follows:
$f_{X,Y}(x,y)$ = $f_Y(y)f_{X|Y}(x|y)$
$f_X(x)$ = $\int_{-\infty}^{\infty}f_{X,Y}(x,y)dy$ = $\int_{-\infty}^{\infty}f_Y(y)f_{X|Y}(x|y)dy$
The conditional PDF $f_{X|Y}(x|y)$ is defined only for those y for which $f_Y(y) > 0$.

These PDFs can be used to calculate probabilities as follows:
$P((X,Y) \in B)$ = $\int \int_{(x,y) \in B} f_{X,Y}(x,y)dxdy$
$P(X \in A)$ = $\int_{x \in A} f_X(x)dx$
$P(X \in A | Y=y)$ = $\int_{x \in A} f_{X|Y}(x|y)dx$

They can be used to calculate expectations:
E[g(X)] = $\int g(x)f_X(x)dx$
E[g(X,Y)] = $\int \int g(x,y)f_{X,Y}(x,y)dxdy$
E[g(X)|Y=y] = $\int g(x)f_{X|Y}(x|y)dx$
E[g(X,Y)|Y=y] = $\int g(x,y)f_{X|Y}(x|y)dx$

For any event A, we have the following version of the total probability theorem:
P(A) = $\int P(A | Y=y)f_Y(y)dy$

And there are the following versions of the total expectation theorem:
E[X] = $\int E[X|Y=y]f_Y(y)dy$
E[g(X)] = $\int E[g(X)|Y=y]f_Y(y)dy$
E[g(X,Y)] = $\int E[g(X,Y)|Y=y]f_Y(y)dy$

Independence:
Continuous random variables X and Y are independent iff
$f_{X,Y}(x,y)$ = $f_X(x)f_Y(y)$ for all x,y

and the following properties hold for independent X and Y:
E[XY] = E[X]E[Y]
E[g(X)h(Y)] = E[g(X)]E[h(Y)]
var(X + Y) = var(X) + var(Y)

-- Bayes' Rule for Continuous Random Variables --
Let Y be a continuous random variable.

If X is also a continuous random variable, then
$f_{X|Y}(x|y)$ = $\frac{f_X(x)f_{Y|X}(y|x)}{f_Y(y)}$ = $\frac{f_X(x)f_{Y|X}(y|x)}{\int f_X(t)f_{Y|X}(y|t)dt}$

If A is an event, we have
P(A|Y=y)$f_Y(y)$ = P(A)$f_{Y|A}(y)$

So, expanding $f_Y(y)$ via the total probability theorem, P(A|Y=y) = $\frac{P(A)f_{Y|A}(y)}{P(A)f_{Y|A}(y) + P(A^c)f_{Y|A^c}(y)}$

If N is a discrete random variable, then we can choose A to be the event N = n, and we get
P(N=n|Y=y)$f_Y(y)$ = $p_N(n)f_{Y|N}(y|n)$

So, P(N=n|Y=y) = $\frac{p_N(n)f_{Y|N}(y|n)}{\sum_i p_N(i)f_{Y|N}(y|i)}$


-- Derived Distributions: --
This describes how to calculate the PDF of Y, given that Y = g(X) and the PDF of X is known.

The approach is to calculate the CDF of Y and then differentiate it.

$F_Y(y)$ = $P(g(X) \leq y)$ = $\int_{x|g(x) \leq y} f_X(x) dx$
and, $f_Y(y)$ = $\frac{dF_Y}{dy}(y)$

When Y = aX + b, with $a \neq 0$, then
$f_Y(y)$ = $\frac{1}{|a|}f_X(\frac{y-b}{a})$

When Y = g(X) and g is *strictly* monotonic, there always exists a function h s.t.
y = g(x) iff x = h(y)

Assuming h is differentiable, we have
$f_Y(y)$ = $f_X(h(y))|\frac{dh}{dy}(y)|$
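
For a quick example of the CDF method: let X be uniform over (0,1] and Y = $-\frac{1}{\lambda}\ln X$. Then, for $y \geq 0$,
$F_Y(y)$ = $P(-\frac{1}{\lambda}\ln X \leq y)$ = $P(X \geq e^{-\lambda y})$ = $1 - e^{-\lambda y}$
so Y is exponential with parameter $\lambda$ (this is also how exponential samples are generated from uniform ones).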

Monday, September 20, 2010

Introduction to probability, ch#3 (General Random Variables) notes - Part-1

A "continuous" random variable can take any value on the real number line unlike a discrete random variable.

The PDF (Probability Density Function) of a continuous random variable X, the counterpart of the PMF of a discrete random variable, is a nonnegative function $f_X$ s.t.
P($X \in B$) = $\int_B f_X(x)dx$
for every subset B of the real line. In particular, the probability that the value of X falls within an interval is
P($a \leq X \leq b$) = $\int_a^b f_X(x)dx$.

Note that, for a single value a, P(X = a) = $\int_a^a f_X(x)dx$ = 0. Hence, excluding or including the endpoints of any interval has no effect on its probability. That is
P($a \leq X \leq b$) = P($a < X < b$) = P($a \leq X < b$) = P($a < X \leq b$)

The PDF also has to satisfy the normalization property: $\int_{-\infty}^{\infty} f_X(x)dx$ = 1

Also, note that the PDF value is not the probability of any particular event, and hence is not necessarily less than or equal to 1. It just has to be nonnegative and satisfy the normalization property stated above.

-- Expectation and Variance: --
Let X be a continuous random variable with PDF $f_X(x)$

E[X] = $\int_{-\infty}^{\infty} xf_X(x)dx$
E[g(X)] = $\int_{-\infty}^{\infty} g(x)f_X(x)dx$

var(X) = $E[(X - E[X])^2]$ = $\int_{-\infty}^{\infty} (x - E[X])^2f_X(x)dx$ = $E[X^2] - (E[X])^2$

If, Y = aX + b then
E[Y] = aE[X] + b
var(Y) = $a^2$var(X)


-- Cumulative Distribution Function --
CDF is a unified concept to deal with discrete as well as continuous random variables. CDF of X is denoted by $F_X$.
$F_X(x)$ = P($X \leq x$) = $\sum_{k \leq x} p_X(k)$ , if X is discrete
$F_X(x)$ = P($X \leq x$) = $\int_{-\infty}^{x} f_X(t)dt$, if X is continuous

Conceptually, $F_X(x)$ "accumulates" probability "upto" the value x.

If X is discrete and integer-valued, then we can obtain the PMF of X from the CDF of X as follows:
$p_X(k)$ = $P(X \leq k)$ - $P(X \leq k-1)$ = $F_X(k)$ - $F_X(k-1)$, for all integers k

If X is continuous, then the PDF can be obtained by differentiating the CDF (assuming it has a derivative):
$f_X(x)$ = $\frac{dF_X}{dx}(x)$

-- Conditioning on an event --
Let A be an event with P(A) > 0. The conditional PDF of X given A, $f_{X|A}$, is defined so that
$P(X \in B | A)$ = $\int_B f_{X|A}(x)dx$

In the special case when the conditioning event is of the form $X \in A$, we have
$P(X \in B | X \in A)$ = $\frac{P(X \in B, X \in A)}{P(X \in A)}$ = $\frac{\int_{A \cap B}f_X(x)dx}{P(X \in A)}$

so, $\int_B f_{X|A}(x)dx$ = $\frac{\int_{A \cap B}f_X(x)dx}{P(X \in A)}$

Hence, for events of the type $X \in A$,
$f_{X|A}(x)$ = $\frac{f_X(x)}{P(X \in A)}$, if $x \in A$, and 0 otherwise
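
For example, if X is exponential with parameter $\lambda$ and A is the event $X > t$, then $P(X \in A)$ = $e^{-\lambda t}$ and
$f_{X|A}(x)$ = $\frac{\lambda e^{-\lambda x}}{e^{-\lambda t}}$ = $\lambda e^{-\lambda (x-t)}$, if $x > t$, and 0 otherwise
which is the exponential PDF shifted to start at t - the memorylessness of the exponential distribution.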

A version of total probability theorem:
Let $A_1, A_2, ..., A_n$ form a partition of the sample space with $P(A_i) > 0$ for each i, then
$f_X(x)$ = $\sum_{i=1}^{n} P(A_i)f_{X|A_i}(x)$

-- Conditional Expectation --
E[X|A] = $\int_{-\infty}^{\infty} xf_{X|A}(x)dx$
E[g(X)|A] = $\int_{-\infty}^{\infty} g(x)f_{X|A}(x)dx$

A version of the total expectation theorem:
E[X] = $\sum_{i=1}^{n} P(A_i)E[X|A_i]$
E[g(X)] = $\sum_{i=1}^{n} P(A_i)E[g(X)|A_i]$

Sunday, September 19, 2010

Some Discrete Random Variables

From Introduction to Probability.


Discrete Uniform over [a,b]
When X can take any value from a to b (both inclusive) with equal probability.

$p_X(k)$ = $\frac{1}{b-a+1}$ if k = a, a+1, a+2, ..., b-1, b
$p_X(k)$ = 0 otherwise

E[X] = $\frac{a+b}{2}$

var(X) = $\frac{(b-a)(b-a+2)}{12}$
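
As a quick check, a fair die roll is uniform over [1,6], giving E[X] = 3.5 and var(X) = $\frac{5 \times 7}{12}$ = $\frac{35}{12}$.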


Bernoulli with Parameter p
It describes the success or failure of a single trial. For example, we can say that X = 1 if a toss results in HEAD (which happens with probability p), and X = 0 if the toss results in TAIL.

$p_X(k)$ = p if k = 1
$p_X(k)$ = 1-p if k = 0

E[X] = p

var(X) = p(1-p)


Binomial with Parameters p and n
It describes the number of successes in n independent Bernoulli trials. For example, the total number of HEADs obtained in n tosses.

$p_X(k)$ = ${n \choose k}p^k(1-p)^{n-k}$, if k = 0,1,2,...,n

E[X] = np

var(X) = np(1-p)

Also, we can build a binomial random variable out of Bernoulli random variables as follows.
Let $X_1, X_2, ..., X_n$ be n independent Bernoulli random variables with parameter p. Then
the binomial random variable with parameters n and p is X = $X_1 + X_2 + ... + X_n$
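
This decomposition gives the mean and variance immediately: E[X] = $\sum_{i=1}^{n} E[X_i]$ = np by linearity of expectation, and var(X) = $\sum_{i=1}^{n} var(X_i)$ = np(1-p) because the $X_i$s are independent.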


Geometric with Parameter p
It describes the number of trials until the first success, in a sequence of independent Bernoulli trials. For example, the number of tosses required to get first HEAD.

$p_X(k)$ = $(1-p)^{k-1}p$, if k = 1,2,....

E[X] = $\frac{1}{p}$

var(X) = $\frac{1-p}{p^2}$


Poisson with Parameter $\lambda$
It approximates the binomial PMF when n is large, p is small, and $\lambda$ = np.

$p_X(k)$ = $e^{-\lambda}\frac{\lambda^k}{k!}$, k = 0,1,2...

E[X] = $\lambda$ = var(X)
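
For example, with n = 100 and p = 0.01 (so $\lambda$ = np = 1), the binomial probability of zero successes is ${(0.99)}^{100} \approx 0.366$, while the Poisson approximation gives $e^{-1} \approx 0.368$.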

Introduction to probability, ch#2 (Discrete Random Variables) notes

A random variable is a real-valued function of the outcome of the experiment. It's called discrete if the range of that function is finite or countably infinite.
A discrete random variable has an associated PMF (probability mass function), which gives the probability of each numerical value that the random variable can take. For example, if x is any possible value of a random variable X, the probability mass of x, denoted by $p_X(x)$, is the probability of the event {X = x} consisting of all the outcomes that give rise to a value of X equal to x, so
$p_X(x)$ = P({X = x})
Note that, as per the normalization probability axiom, $\sum_{x}p_X(x)$ = 1. Also, for any given set S, we have P($X \in S$) = $\sum_{x \in S}p_X(x)$

-- Expectation/Mean --
E[X] = $\sum_{x}xp_X(x)$
and for any function g(X) of X, E[g(X)] = $\sum_{x}g(x)p_X(x)$

-- Variance --
var(X) = E[$(X-E[X])^2$] = E[$X^2$] - ${(E[X])}^2$
Standard Deviation, $\sigma_X$ = $\sqrt{var(X)}$

-- nth moment of X --
nth moment of X = E[$X^n$]

Mean and Variance of a Linear Function of a Random Variable:
Let X and Y be two random variables s.t. Y = aX + b, where a and b are given scalars. Then
E[Y] = aE[X] + b
var(Y) = $a^2$var(X)

-- Joint PMFs: --
Let X and Y be random variables associated with the same experiment.

The joint PMF $p_{X,Y}$ of X and Y is defined by
$p_{X,Y}(x,y)$ = P(X = x, Y = y)

The marginal PMFs of X and Y respectively are
$p_X(x)$ = $\sum_{y}p_{X,Y}(x,y)$
$p_Y(y)$ = $\sum_{x}p_{X,Y}(x,y)$

E[g(X,Y)] = $\sum_{x}\sum_{y}g(x,y)p_{X,Y}(x,y)$
In case g(X,Y) is linear, i.e. g(X,Y) = aX + bY + c, then E[aX + bY + c] = aE[X] + bE[Y] + c

-- Conditional PMFs: --
The conditional PMF of X given an event A with P(A) > 0 is defined by
$p_{X|A}$(x) = P(X=x|A)
It satisfies all the probability axioms, including normalization, that is $\sum_{x}p_{X|A}(x)$ = 1

The conditional PMF of X given the value y of another random variable Y is related to the joint PMF by
$p_{X,Y}(x,y)$ = $p_Y(y)p_{X|Y}(x|y)$

and this can be used to calculate the marginal PMF of X (similar to the total probability theorem)
$p_X(x)$ = $\sum_{y}p_Y(y)p_{X|Y}(x|y)$

-- Conditional Expectation: --
Conditional expectation of X given an event A with positive probability is defined by
E[X|A] = $\sum_{x}xp_{X|A}(x)$
and for a function, g(X) we have
E[g(X)|A] = $\sum_{x}g(x)p_{X|A}(x)$

The conditional expectation of X given a value y of Y is defined by
E[X|Y=y] = $\sum_{x}xp_{X|Y}(x|y)$

And this leads to the total expectation theorem
E[X] = $\sum_{y}p_Y(y)E[X|Y=y]$

Let $A_1, A_2, ..., A_n$ be positive-probability events that form a partition of the sample space. Then
E[X] = $\sum_{i=1}^{n}P(A_i)E[X|A_i]$

Let $A_1, A_2, ..., A_n$ form a partition of an event B with $P(A_i \cap B) > 0$ for all i. Then
E[X|B] = $\sum_{i=1}^{n}P(A_i|B)E[X|A_i \cap B]$


-- Independent Random Variables: --
Let A be an event, with P(A) > 0, and let X and Y be random variables associated with the same experiment. Then

X is independent of the event A if $p_{X|A}(x)$ = $p_X(x)$ for all x.

X and Y are independent iff $p_{X,Y}(x,y)$ = $p_X(x)p_Y(y)$ for all x,y. And for independent X and Y:
E[XY] = E[X]E[Y]
E[g(X)h(Y)] = E[g(X)]E[h(Y)]
var(X+Y) = var(X) + var(Y)
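
As a quick check, if X and Y are independent Bernoulli random variables with parameter p, then X + Y is binomial with n = 2, and indeed var(X + Y) = p(1-p) + p(1-p) = 2p(1-p), which matches np(1-p).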

Thursday, September 16, 2010

Introduction to Probability - Ch#1 notes

A probabilistic model is a mathematical description of an uncertain situation. Every probabilistic model involves an underlying process, called the experiment, that will produce exactly one out of several possible (mutually exclusive) outcomes. The set of *all* possible outcomes is called the sample space of the experiment, and is denoted by $\Omega$. A subset of the sample space, a collection of possible outcomes, is called an event.
The probability law assigns to any event A a nonnegative number P(A) (called the probability of A) that encodes our knowledge or belief about the collective "likelihood" of the elements of A. The probability law *must* satisfy the following probability axioms:

Nonnegativity: P(A) $\geq$ 0, for every event A.
Additivity: If A and B are two disjoint events, then the probability of their union satisfies, $P(A \cup B)$ = P(A) + P(B) . It can be generalized to any number of disjoint events.
Normalization: P($\Omega$) = 1

-- Conditional Probability --
The conditional probability of an event A, given an event B with P(B) > 0, is defined by P(A|B) = $\frac{P(A \cap B)}{P(B)}$ and specifies a new (conditional) probability law on the same sample space $\Omega$. All the probability axioms remain valid under the conditional probability law.

Multiplication Rule:
Assuming that all of the conditioning events have positive probability, we have
P($\cap_{i = 1}^{n} A_i$) = $P(A_1)P(A_2|A_1)P(A_3|A_1 \cap A_2)...P(A_n|\cap_{i = 1}^{n-1} A_i)$

Total Probability Theorem:
Let $A_1, A_2, ..., A_n$ be disjoint events that form a partition of the sample space and assume that probability of all these events is greater than 0. Then, for any event B, we have
P(B) = $P(A_1)P(B|A_1)$ + $P(A_2)P(B|A_2)$ + ... + $P(A_n)P(B|A_n)$

Bayes' Rule:
Let $A_1, A_2, ..., A_n$ be disjoint events that form a partition of the sample space, and assume that each has positive probability. Then, for any event B such that P(B) > 0, we have
$P(A_i|B)$ = $\frac{P(A_i)P(B|A_i)}{\sum_{j = 1}^{n}P(A_j)P(B|A_j)}$
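
For a quick numeric illustration (numbers made up): let $A_1$ be the event that a person has some disease, with $P(A_1)$ = 0.01, let $A_2$ = $A_1^c$, and let B be the event that a test comes back positive, with $P(B|A_1)$ = 0.95 and $P(B|A_2)$ = 0.05. Then
$P(A_1|B)$ = $\frac{0.01 \times 0.95}{0.01 \times 0.95 + 0.99 \times 0.05}$ = $\frac{0.0095}{0.0590}$ $\approx$ 0.16
so even a positive result from a fairly accurate test leaves the probability of disease at only about 16%.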

-- Independence: --
Two events A and B are independent if and only if $P(A \cap B)$ = P(A)P(B). This can be generalized to any number of events.
Also, as a consequence P(A|B) = P(A), provided P(B) > 0

Conditional Independence:
Two events A and B are said to be conditionally independent given another event C with P(C) > 0, if
$P(A \cap B| C)$ = P(A|C)P(B|C).
And, if in addition $P(B \cap C)$ > 0, then conditional independence is equivalent to the condition
P(A |$B \cap C$) = P(A|C)

Note that independence does not imply conditional independence, and vice versa.

Saturday, September 11, 2010

Facebook Engineering Puzzles

A colleague of mine pointed me to the facebook engineering puzzles page yesterday, and I tried giving the first one (the liar-liar puzzle) a shot. It took a good number of hours today, but it was fun.

And I succeeded on the first submission itself :). Here is the mail I received from facebook:

from Facebook PuzzleBot
to g.himanshu@gmail.com
date Sat, Sep 11, 2010 at 4:45 PM
subject liarliar evaluation results
mailed-by facebook.com


Dear submitter,

Thank you for your submission of a puzzle solution to Facebook! After running your solution to liarliar (received on September 11, 2010, 12:37 am), I have determined it to be correct. Your solution ran for 2508.850 ms on its longest test case. If you have not already, try installing the official Facebook Puzzles application from http://apps.facebook.com/facebookpuzzles/ and publish a story about your solution! To publish, just go to http://apps.facebook.com/facebookpuzzles/mypuzzles.php and click on the publish link.

If you are applying for a position within Facebook, the puzzle difficulty solved will be taken into account with regards to how much time you had available to solve it (remember that Hors D'oeuvres are only tests for your benefit). I have taken the liberty of alerting our recruiting team about your success. Best of luck!

Sincerely,
-The puzzle robot


sorry, but no open-sourced code this time, for obvious reasons. :P

[Update#1] I got the 2nd one (breathalyzer) accepted too. It took multiple submissions though :(, and hence more time.

[Update(May 02 - 2011)] After a long time, yesterday I worked on the Gattaca puzzle and nailed it. Regarding the timing, the bot says it took 4132.647 ms on the longest test case. My first solution was O(2^n), as I thought this was an NP-hard problem, and it got rejected. I then followed the discussions on it, realized that it is *not* NP-hard, and managed to work out code that runs in O(n^2) in the worst case.

[Update(May 08 - 2011)] dancebattle is finished too. My first submission was accepted, but the longest test case time was 1108.22 ms. Then I optimized it a little, resubmitted, and the bot says it took 263.009 ms on its longest test case :).

[Update(May 09 - 2011)] smallworld is done. Note that the trivial O(n^2) solution does not succeed. The bot says my solution took 9928.117 ms on its longest test case.

[Update(May 16 - 2011)] Today I received the acceptance email for my solution to fridgemadness (Refrigerator Madness). The hardest part of this puzzle is really understanding the problem :). Regarding the timing, the puzzle bot says it took 362.562 ms on its longest test case.

Thursday, September 9, 2010

Habits that make you a better programmer in the long run

I read a thread on hacker news about "what habits made you a better programmer?". Here are some of the things (in no particular order) that resonated with me personally.

  • Learning new things. Every year or so try to learn a new major skill like functional languages, AI, Compilers, OS etc.
  • Think first, write later.
  • Read great code written by great people and read your own code also.
  • Surround yourself with real smart people.
  • Spend an extra hour or so after you're *done* with something to give it finishing touches and handle the edge cases.
  • Work on what is exciting to you, then switch when the excitement wears off and switch back again when excitement for it comes back. Though, it needs some level of discipline to come back.
  • Have the discipline to go 100%, keep notes and write a blog.
  • Learning to love new languages that make you think differently.
  • Learn one editor really well.
  • Write small utility project/programs for the tool/technology you're curious about rather than waiting for THE master project.
  • Find someone to pair program with who is a great programmer.
  • Persistence and patience are virtues.
  • Keeping a healthy body and mind is very important. Practice yoga or something regularly.
  • Socialize, be part of hack/code fests and such forums and groups.

Wednesday, September 8, 2010

Performance Tuning

I'm scribbling down here my recent (and ongoing) experience with performance tuning a java web application. Like any other article on optimization, I'll start with the quote, "Premature optimization is the root of all evil.". My take: don't *optimize* prematurely, but don't make that an excuse for *not measuring* the performance of your system. It is good to know how your system performs at all times.

Anyway, here is the common approach (reinforced after attending a talk on performance tuning at devcamp-2010) that everyone knows about, but I'm putting it here anyway...

1. Pick conservative hardware and install your application on it.

2. Create a large (but realistic) sample dataset. It might be good to just take a dump from production and load your system with it.

3. Put your system under extreme load. In general the load should be broad, exercising all the functions of your system. But in the special case where you're exploring just one feature's performance/GC-activity, put the load on that feature only. Use any of the load-generation tools... Apache Bench (non-gui) and Apache JMeter (gui as well as non-gui modes) are two free ones.

4. Measure, Measure and Measure.
This gets tricky. I've tried using hprof for memory dumps and the Runtime.freeMemory and Runtime.maxMemory methods for memory-consumption analysis, but ended up using a profiler (a commercial one that is standard in our org), and it really helps.

5. Make your change, if any, then measure again, and keep repeating.

6. Automate the whole process as much as possible, for faster feedback loops.

So far we've found the following major boosters...

1. Adding indexes. We were always very careful about adding indexes, as they become an overhead for write/update-heavy tables, but we later found many tables where indexes help a lot more than we had imagined.

2. We noticed that a lot of garbage was generated, and most of it was char[] data resulting from a lot of Strings. We finally figured out that the culprit was the heavy amount of logs *printed*. Note that log statements in the code are not that expensive, but printing them is. So, for the prod/perf environment, set the print log level to ERROR, and be judicious about what level each log statement in the app gets (DEBUG/INFO/WARN/ERROR/FATAL); anything in log.error should really be an error.

3. Guarding log.debug statements with log.isDebugEnabled (see the sketch after this list).

4. Caching with LRU (the eviction policy may very well depend on the particular use case).
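
A minimal sketch of the log-guard idea from point 3, assuming a log4j-style logger (the class and helper method here are made up for illustration):

```java
import org.apache.log4j.Logger;

public class LogGuardExample {
    private static final Logger log = Logger.getLogger(LogGuardExample.class);

    // Stands in for any expensive message construction - exactly the kind
    // of throwaway char[]/String garbage described in point 2.
    static String expensiveDump(int[] data) {
        StringBuilder sb = new StringBuilder();
        for (int d : data) sb.append(d).append(',');
        return sb.toString();
    }

    public static void process(int[] data) {
        // The guard ensures expensiveDump() only runs when DEBUG is enabled,
        // so production (log level ERROR) pays nothing for this statement.
        if (log.isDebugEnabled()) {
            log.debug("processing data: " + expensiveDump(data));
        }
        // ... actual processing ...
    }
}
```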

[Update:] We realized that our background jobs are very memory/cpu hungry, so we created a separate job node to run just the background jobs. It has improved the performance of serving synchronous client requests tremendously. However, we are still working to make the background jobs more efficient in terms of memory/cpu usage.

[Update:10-12-10] We realized that we usually used bind variables in sql/hql queries when we wrote them by hand, but when queries were generated dynamically we forgot (it's easy) to use bind variables. Also, we never used bind variables wherever we had IN clauses. Since bind variables reduce parsing time in the db (already-parsed sqls get more cache hits), we are paying this technical debt and correcting all of it now (see the sketch below). While we're doing it, it seems the exercise would have been simpler if all the hql and sql statements had been written in one place; right now they are scattered across different places in the big project code-base.
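
To illustrate the bind-variable point, a hedged JDBC sketch (the table and column names are hypothetical). Concatenating literals produces a different SQL text for every value, so the db hard-parses each one; with a bind variable there is a single SQL text that gets parsed once and then found in the cache:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class BindVariableExample {
    // Bad: a new SQL text per id, so the db parses each one separately.
    static String badQuery(long id) {
        return "SELECT name FROM employees WHERE id = " + id;
    }

    // Good: one SQL text with a bind variable; parsed once, then reused.
    static String fetchName(Connection conn, long id) throws SQLException {
        String sql = "SELECT name FROM employees WHERE id = ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setLong(1, id);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next() ? rs.getString(1) : null;
            }
        }
    }
}
```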

[Update:14-12-10] We have resorted to bulk manipulation queries, and they are giving significant improvements. To take an extreme example, it is much more efficient to directly execute the query "UPDATE t SET c = 'blah'" than to iterate over every data object, update it and save it (unfortunately, ORMs give us the habit of always thinking in terms of objects, and we forget these simple optimizations); see the sketch below.
Also, I find the contents of this stackoverflow thread very informative, and it seems I've made some of the mistakes mentioned there :).
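
A sketch of the bulk-update idea in plain JDBC (the orders table and its columns are made up for the example); one set-based statement does the work inside the db instead of one round trip per object:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class BulkUpdateExample {
    // Updates every matching row in a single statement, instead of
    // loading each object, mutating it and saving it back one by one.
    static int archiveOldOrders(Connection conn, java.sql.Date cutoff) throws SQLException {
        String sql = "UPDATE orders SET status = 'ARCHIVED' WHERE created_on < ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setDate(1, cutoff);
            return ps.executeUpdate(); // number of rows updated
        }
    }
}
```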

[Update:10-03-11] Never leave the memory requirements of an operation unbounded. If your memory requirement grows in proportion to some dataset size, and you believe that dataset size has an upper bound, then put some mechanism in place so that you get notified if the dataset grows beyond your assumption. In my experience, my assumptions were usually either wrong to begin with, or the load grew with time and they became wrong.. so it is much better to design your code so that the dataset size is bounded by design rather than by belief :). For example, if you're fetching a list of records from the database, there are probably good enough reasons to use pagination parameters (a sketch below shows one way).
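
One way to make the bound a property of the code rather than a belief, sketched with JDBC and an Oracle-style ROWNUM cap (the jobs table and the limit are hypothetical): cap the fetch, and complain loudly when the cap is hit.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

public class BoundedFetchExample {
    static final int MAX_ROWS = 1000; // the designed-in bound

    static List<String> fetchPendingIds(Connection conn) throws SQLException {
        // Never pull more than MAX_ROWS + 1 rows into memory.
        String sql = "SELECT id FROM jobs WHERE status = 'PENDING' AND ROWNUM <= ?";
        List<String> ids = new ArrayList<>();
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setInt(1, MAX_ROWS + 1);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) ids.add(rs.getString(1));
            }
        }
        if (ids.size() > MAX_ROWS) {
            // The size assumption was violated - fail fast so we notice,
            // instead of silently growing memory use.
            throw new IllegalStateException("more than " + MAX_ROWS + " pending jobs");
        }
        return ids;
    }
}
```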

Sunday, September 5, 2010

Plan - Introduction to Probability

I'm working through Introduction to Probability. Yesterday, at devcamp, I met Ravi, and as always he was curious about my progress with the book. So here it comes, the plan to finish this book :).

1. Ch-1, Sample Space and Probability -- Done
2. Ex* + Ch-2, Discrete Random Variables -- Done
3. Ex + Ch-3, General Random Variables -- Done [on Sep 05' 10]
4. Ex + Ch-4, Further Topics on Random Variables -- Done [on Sep 16' 10]

Problem-Set#1 - Done [on Sep 26' 10]
Problem-Set#2 - Done [on Sep 27' 10]
Problem-Set#3 - Done [on Sep 28' 10]
Problem-Set#4 - Done [on Sep 29' 10]
Problem-Set#5 - Done [on Oct 04' 10]
Problem-Set#6 - Done [on Oct 04' 10]

5. Ex + Ch-5, The Bernoulli and Poisson Processes - Done [on Oct 09' 10]
6. Ex + Ch-6, Markov Chains - Done [on Oct 16' 10]
7. Ex + Ch-7, Limit Theorems - Done [on Oct 23' 10]

Problem-Set#7 - Done [on Nov 2' 10]
Problem-Set#8 - Done [on Nov 2' 10]
Problem-Set#9 - Done [on Nov 2' 10]
Problem-Set#10 - Done [on Nov 23' 10]
Problem-Set#11 - Done [on Nov 23' 10]
Problem-Set#12 - Done [on Nov 23' 10]

Ex* - working through a few end-of-chapter exercises of the chapter just finished.
Problem-Set* - This book is used as the textbook in a probability course at MIT. I intend to work through all the problem sets that were given as assignments in that course.

I have intentionally not put any dates above, as they never work out for me (my schedule is not completely in my control). That said, I believe each chapter should take a week or so. I will keep updating this post with dates whenever I get done with a chapter or a problem set.

While working through it, I'm trying to follow the advice given here. I really wish to follow the advice to make the idea your own, and I'm looking for people working through this book.

@devcamp - 2010

This is the first devcamp I attended. Once I got to thoughtworks, I was promptly attended to, felt welcome, and was shown the hall where the action was about to begin. It was a big hall filled with many (supposedly) geeky people with laptops. A board at the front listed 3 tracks to be run in parallel, with stickies mentioning the topics to be presented.

All the presenters came up to introduce themselves and their sessions. On another board some topics were listed for QnA, and I found it interesting that volunteers from the audience were chosen to answer the questions on various topics.

And then, the sessions began... and throughout the day, things went on and on and on. Among many good things, there were some things I especially noted...

- The audience was demanding and asking questions, looking for explanations.
- Presenters had running code to show.
- A very informal environment was created due to the lack of space in the rooms; no one had any problem sitting on the floor.

And, did I mention I also got a devcamp t-shirt :).

All in all, it was a good experience. Thanks to all the people who were behind the event; it ran pretty smoothly.