MATLAB Clauset Codes

After spending a week staring at a computer screen filled with numbers and strange codes, I have gotten a good handle on the MATLAB Clauset codes we are using to analyze blinking traces. Aaron Clauset created 3 MATLAB programs to fit blinking data with a probability distribution function and then asses the accuracy of this fitting. A power law distribution has commonly been seen across literature as potential probability distribution function for blinking traces. Because of this, the Clauset codes focus on the power law.

The two parameters alpha and x-min are focused on through out the calculations. X-min is determined to find the minimum value of x which most likely fit a power law. Alpha is the exponent in the power law equation. The program first selects every possible value of x-min (which is every unique value of x) and calculates the corresponding alpha value using maximum likelihood estimation. For every pair of x-min and alpha, a D-value is calculated. The D-value represents the deviation of the data from the model, in which a D-value equal to zero represents a perfect fit. The final parameters for the power law fit function are the x-min and alpha pair with the lowest D-value.

The second program allows us to create a a visual representation of this fit by producing a histogram. The raw data is plotted on a time duration vs (1-CDF) graph. The CDF is the probability of x less than Xi occurring; the sum of data points less than Xi divided by the total number of data points. Using the xmin and alpha values previously calculated, all values of x greater than or equal to x-min are plugged into the power law fit function; the resulting line is plotted against the raw data. This allows us to have a visual representation of how well the power law fits to the raw data, and how much data is disregarded due to the value of x-min (Fig.1).

The final part of the program assess the accuracy of the power law fit function. Although a fit function was calculate, this does not necessarily mean it is a good fit, it is simply the best fit power law function. The Clauset codes produce a P-value which essentially represents the goodness of fit. A p-value less than .05 rejects the power law as a good fit to the data. The p-value is calculated by producing synthesized data sets. X values greater than x-min from the raw data are plugged into the power law fit function to produce a new set of x(synth) values. X values less than x-min are chosen randomly and included in the synthesized data sets. Each synthesized data set then goes through the process from the beginning, finding a new best fit function and new D -value. The p-value is then a comparison of the new D-value to the original D-value from the raw data. Precision of the P-value increases with an increase in the number of synthesized data sets produced.

Now that we understand how to analyze the data, the next step is to analyze the data less than x-min that is left out in the power law fit calculations in the Clauset codes. This will require a new code and maybe even a different program from MATLAB. I am also starting to collect blinking traces of Rhodamine 6 G, a molecular similar to Rhodamine B. This means I start over and find 100 more molecules. Between that and researching new ways to analyze what data currently have, I will be very busy these next few weeks.