YCrevolution wrote:
Version 3.0 Likely Features
- User selection of a specific URM race
- State residency feature for state law schools
- Significant work experience feature
Nutshell: Current tool good. URM predictability mediocre at best. Future plans: Horrible.
Analysis:
Alright, thus far the way this program has progressed has been completely logical. It started as a simple calculator based on indices, plus some predictive indices to 'fill in the gaps'. Then it got significantly more advanced through a compilation of LSN data, a move that made the program useful given that indices create very broad bands for accept, consider, reject. Further, it differentiated the program from other predictive tools by correlating LSN data with indices, something that had not been done before. Then a URM factor was added which used data from LSN.
This is where things start to get a little hairy because we're not correlating data off of 35,000 data points, but a fraction (13%) of that data. Assuming a roughly equal distribution of URM applicants across schools, that works out to between 20 and 40 data points per school being used to make a statistical correlation. Alternatively, one could assume that URM applications are unevenly distributed, such that some schools may have 10 data points and others 100. This provides an equally poor result without some indication of which schools are unevenly represented and therefore likely to be predicted better or worse (here's a feature that could be added).
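For anyone who wants to check the arithmetic, here is a rough sketch in Python. The 35,000 and 13% figures are the ones quoted above; the school counts are only what the 20 to 40 range implies under an even spread, not numbers I have looked up.

```python
# Back-of-the-envelope arithmetic for the URM sample size.
# 35,000 total points and a 13% URM share are the figures quoted above.
total_points = 35_000
urm_share = 0.13

urm_points = round(total_points * urm_share)  # URM data points overall

# If an even spread leaves 20 to 40 URM points per school, the number of
# schools covered is implied by the division, not known independently.
schools_if_40_each = urm_points / 40
schools_if_20_each = urm_points / 20

print("URM data points:", urm_points)
print("implied schools:", schools_if_40_each, "to", schools_if_20_each)
```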
A claim is made on these results that LSP may actually be a 'better' predictor for URMs than for average applicants. This claim is just absurd and incredibly misleading. First, of course the predictions will accurately match past results; they are fit to past results. Second, and again obviously, a statistical tool is going to fit fewer data points better than it fits more. I could fit a 100th degree polynomial to 10 points (a 9th degree one is already enough) and it'll give me a correlation of 1, but fit it to 1,000,000 points and that correlation won't be quite as good.
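The polynomial point can be shown with a minimal sketch. The data here are made up (a straight-line trend plus alternating noise), and the interpolation routine is a plain Lagrange form rather than anything LSP uses. The fit is perfect on the 10 points it was built from and poor on points in between.

```python
# Minimal sketch: a polynomial with as many coefficients as data points
# reproduces its training data exactly, yet does badly in between.
# All data below are invented for illustration.

def lagrange_eval(xs, ys, x):
    """Evaluate the unique degree-(n-1) polynomial through (xs, ys) at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        basis = 1.0
        for j, xj in enumerate(xs):
            if j != i:
                basis *= (x - xj) / (xi - xj)
        total += yi * basis
    return total

def mean_abs_error(xs, ys, points, targets):
    """Average absolute gap between the fitted polynomial and the targets."""
    errors = [abs(lagrange_eval(xs, ys, p) - t) for p, t in zip(points, targets)]
    return sum(errors) / len(errors)

# Ten training points: the true trend is y = x, the noise alternates +1 / -1.
train_x = [float(i) for i in range(10)]
train_y = [x + (1.0 if i % 2 == 0 else -1.0) for i, x in enumerate(train_x)]

# Held-out points halfway between the training points, on the true trend.
test_x = [x + 0.5 for x in train_x[:-1]]
test_y = list(test_x)

in_sample_error = mean_abs_error(train_x, train_y, train_x, train_y)
held_out_error = mean_abs_error(train_x, train_y, test_x, test_y)

print("in-sample error:", in_sample_error)
print("held-out error:", held_out_error)
```

The in-sample error comes out at essentially zero while the held-out error is much larger, which is the whole problem with grading a tool on the data it was fit to.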
Which takes us to these future plans. As of now, there is no data available outside of LSN regarding applicants' cycles. There are no indices or other forms of data that provide insight into how admissions treats URM status, work experience, etc. I don't have the same assembled data that you do, so this is somewhat speculative, but I would be surprised if I was that far off. You are now going to begin developing predictive statistical tools based on what is likely too small a sample size. The URM data was already an incredibly small sample; break that data down into subcategories and you're going to be predicting people's admissions chances on 10 data points. Of course you could add data from previous cycles, but you run into a trade-off there: the further back you go, the less reliable the results. You've stated that as of now you don't have the data for this, but to list this as a future plan with an estimated launch date in 2010 is disingenuous without any means of solving this shortcoming.
With regard to work experience, I would LOVE to know how this is even possible to incorporate outside of some arbitrary additional numeric 'boost'. How you would even develop a reasonable method for determining that 'boost' is beyond me. There is just no statistical way to incorporate work experience into a model that is developed by being fit to data points. How do you discern between 5 years of employment at McDonald's and 2 years at a hedge fund? Oh, and a shortage of data is again apparent here.
You state that an ability to interpret the results is key to using your tool. Perhaps you could consider adding a "reliability of data" category. This would provide some indication of how much data the result is based on. I would strongly contend that, after these modifications, no one will be able to interpret the results otherwise, because there simply isn't enough data for a valuable correlation.
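As a rough idea of what a "reliability of data" category could look like, here is a sketch in Python. It assumes the tool knows how many data points and how many acceptances sit behind a prediction; the Wilson score interval is a standard choice for a proportion, while the function names and the 30/100 cut-offs are placeholders of mine, not anything from LSP.

```python
import math

def wilson_interval(accepts, n, z=1.96):
    """Approximate 95% Wilson score interval for an acceptance rate."""
    if n == 0:
        return (0.0, 1.0)  # no data: the rate could be anything
    p = accepts / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (max(0.0, centre - half), min(1.0, centre + half))

def reliability_label(n, low=30, high=100):
    """Hypothetical cut-offs; the thresholds are placeholders."""
    if n < low:
        return "low"
    if n < high:
        return "medium"
    return "high"

# Same 50% acceptance rate, very different amounts of data behind it.
for accepts, n in [(5, 10), (500, 1000)]:
    lower, upper = wilson_interval(accepts, n)
    print(n, reliability_label(n), round(lower, 2), round(upper, 2))
```

With 10 data points the interval spans roughly half of the whole 0 to 1 range, which is exactly the kind of thing a user should see next to the prediction.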
Another future endeavor you should probably embark on is to take your tool as developed on 2008-2009 data and verify its reliability against 2009-2010 results. This would be the best way of solving the problem that you're testing its predictive ability on the same data used to generate the predictive tool. You would just need to make a script that inputs 2009-2010 data into your tool and outputs a result. You could then compare how your tool did versus actual results and make adjustments to the algorithms. Instead of broadening the areas you're predicting, make your current tool better.
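The script I have in mind is not complicated. Here is a minimal sketch in Python, assuming the tool can be called as a function that returns an accept/reject prediction; `backtest`, `toy_predict`, and the records are all made up for illustration and are not LSP's actual interface or data.

```python
def backtest(predict, records):
    """Score a predictor on records it was NOT fit to.

    predict: function (lsat, gpa, school) -> True if admission is predicted.
    records: iterable of (lsat, gpa, school, admitted) from the later cycle.
    """
    records = list(records)
    hits = sum(1 for lsat, gpa, school, admitted in records
               if predict(lsat, gpa, school) == admitted)
    return hits / len(records) if records else float("nan")

# Stand-in for the tool fit on the earlier cycle (hypothetical rule).
def toy_predict(lsat, gpa, school):
    return lsat >= 165 and gpa >= 3.5

# Invented later-cycle outcomes: (lsat, gpa, school, admitted).
later_cycle = [
    (170, 3.8, "School A", True),
    (160, 3.9, "School A", False),
    (168, 3.6, "School B", False),
    (155, 3.0, "School B", False),
]

accuracy = backtest(toy_predict, later_cycle)
print("held-out accuracy:", accuracy)
```

The important part is only that the records being scored come from a cycle the tool never saw when it was fit.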