This whitepaper summary provides a broad description of the improvements in our AI over the past year, some specific results illustrating the benefits of these improvements, our predictions for Gazelles and their drivers, and industry-stylized facts for 2020 and 2021.
Important improvements deployed for this 2020 production run include creating much larger analysis and prediction datasets, adding many new firm and contemporaneous signal variables, disaggregating advanced manufacturing into 5 new industry groups, completing and applying several complex algorithm and deep learning improvements, continuing to improve the speed and memory usage of various programs through parallelization and GPU processing, and improving fitting accuracy for nearly all industry groups. Our overall Analysis (training) dataset has grown by ~30% and is distributed across 21 (previously 17) industrial groups. The Prediction dataset has grown by a little less than a factor of three, to 18M firm X NAICS records. We G-Scored 93% of our existing database, including traded, local, headquarters, subsidiary, branch, and international firms. We have added 59 new firm-level and contemporaneous (weekly and monthly) variables designed to better capture short-term firm-specific, weekly, and monthly events. This allows us to reflect, at least partially, the economic impacts of short-run shocks such as Covid-19.
These improvements have yielded results that are partly familiar and partly new. AI predictive accuracy has increased for 18 of 21 industrial groups, remained unchanged for 1 group, Professional, Scientific, and Technical Workers (NAICS 540000:549999), and slightly declined for 2 industrial groups: Arts, Recreation, Accommodation, and Food Services (NAICS 700000:799999) and Wholesale Trade (NAICS 400000:439999). Crucially, deep learning elements have become substantially more important in the creation of accurate predictions. This suggests that the complexity of interactions among firms, industries, and their regions has appreciably increased.
Although we oversample expanding firms in order to identify and distinguish expansion signals more comprehensively, the call shares across industrial groups have moved closer to our expectation, i.e. lower. The share of firms classified as expanding and ‘to-be-called’ in 2020 and early 2021 lies between 24% and 57% across groups, averaging around 36%, compared with the previous range of 26% to 60%, averaging around 41%. This suggests that, as expected, fewer firms are predicted to be expanding following the Covid-19 shock.
Interestingly, there are many consistencies but also some key changes in the top signals predicting firm expansions compared to previous runs. Many of the previous top 10% predictors remain among the current top 10%, showing consistency over time in what drives firm expansions. However, the current run has many more deep learning hidden nodes in the group of association weights with the largest absolute magnitude. Another change is that firm-level variables have become much more important than in the past: they just eke out the top spot when variable categories are ranked by the percentage of their members that fall in the top 10% of important signals. Regional signals took the second spot, and industry-indexed variables had the lowest percentage. These differences are less than 1%, however, indicating that all variable categories are needed to predict firm expansions in the current market climate, and that excluding any category of variables would likely have very detrimental effects.
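The category ranking described above can be illustrated with a minimal sketch. It assumes each signal has a single association weight and a single category label; the function name, signal names, and weights below are hypothetical and are not taken from the production run.

```python
from collections import Counter


def category_share_in_top(weights, categories, top_fraction=0.10):
    """Return, per category, the fraction of its signals in the top slice.

    weights:    dict of signal name -> association weight (sign is ignored)
    categories: dict of signal name -> category label
    """
    n_top = max(1, round(len(weights) * top_fraction))
    # Rank signals by absolute association weight, largest first.
    ranked = sorted(weights, key=lambda name: abs(weights[name]), reverse=True)
    top = ranked[:n_top]
    totals = Counter(categories[name] for name in weights)
    in_top = Counter(categories[name] for name in top)
    return {cat: in_top[cat] / totals[cat] for cat in totals}


# Toy, made-up weights (hypothetical signal names, not results from the run).
toy_weights = {
    "firm_sig_1": 0.90, "firm_sig_2": 0.10, "firm_sig_3": 0.05,
    "region_sig_1": -0.80, "region_sig_2": 0.20, "region_sig_3": 0.02,
    "industry_sig_1": 0.30, "industry_sig_2": -0.15,
    "industry_sig_3": 0.04, "industry_sig_4": 0.01,
}
toy_categories = {name: name.split("_")[0] for name in toy_weights}
shares = category_share_in_top(toy_weights, toy_categories, top_fraction=0.20)
```

Normalizing by category size matters here: a category with many variables would otherwise dominate the top slice simply by count, which is why the comparison is made on the percentage of each category rather than on raw counts.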
More specifically, there have been some changes in the very top signals (the top 10 list drawn from the top 10% of signals by absolute association weight) predicting firm expansions. A key similarity with previous runs is that industry data on firm financials still plays a very important role. In fact, a larger share (1/2 versus 1/5) of the top 10 list of signals across industries now consists of industry-indexed financial indicators. Much of this dominance comes from signals that reflect the sales of the firm, industry, or country of firm residence. This indicates that industry signals are among the most important for many separate industries. Another similarity is that many regional indicators appear in the top 10% of signals for multiple industries. In fact, four regional signals reside on the top 10 list, further highlighting their importance. A key difference is that regional workforce characteristics now play a much larger role, with one (adult education rates) residing on the top 10 list. In contrast, traditional BA degrees awarded in the region appear relatively less important. Another difference concerns firm-level variables. By some measures, and in contrast to previous runs, the evidence suggests these signals are the most important. However, other evidence indicates that fewer firm-level variables lie among the top 10% of signals for more than 4 industries, although each industry always has a few different ones in its top 10%. And, unlike previously, there is one firm-level variable on the top 10 list, indicating its relevance across industries. Overall, one must conclude that firm-level variables have become more important than they may previously have been, but only a few are common to many industries; most industries are sensitive to only one or two firm signals, and these differ from industry to industry.
And finally, industry and regional signals will still be the most effective for identifying expanding firms across multiple industries. These results indicate significant stability in the predictors of expansion, but also some critical differences not to be ignored.
In terms of deep learning, the introduction of dynamic deep learning has improved model fits for 11 of 21 industrial groups. However, about ½ of industry groups are still best described by a shallow algorithm. Model comparison results that favor a shallow network can be interpreted to indicate that those industries have relatively lower complexity in the relationships among predictors and firm expansions. The groups that appear to have the greatest complexity are, in decreasing order: Education, Health, and Social Assistance (NAICS 600000:699999), Food, Textiles, and Apparel (NAICS 300000:319999), Materials including Wood, Plastic, and Chemicals (NAICS 320000:329999), Fabricated Metals in general (NAICS 330000:399999), and Retail Trade (NAICS 440000:479999). For these groups, there may be very complex interactions among firm, industry, and regional signals that make the use of simple individual indicators or indices, or un-augmented statistics, particularly inaccurate.
Some other interesting industry-specific stylized facts and anomalies can be identified by looking at the fitting results as well as diagnostic statistics for the Analysis and Prediction datasets; Table 1 below summarizes these results. First, for fitting accuracy, the most poorly fitted groups were, in decreasing order: Professional, Scientific, and Technical Workers (NAICS 540000:549999), Primary and Fabricated Metals, Machines, and Furniture (NAICS 331000:333999 and 337000:337999), Wholesale Trade (NAICS 400000:439999), and Advanced Manufacturing of Other Medical Devices (NAICS 339000:339999). All these groups had 4% to 9% lower accuracy than the average across industry groups, and in several cases were outliers pulling down that average. Extra care and longer lists are likely needed when considering these industrial groups.
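The comparison against the cross-group average can be sketched as follows. This is only an illustration under stated assumptions: the gap is treated as a difference in percentage points (the text does not say whether the 4% to 9% figures are absolute or relative), the 0.04 cutoff is taken from the lower end of that range, and the group names and accuracies are placeholders rather than reported values.

```python
from statistics import mean


def flag_poorly_fitted(accuracy_by_group, min_gap=0.04):
    """List groups whose accuracy trails the cross-group mean by >= min_gap.

    accuracy_by_group: dict of group label -> fitting accuracy in [0, 1]
    Returns (group, gap) pairs sorted by gap, largest shortfall first.
    """
    average = mean(accuracy_by_group.values())
    flagged = [
        (group, average - accuracy)
        for group, accuracy in accuracy_by_group.items()
        if average - accuracy >= min_gap
    ]
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Placeholder accuracies for four hypothetical groups (not reported results).
toy_accuracy = {"group_a": 0.90, "group_b": 0.88, "group_c": 0.80, "group_d": 0.90}
flagged_groups = flag_poorly_fitted(toy_accuracy)
```

Because a poorly fitted group is itself included in the mean, a strong outlier lowers the benchmark it is compared against; a median or a leave-one-out mean would be a reasonable alternative if that effect is a concern.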
Second, in terms of the Analysis dataset Star Test diagnostics (reflecting the accuracy of GScores of 1 to 5 in identifying known expansions), GScores of 5 and 4 had substantially lower accuracy for 3 industry groups: Advanced Manufacturing of Computers (NAICS 334000:334999), again Primary and Fabricated Metals, Machines, and Furniture (NAICS 331000:333999 and 337000:337999), and to a lesser extent Information Technology (NAICS 500000:519999). Interestingly, for these groups the accuracy of GScores of 1 (predicted not to be expanding) was extremely high, indicating that the accuracy lost for GScores of 4 and 5 is offset by accuracy gained for GScores of 1. For these industry groups one should cast a wider net when building lists and at least also consider firms with GScores of 3, which are extremely accurate for Advanced Manufacturing of Computers due to the presence of multi-modal propensity distributions. Further, when building a list for these industries, it is likely wise to exclude any firm with a GScore of 1, as the prediction of ‘not expanding’ has higher validity.
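The list-building guidance above can be expressed as a simple filter. The sketch below assumes that a default list keeps GScores of 4 and 5 (an assumption, since the summary does not state the default cutoff) and that the wider net lowers the cutoff to 3 for the named groups. The `Firm` record, group labels, and firm names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Firm:
    name: str
    group: str   # industry group label
    gscore: int  # GScore from 1 (not expanding) to 5


def min_gscore_for(group, wide_net_groups, default_min=4, wide_min=3):
    """Cutoff for a group: lower it where GScores of 4 and 5 are less accurate."""
    return wide_min if group in wide_net_groups else default_min


def build_call_list(firms, wide_net_groups):
    """Keep firms at or above their group's cutoff.

    Any cutoff of 3 or more also drops GScores of 1, matching the advice
    to exclude those firms.
    """
    return [f for f in firms if f.gscore >= min_gscore_for(f.group, wide_net_groups)]


# Hypothetical labels standing in for the three groups named in the text.
wide_net = {"adv_mfg_computers", "metals_machines_furniture", "information_technology"}
firms = [
    Firm("alpha", "adv_mfg_computers", 3),
    Firm("beta", "adv_mfg_computers", 1),
    Firm("gamma", "retail_trade", 3),
    Firm("delta", "retail_trade", 5),
]
call_list = [f.name for f in build_call_list(firms, wide_net)]
```

Keeping the cutoff per group, rather than a single global threshold, lets the list widen only where the Star Test suggests the top GScores are less reliable.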
Third, in terms of the Analysis and Prediction dataset Share Test diagnostics (the percentage of firms predicted to be expanding and the percentages of firms of an industry group lying in the various GScore categories), there were a few outliers with exceptionally high percentages. Note that being an outlier does not imply that a result is incorrect, but rather that something unique is occurring in these industries, or that the industry has a large group of relatively high performers. More specifically, in the Analysis dataset, two groups had unusually high predicted call shares: in decreasing magnitude, Education, Health, and Social Assistance (NAICS 600000:699999), and Mining/Oil Extraction and Power Generation, and Utilities (NAICS 200000:234999). The implication is that one may expect more expansions than is typical in these industries. In this same dataset, 3 groups had unusually high shares of firms in the GScore 4 and 5 categories: in decreasing order, Other Services (NAICS 800000:811999), Information Technology (NAICS 500000:519999), and Mining/Oil Extraction and Power Generation, and Utilities (NAICS 200000:234999). This suggests these industries may have a larger than typical share of high performers, again due to a multi-modal propensity distribution.
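As a rough illustration of the Share Test quantities defined above, the sketch below computes, for each industry group, the predicted call share and the share of firms in each GScore category. The record layout `(group, gscore, predicted_expanding)` and the sample values are assumptions made for the example only.

```python
from collections import Counter, defaultdict


def share_test(records):
    """Compute call share and GScore shares per industry group.

    records: iterable of (group, gscore, predicted_expanding) tuples,
             with gscore in 1..5 and predicted_expanding a bool.
    """
    gscore_counts = defaultdict(Counter)
    calls = Counter()
    totals = Counter()
    for group, gscore, predicted_expanding in records:
        gscore_counts[group][gscore] += 1
        calls[group] += int(predicted_expanding)
        totals[group] += 1
    return {
        group: {
            "call_share": calls[group] / totals[group],
            "gscore_shares": {
                score: gscore_counts[group][score] / totals[group]
                for score in range(1, 6)
            },
        }
        for group in totals
    }


# Made-up records for two hypothetical groups.
toy_records = [
    ("group_a", 5, True), ("group_a", 4, True),
    ("group_a", 1, False), ("group_a", 2, False),
    ("group_b", 3, False), ("group_b", 5, True),
]
result = share_test(toy_records)
high_share_a = result["group_a"]["gscore_shares"][4] + result["group_a"]["gscore_shares"][5]
```

Groups whose call share or combined GScore 4 and 5 share sits far from the other groups would then be the candidates for the outlier discussion in the text; how far counts as "unusually high" is not specified in the summary and is left open here.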
In the Prediction dataset, two groups had unusually high predicted call shares: in decreasing magnitude, Management of Companies (NAICS 550000:559999) and Primary and Fabricated Metals, Machines, and Furniture (NAICS 331000:333999 and 337000:337999). In this same dataset, the same 2 groups also had unusually high shares of firms in the GScore 4 and 5 categories, again in decreasing magnitude: Management of Companies (NAICS 550000:559999) and Primary and Fabricated Metals, Machines, and Furniture (NAICS 331000:333999 and 337000:337999). For these groups this may mean that they have a higher than typical propensity to expand, and/or that there is a fairly large group of relatively high performers.
Next, in terms of where those Gazelles might reside for 2020 and early 2021, the evidence reported in the points above suggests that the industries deserving a bit more attention, because of potentially anomalous high-performance evidence in both datasets and for multiple diagnostics, include: Education, Health, and Social Assistance (NAICS 600000:699999) and Management of Companies (NAICS 550000:559999). However, conflicting evidence, namely relatively lower fitting accuracy coupled with predictions of high performance, occurs for Mining/Oil Extraction and Power Generation, and Utilities (NAICS 200000:234999), Primary and Fabricated Metals, Machines, and Furniture (NAICS 331000:333999 and 337000:337999), and to a lesser extent Information Technology (NAICS 500000:519999). For these latter groups, the to-be-called lists of firms should be much larger and include firms with more diverse GScores, due to potentially noisy predicted performance. In contrast, the industries with the lowest propensity of firms to expand include: Advanced Manufacturing of Transport Equipment (NAICS 336000:336999), Advanced Manufacturing of Computers (NAICS 334000:334999), Finance and Insurance (NAICS 520000:529999), and Other Services (NAICS 800000:811999). Finding a Gazelle among these latter four groups may be more difficult than for the preceding industries.
Finally, several industries appear to be more sensitive to short-run contemporaneous signals than other industry groups. These seven industry groups have a higher share of the new contemporaneous and firm-level variables in their lists of top association weights. They include: Advanced Manufacturing of Electrical Devices (NAICS 335000:335999), Education, Health, and Social Assistance (NAICS 600000:699999), Administrative Support and Waste Remediation (NAICS 560000:599999), Advanced Manufacturing of Automotive and Aerospace (NAICS 336000:336999), Employment and Industrial Production levels, Management of Companies (NAICS 550000:559999), Advanced Manufacturing of Computers (NAICS 334000:334999), and finally, Retail Trade (NAICS 440000:479999). See Table 1 below for a summary of these unique industry anomalies and other stylized facts. For any industry group not listed in Table 1, expansion predictions and GScores may be interpreted without caveat when building lists.