sparse data

46 results


pages: 304 words: 82,395

Big Data: A Revolution That Will Transform How We Live, Work, and Think by Viktor Mayer-Schonberger, Kenneth Cukier

23andMe, Affordable Care Act / Obamacare, airport security, Apollo 11, barriers to entry, Berlin Wall, big data - Walmart - Pop Tarts, Black Swan, book scanning, book value, business intelligence, business process, call centre, cloud computing, computer age, correlation does not imply causation, dark matter, data science, double entry bookkeeping, Eratosthenes, Erik Brynjolfsson, game design, hype cycle, IBM and the Holocaust, index card, informal economy, intangible asset, Internet of things, invention of the printing press, Jeff Bezos, Joi Ito, lifelogging, Louis Pasteur, machine readable, machine translation, Marc Benioff, Mark Zuckerberg, Max Levchin, Menlo Park, Moneyball by Michael Lewis explains big data, Nate Silver, natural language processing, Netflix Prize, Network effects, obamacare, optical character recognition, PageRank, paypal mafia, performance metric, Peter Thiel, Plato's cave, post-materialism, random walk, recommendation engine, Salesforce, self-driving car, sentiment analysis, Silicon Valley, Silicon Valley startup, smart grid, smart meter, social graph, sparse data, speech recognition, Steve Jobs, Steven Levy, systematic bias, the scientific method, The Signal and the Noise by Nate Silver, The Wealth of Nations by Adam Smith, Thomas Davenport, Turing test, vertical integration, Watson beat the top human players on Jeopardy!

Netflix identified individual—Ryan Singel, “Netflix Spilled Your Brokeback Mountain Secret, Lawsuit Claims,” Wired, December 17, 2009 (http://www.wired.com/threatlevel/2009/12/netflix-privacy-lawsuit/). On the Netflix data release—Arvind Narayanan and Vitaly Shmatikov, “Robust De-Anonymization of Large Sparse Datasets,” Proceedings of the 2008 IEEE Symposium on Security and Privacy, p. 111 et seq. (http://www.cs.utexas.edu/~shmat/shmat_oak08netflix.pdf); Arvind Narayanan and Vitaly Shmatikov, “How to Break the Anonymity of the Netflix Prize Dataset,” October 18, 2006, arXiv:cs/0610105 [cs.CR] (http://arxiv.org/abs/cs/0610105).

“Space-Efficient Indexing of Chess Endgame Tables.” ICGA Journal 23, no. 3 (2000), pp. 148–162. Narayanan, Arvind, and Vitaly Shmatikov. “How to Break the Anonymity of the Netflix Prize Dataset.” October 18, 2006, arXiv:cs/0610105 (http://arxiv.org/abs/cs/0610105). ———. “Robust De-Anonymization of Large Sparse Datasets.” Proceedings of the 2008 IEEE Symposium on Security and Privacy, p. 111 (http://www.cs.utexas.edu/~shmat/shmat_oak08netflix.pdf). Nazareth, Rita, and Julia Leite. “Stock Trading in U.S. Falls to Lowest Level Since 2008.” Bloomberg, August 13, 2012 (http://www.bloomberg.com/news/2012-08-13/stock-trading-in-u-s-hits-lowest-level-since-2008-as-vix-falls.html).


pages: 519 words: 102,669

Programming Collective Intelligence by Toby Segaran

algorithmic management, always be closing, backpropagation, correlation coefficient, Debian, en.wikipedia.org, Firefox, full text search, functional programming, information retrieval, PageRank, prediction markets, recommendation engine, slashdot, social bookmarking, sparse data, Thomas Bayes, web application

In the movie example, since every critic has rated nearly every movie, the dataset is dense (not sparse). On the other hand, it would be unlikely to find two people with the same set of del.icio.us bookmarks—most bookmarks are saved by a small group of people, leading to a sparse dataset. Item-based filtering usually outperforms user-based filtering in sparse datasets, and the two perform about equally in dense datasets. Tip To learn more about the difference in performance between these algorithms, check out a paper called "Item-based Collaborative Filtering Recommendation Algorithms" by Sarwar et al. at http://citeseer.ist.psu.edu/sarwar01itembased.html.
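A minimal Python sketch (toy data; the names and numbers are illustrative, not from the book) of the contrast described above: the critic matrix is nearly full, while the bookmark matrix is mostly empty, which is the situation where item-based filtering tends to pull ahead.

critics = {
    "Lisa": {"Snakes": 4.5, "Dupree": 3.5, "Superman": 3.0},
    "Gene": {"Snakes": 3.5, "Dupree": 3.0, "Superman": 4.0},
}

bookmarks = {
    "alice": {"nytimes.com": 1, "arxiv.org": 1},
    "bob": {"arxiv.org": 1, "lobste.rs": 1},
}

def sparsity(prefs):
    # Fraction of missing (user, item) cells in a preference dictionary.
    items = {i for ratings in prefs.values() for i in ratings}
    cells = len(prefs) * len(items)
    filled = sum(len(r) for r in prefs.values())
    return 1 - filled / cells

print(sparsity(critics))    # 0.0  -> dense: user-based filtering works well
print(sparsity(bookmarks))  # 0.33 -> sparse: item-based usually does better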


The Ethical Algorithm: The Science of Socially Aware Algorithm Design by Michael Kearns, Aaron Roth

23andMe, affirmative action, algorithmic bias, algorithmic trading, Alignment Problem, Alvin Roth, backpropagation, Bayesian statistics, bitcoin, cloud computing, computer vision, crowdsourcing, data science, deep learning, DeepMind, Dr. Strangelove, Edward Snowden, Elon Musk, fake news, Filter Bubble, general-purpose programming language, Geoffrey Hinton, Google Chrome, ImageNet competition, Lyft, medical residency, Nash equilibrium, Netflix Prize, p-value, Pareto efficiency, performance metric, personalized medicine, pre–internet, profit motive, quantitative trading / quantitative finance, RAND corporation, recommendation engine, replication crisis, ride hailing / ride sharing, Robert Bork, Ronald Coase, self-driving car, short selling, sorting algorithm, sparse data, speech recognition, statistical model, Stephen Hawking, superintelligent machines, TED Talk, telemarketer, Turing machine, two-sided market, Vilfredo Pareto

References and Further Reading. Chapter 1: Algorithmic Privacy: From Anonymity to Noise. An extended discussion of successful “de-anonymization” attacks, including the Massachusetts GIC and Netflix cases, can be found in “Broken Promises of Privacy: Responding to the Surprising Failure of Anonymization” by Paul Ohm, which appeared in the UCLA Law Review 57 (2010). Details on the Netflix attack are described in “Robust De-anonymization of Large Sparse Datasets” by Arvind Narayanan and Vitaly Shmatikov, which was published in the IEEE Symposium on Security and Privacy (IEEE, 2008). Details of the original Genome-Wide Association Study attack can be found in “Resolving Individuals Contributing Trace Amounts of DNA to Highly Complex Mixtures Using High-Density SNP Genotyping Microarrays” by Nils Homer, Szabolcs Szelinger, Margot Redman, David Duggan, Waibhav Tembe, Jill Muehling, John V.


pages: 296 words: 78,631

Hello World: Being Human in the Age of Algorithms by Hannah Fry

23andMe, 3D printing, Air France Flight 447, Airbnb, airport security, algorithmic bias, algorithmic management, augmented reality, autonomous vehicles, backpropagation, Brixton riot, Cambridge Analytica, chief data officer, computer vision, crowdsourcing, DARPA: Urban Challenge, data science, deep learning, DeepMind, Douglas Hofstadter, driverless car, Elon Musk, fake news, Firefox, Geoffrey Hinton, Google Chrome, Gödel, Escher, Bach, Ignaz Semmelweis: hand washing, John Markoff, Mark Zuckerberg, meta-analysis, Northpointe / Correctional Offender Management Profiling for Alternative Sanctions, pattern recognition, Peter Thiel, RAND corporation, ransomware, recommendation engine, ride hailing / ride sharing, selection bias, self-driving car, Shai Danziger, Silicon Valley, Silicon Valley startup, Snapchat, sparse data, speech recognition, Stanislav Petrov, statistical model, Stephen Hawking, Steven Levy, systematic bias, TED Talk, Tesla Model S, The Wisdom of Crowds, Thomas Bayes, trolley problem, Watson beat the top human players on Jeopardy!, web of trust, William Langewiesche, you are the product

Jon Brodkin, ‘Senate votes to let ISPs sell your Web browsing history to advertisers’, Ars Technica, 23 March 2017, https://arstechnica.com/tech-policy/2017/03/senate-votes-to-let-isps-sell-your-web-browsing-history-to-advertisers/. 16. Svea Eckert and Andreas Dewes, ‘Dark data’, DEFCON Conference 25, 20 Oct. 2017, https://www.youtube.com/watch?v=1nvYGi7-Lxo. 17. The researchers based this part of their work on Arvind Narayanan and Vitaly Shmatikov, ‘Robust de-anonymization of large sparse datasets’, paper presented to IEEE Symposium on Security and Privacy, 18–22 May 2008. 18. Michal Kosinski, David Stillwell and Thore Graepel, ‘Private traits and attributes are predictable from digital records of human behavior’, Proceedings of the National Academy of Sciences, vol. 110, no. 15, 2013, pp. 5802–5. 19. Ibid. 20. Wu Youyou, Michal Kosinski and David Stillwell, ‘Computer-based personality judgments are more accurate than those made by humans’, Proceedings of the National Academy of Sciences, vol. 112, no. 4, 2015, pp. 1036–40. 21.


The Internet Trap: How the Digital Economy Builds Monopolies and Undermines Democracy by Matthew Hindman

A Declaration of the Independence of Cyberspace, accounting loophole / creative accounting, activist fund / activist shareholder / activist investor, AltaVista, Amazon Web Services, barriers to entry, Benjamin Mako Hill, bounce rate, business logic, Cambridge Analytica, cloud computing, computer vision, creative destruction, crowdsourcing, David Ricardo: comparative advantage, death of newspapers, deep learning, DeepMind, digital divide, discovery of DNA, disinformation, Donald Trump, fake news, fault tolerance, Filter Bubble, Firefox, future of journalism, Ida Tarbell, incognito mode, informal economy, information retrieval, invention of the telescope, Jeff Bezos, John Perry Barlow, John von Neumann, Joseph Schumpeter, lake wobegon effect, large denomination, longitudinal study, loose coupling, machine translation, Marc Andreessen, Mark Zuckerberg, Metcalfe’s law, natural language processing, Netflix Prize, Network effects, New Economic Geography, New Journalism, pattern recognition, peer-to-peer, Pepsi Challenge, performance metric, power law, price discrimination, recommendation engine, Robert Metcalfe, search costs, selection bias, Silicon Valley, Skype, sparse data, speech recognition, Stewart Brand, surveillance capitalism, technoutopianism, Ted Nelson, The Chicago School, the long tail, The Soul of a New Machine, Thomas Malthus, web application, Whole Earth Catalog, Yochai Benkler

So, for instance, a category might represent action movies, with movies with a lot of action at the top, and slow movies at the bottom, and correspondingly users who like action movies at the top, and those who prefer slow movies at the bottom.17 While this is true in theory, interpreting factors can be difficult in practice, as we shall see. svd had rarely been used with recommender systems because the technique performed poorly on “sparse” datasets, those (like the Netflix data) in which most of the values are missing. But Funk adapted the technique to ignore missing values, and found a way to implement the approach in only two lines of C code.18 Funk even titled the blog post explaining his method “Try This at Home,” encouraging other entrants to incorporate svd.
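A minimal sketch (synthetic ratings and illustrative hyperparameters; not Funk's actual code, and in Python rather than his C) of the adaptation described above: stochastic gradient descent visits only the observed cells, so missing values are simply never part of the computation.

import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, k = 5, 4, 2
# Observed ratings as (user, item, rating) triples; every other cell is missing.
ratings = [(0, 1, 4.0), (0, 3, 2.0), (1, 0, 5.0), (2, 2, 3.0), (4, 1, 1.0)]

U = 0.1 * rng.standard_normal((n_users, k))   # user factors
V = 0.1 * rng.standard_normal((n_items, k))   # item factors
lr, reg = 0.01, 0.02                          # learning rate, regularization

for _ in range(200):                          # epochs of SGD
    for u, i, r in ratings:
        err = r - U[u] @ V[i]                 # error on this observed cell only
        u_old = U[u].copy()
        U[u] += lr * (err * V[i] - reg * U[u])
        V[i] += lr * (err * u_old - reg * V[i])

print(np.round(U @ V.T, 2))  # dense predictions, including unobserved cells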


pages: 422 words: 104,457

Dragnet Nation: A Quest for Privacy, Security, and Freedom in a World of Relentless Surveillance by Julia Angwin

AltaVista, Ayatollah Khomeini, barriers to entry, bitcoin, Chelsea Manning, Chuck Templeton: OpenTable:, clean water, crowdsourcing, cuban missile crisis, data is the new oil, David Graeber, Debian, disinformation, Edward Snowden, Filter Bubble, Firefox, Free Software Foundation, Garrett Hardin, GnuPG, Google Chrome, Google Glasses, Ida Tarbell, incognito mode, informal economy, Jacob Appelbaum, John Gilmore, John Markoff, Julian Assange, Laura Poitras, Marc Andreessen, market bubble, market design, medical residency, meta-analysis, mutually assured destruction, operational security, Panopticon Jeremy Bentham, prediction markets, price discrimination, randomized controlled trial, RFID, Robert Shiller, Ronald Reagan, security theater, Silicon Valley, Silicon Valley startup, Skype, smart meter, sparse data, Steven Levy, Tragedy of the Commons, Upton Sinclair, WikiLeaks, Y2K, zero-sum game, Zimmermann PGP

“A Face Is Exposed for AOL Searcher No. 4417749,” New York Times, August 9, 2006, http://www.nytimes.com/2006/08/09/technology/09aol.html?_r=0&gwh=2CACC912D19D87BDFD3A39B96C429022. In 2008, researchers at the University of Texas: Arvind Narayanan and Vitaly Shmatikov, “Robust De-anonymization of Large Sparse Datasets,” Security and Privacy (2008): 111–25, http://www.cs.utexas.edu/~shmat/shmat_oak08netflix.pdf. In 2012, my Wall Street Journal team: Jennifer Valentino-Devries and Jeremy Singer-Vine, “They Know What You’re Shopping For,” Wall Street Journal, December 7, 2012, http://online.wsj.com/article/SB10001424127887324784404578143144132736214.html.


Reset by Ronald J. Deibert

23andMe, active measures, air gap, Airbnb, Amazon Web Services, Anthropocene, augmented reality, availability heuristic, behavioural economics, Bellingcat, Big Tech, bitcoin, blockchain, blood diamond, Brexit referendum, Buckminster Fuller, business intelligence, Cal Newport, call centre, Cambridge Analytica, carbon footprint, cashless society, Citizen Lab, clean water, cloud computing, computer vision, confounding variable, contact tracing, contact tracing app, content marketing, coronavirus, corporate social responsibility, COVID-19, crowdsourcing, data acquisition, data is the new oil, decarbonisation, deep learning, deepfake, Deng Xiaoping, disinformation, Donald Trump, Doomsday Clock, dual-use technology, Edward Snowden, Elon Musk, en.wikipedia.org, end-to-end encryption, Evgeny Morozov, failed state, fake news, Future Shock, game design, gig economy, global pandemic, global supply chain, global village, Google Hangouts, Great Leap Forward, high-speed rail, income inequality, information retrieval, information security, Internet of things, Jaron Lanier, Jeff Bezos, John Markoff, Lewis Mumford, liberal capitalism, license plate recognition, lockdown, longitudinal study, Mark Zuckerberg, Marshall McLuhan, mass immigration, megastructure, meta-analysis, military-industrial complex, move fast and break things, Naomi Klein, natural language processing, New Journalism, NSO Group, off-the-grid, Peter Thiel, planetary scale, planned obsolescence, post-truth, proprietary trading, QAnon, ransomware, Robert Mercer, Sheryl Sandberg, Shoshana Zuboff, Silicon Valley, single source of truth, Skype, Snapchat, social distancing, sorting algorithm, source of truth, sovereign wealth fund, sparse data, speech recognition, Steve Bannon, Steve Jobs, Stuxnet, surveillance capitalism, techlash, technological solutionism, the long tail, the medium is the message, The Structural Transformation of the Public Sphere, TikTok, TSMC, undersea cable, unit 8200, Vannevar Bush, WikiLeaks, zero day, zero-sum game

Drone pandemic: Will coronavirus invite the world to meet Big Brother? Retrieved from https://thebulletin.org/2020/04/drone-pandemic-will-coronavirus-invite-the-world-to-meet-big-brother/ How easy it is to unmask real identities contained in large personal data sets: Narayanan, A., & Shmatikov, V. (2008). Robust de-anonymization of large sparse datasets. IEEE Symposium on Security and Privacy, 111–125. http://doi.org/10.1109/SP.2008.33 “At least eight surveillance and cyber-intelligence companies attempting to sell repurposed spy and law enforcement tools”: Schectman, J., Bing, C., & Stubbs, J. (2020, April 28). Cyber-intel firms pitch governments on spy tools to trace coronavirus.


pages: 448 words: 117,325

Click Here to Kill Everybody: Security and Survival in a Hyper-Connected World by Bruce Schneier

23andMe, 3D printing, air gap, algorithmic bias, autonomous vehicles, barriers to entry, Big Tech, bitcoin, blockchain, Brian Krebs, business process, Citizen Lab, cloud computing, cognitive bias, computer vision, connected car, corporate governance, crowdsourcing, cryptocurrency, cuban missile crisis, Daniel Kahneman / Amos Tversky, David Heinemeier Hansson, disinformation, Donald Trump, driverless car, drone strike, Edward Snowden, Elon Musk, end-to-end encryption, fault tolerance, Firefox, Flash crash, George Akerlof, incognito mode, industrial robot, information asymmetry, information security, Internet of things, invention of radio, job automation, job satisfaction, John Gilmore, John Markoff, Kevin Kelly, license plate recognition, loose coupling, market design, medical malpractice, Minecraft, MITM: man-in-the-middle, move fast and break things, national security letter, Network effects, Nick Bostrom, NSO Group, pattern recognition, precautionary principle, printed gun, profit maximization, Ralph Nader, RAND corporation, ransomware, real-name policy, Rodney Brooks, Ross Ulbricht, security theater, self-driving car, Seymour Hersh, Shoshana Zuboff, Silicon Valley, smart cities, smart transportation, Snapchat, sparse data, Stanislav Petrov, Stephen Hawking, Stuxnet, supply-chain attack, surveillance capitalism, The Market for Lemons, Timothy McVeigh, too big to fail, Uber for X, Unsafe at Any Speed, uranium enrichment, Valery Gerasimov, Wayback Machine, web application, WikiLeaks, Yochai Benkler, zero day

That’s a calm year for me; in 2015, my average speed was 33 miles per hour. 144 It wasn’t always like this: This is a good summary: Mark Hansen, Carolyn McAndrews, and Emily Berkeley (Jul 2008), “History of aviation safety oversight in the United States,” DOT/FAA/AR-08-39, National Technical Information Service, http://www.tc.faa.gov/its/worldpac/techrpt/ar0839.pdf. 144 The result is that today: The taxi ride to the airport is the most dangerous part of the trip. 145 Whenever industry groups write about this: Here’s one example: Coalition for Cybersecurity and Policy and Law (26 Oct 2017), “New whitepaper: Building a national cybersecurity strategy: Voluntary, flexible frameworks,” Center for Responsible Enterprise and Trade, https://create.org/news/new-whitepaper-building-national-cybersecurity-strategy. 145 The Federal Aviation Administration has: April Glaser (15 Mar 2017), “Federal privacy laws won’t necessarily protect you from spying drones,” Recode, https://www.recode.net/2017/3/15/14934050/federal-privacy-laws-spying-drones-senate-hearing. 148 in 2006, Netflix published 100 million: Katie Hafner (2 Oct 2006), “And if you liked the movie, a Netflix contest may reward you handsomely,” New York Times, http://www.nytimes.com/2006/10/02/technology/02netflix.html. 148 Researchers were able to de-anonymize: Arvind Narayanan and Vitaly Shmatikov (18 May 2008), “Robust de-anonymization of large sparse datasets,” 2008 IEEE Symposium on Security and Privacy (SP ’08), https://dl.acm.org/citation.cfm?id=1398064. 148 which surprised pretty much everyone: Paul Ohm (13 Aug 2009), “Broken promises of privacy: Responding to the surprising failure of anonymization,” UCLA Law Review 57, https://papers.ssrn.com/sol3/papers.cfm?


pages: 598 words: 134,339

Data and Goliath: The Hidden Battles to Collect Your Data and Control Your World by Bruce Schneier

23andMe, Airbnb, airport security, AltaVista, Anne Wojcicki, AOL-Time Warner, augmented reality, behavioural economics, Benjamin Mako Hill, Black Swan, Boris Johnson, Brewster Kahle, Brian Krebs, call centre, Cass Sunstein, Chelsea Manning, citizen journalism, Citizen Lab, cloud computing, congestion charging, data science, digital rights, disintermediation, drone strike, Eben Moglen, Edward Snowden, end-to-end encryption, Evgeny Morozov, experimental subject, failed state, fault tolerance, Ferguson, Missouri, Filter Bubble, Firefox, friendly fire, Google Chrome, Google Glasses, heat death of the universe, hindsight bias, informal economy, information security, Internet Archive, Internet of things, Jacob Appelbaum, James Bridle, Jaron Lanier, John Gilmore, John Markoff, Julian Assange, Kevin Kelly, Laura Poitras, license plate recognition, lifelogging, linked data, Lyft, Mark Zuckerberg, moral panic, Nash equilibrium, Nate Silver, national security letter, Network effects, Occupy movement, operational security, Panopticon Jeremy Bentham, payday loans, pre–internet, price discrimination, profit motive, race to the bottom, RAND corporation, real-name policy, recommendation engine, RFID, Ross Ulbricht, satellite internet, self-driving car, Shoshana Zuboff, Silicon Valley, Skype, smart cities, smart grid, Snapchat, social graph, software as a service, South China Sea, sparse data, stealth mode startup, Steven Levy, Stuxnet, TaskRabbit, technological determinism, telemarketer, Tim Cook: Apple, transaction costs, Uber and Lyft, uber lyft, undersea cable, unit 8200, urban planning, Wayback Machine, WikiLeaks, workplace surveillance , Yochai Benkler, yottabyte, zero day

researchers were able to attach names: Michael Barbaro and Tom Zeller Jr. (9 Aug 2006), “A face is exposed for AOL Searcher No. 4417749,” New York Times, http://www.nytimes.com/2006/08/09/technology/09aol.html. Researchers were able to de-anonymize people: Arvind Narayanan and Vitaly Shmatikov (18–20 May 2008), “Robust de-anonymization of large sparse datasets,” 2008 IEEE Symposium on Security and Privacy, Oakland, California, http://dl.acm.org/citation.cfm?id=1398064 and http://www.cs.utexas.edu/~shmat/shmat_oak08netflix.pdf. correlation opportunities pop up: Also for research purposes, in the mid-1990s the Massachusetts Group Insurance Commission released hospital records from state employees with the names, addresses, and Social Security numbers removed.


Data Mining: Concepts and Techniques: Concepts and Techniques by Jiawei Han, Micheline Kamber, Jian Pei

backpropagation, bioinformatics, business intelligence, business process, Claude Shannon: information theory, cloud computing, computer vision, correlation coefficient, cyber-physical system, database schema, discrete time, disinformation, distributed generation, finite state, industrial research laboratory, information retrieval, information security, iterative process, knowledge worker, linked data, machine readable, natural language processing, Netflix Prize, Occam's razor, pattern recognition, performance metric, phenotype, power law, random walk, recommendation engine, RFID, search costs, semantic web, seminal paper, sentiment analysis, sparse data, speech recognition, statistical model, stochastic process, supply-chain management, text mining, thinkpad, Thomas Bayes, web application

Figure 3.5 Principal components analysis. Y1 and Y2 are the first two principal components for the given data.

PCA can be applied to ordered and unordered attributes, and can handle sparse data and skewed data. Multidimensional data of more than two dimensions can be handled by reducing the problem to two dimensions. Principal components may be used as inputs to multiple regression and cluster analysis. In comparison with wavelet transforms, PCA tends to be better at handling sparse data, whereas wavelet transforms are more suitable for data of high dimensionality.

3.4.4 Attribute Subset Selection. Data sets for analysis may contain hundreds of attributes, many of which may be irrelevant to the mining task or redundant.
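A minimal NumPy sketch (toy data, not from the book) of the PCA steps behind Figure 3.5: center each attribute, eigendecompose the covariance matrix, and project onto the two leading eigenvectors, the Y1 and Y2 of the figure.

import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 5)) @ rng.standard_normal((5, 5))  # correlated toy data

Xc = X - X.mean(axis=0)                # center each attribute
cov = np.cov(Xc, rowvar=False)         # covariance between attributes
vals, vecs = np.linalg.eigh(cov)       # eigh: covariance matrix is symmetric
order = np.argsort(vals)[::-1]         # sort components by decreasing variance
Y = Xc @ vecs[:, order[:2]]            # project onto Y1 and Y2

print(Y.shape)  # (100, 2): the data reduced to its first two principal components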

Subject index (excerpt): space-filling curve 58; sparse data 102; sparse data cubes 190; sparsest cuts 539; sparsity coefficient 579; spatial data 14; spatial data mining 595, 623–624.

We then discuss how object dissimilarity can be computed for objects described by nominal attributes (Section 2.4.2), by binary attributes (Section 2.4.3), by numeric attributes (Section 2.4.4), by ordinal attributes (Section 2.4.5), or by combinations of these attribute types (Section 2.4.6). Section 2.4.7 provides similarity measures for very long and sparse data vectors, such as term-frequency vectors representing documents in information retrieval. Knowing how to compute dissimilarity is useful in studying attributes and will also be referenced in later topics on clustering (Chapter 10 and Chapter 11), outlier analysis (Chapter 12), and nearest-neighbor classification (Chapter 9). 2.4.1.
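A minimal sketch (illustrative term-frequency vectors) of the kind of measure Section 2.4.7 provides for long, sparse document vectors: cosine similarity, where the dot product picks up only the terms the two documents share.

import numpy as np

def cosine(x, y):
    # cos(x, y) = x.y / (||x|| ||y||)
    return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

# Term-frequency vectors: most entries are zero, as in real document data.
doc1 = np.array([5, 0, 3, 0, 2, 0, 0, 2, 0, 0])
doc2 = np.array([3, 0, 2, 0, 1, 1, 0, 1, 0, 1])

print(round(cosine(doc1, doc2), 3))  # close to 1 means very similar documents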


pages: 586 words: 186,548

Architects of Intelligence by Martin Ford

3D printing, agricultural Revolution, AI winter, algorithmic bias, Alignment Problem, AlphaGo, Apple II, artificial general intelligence, Asilomar, augmented reality, autonomous vehicles, backpropagation, barriers to entry, basic income, Baxter: Rethink Robotics, Bayesian statistics, Big Tech, bitcoin, Boeing 747, Boston Dynamics, business intelligence, business process, call centre, Cambridge Analytica, cloud computing, cognitive bias, Colonization of Mars, computer vision, Computing Machinery and Intelligence, correlation does not imply causation, CRISPR, crowdsourcing, DARPA: Urban Challenge, data science, deep learning, DeepMind, Demis Hassabis, deskilling, disruptive innovation, Donald Trump, Douglas Hofstadter, driverless car, Elon Musk, Erik Brynjolfsson, Ernest Rutherford, fake news, Fellow of the Royal Society, Flash crash, future of work, general purpose technology, Geoffrey Hinton, gig economy, Google X / Alphabet X, Gödel, Escher, Bach, Hans Moravec, Hans Rosling, hype cycle, ImageNet competition, income inequality, industrial research laboratory, industrial robot, information retrieval, job automation, John von Neumann, Large Hadron Collider, Law of Accelerating Returns, life extension, Loebner Prize, machine translation, Mark Zuckerberg, Mars Rover, means of production, Mitch Kapor, Mustafa Suleyman, natural language processing, new economy, Nick Bostrom, OpenAI, opioid epidemic / opioid crisis, optical character recognition, paperclip maximiser, pattern recognition, phenotype, Productivity paradox, radical life extension, Ray Kurzweil, recommendation engine, Robert Gordon, Rodney Brooks, Sam Altman, self-driving car, seminal paper, sensor fusion, sentiment analysis, Silicon Valley, smart cities, social intelligence, sparse data, speech recognition, statistical model, stealth mode startup, stem cell, Stephen Hawking, Steve Jobs, Steve Wozniak, Steven Pinker, strong AI, superintelligent machines, synthetic biology, systems thinking, Ted Kaczynski, TED Talk, The Rise and Fall of American Growth, theory of mind, Thomas Bayes, Travis Kalanick, Turing test, universal basic income, Wall-E, Watson beat the top human players on Jeopardy!, women in the workforce, working-age population, workplace surveillance , zero-sum game, Zipcar

JUDEA PEARL: Neural networks and reinforcement learning will all be essential components when properly utilized in causal modeling. MARTIN FORD: So, you think it might be a hybrid system that incorporates not just neural networks, but other ideas from other areas of AI? JUDEA PEARL: Absolutely. Even today, people are building hybrid systems when you have sparse data. There’s a limit, however, to how much you can extrapolate or interpolate sparse data if you want to get cause-effect relationships. Even if you have infinite data, you can’t tell the difference between A causes B and B causes A. MARTIN FORD: If someday we have strong AI, do you think that a machine could be conscious, and have some kind of inner experience like a human being?

Early on, we used these ideas from Bayesian statistics, Bayesian inference, and Bayesian networks, to use the mathematics of probability theory to formulate how people’s mental models of the causal structure of the world might work. It turns out that tools that were developed by mathematicians, physicists, and statisticians to make inferences from very sparse data in a statistical setting were being deployed in the 1990s in machine learning and AI, and it revolutionized the field. It was part of the move from an earlier symbolic paradigm for AI to a more statistical paradigm. To me, that was a very, very powerful way to think about how our minds were able to make inferences from sparse data. In the last ten years or so, our interests have turned more to where these mental models come from. We’re looking at the minds and brains of babies and young children, and really trying to understand the most basic kind of learning processes that build our basic common-sense understanding of the world.

However, it seemed like we still didn’t really have a handle on what intelligence is really about—a flexible, general-purpose intelligence that allows you to do all of those things that you can do. 10 years ago in cognitive science, we had a bunch of really satisfying models of individual cognitive capacities using this mathematics of ways people made inferences from sparse data, but we didn’t have a unifying theory. We had tools, but we didn’t have any kind of model of common sense. If you look at machine learning and AI technologies, and this is as true now as it was ten years ago, we were increasingly getting machine systems that did remarkable things that we used to think only humans could do.


pages: 625 words: 167,349

The Alignment Problem: Machine Learning and Human Values by Brian Christian

Albert Einstein, algorithmic bias, Alignment Problem, AlphaGo, Amazon Mechanical Turk, artificial general intelligence, augmented reality, autonomous vehicles, backpropagation, butterfly effect, Cambridge Analytica, Cass Sunstein, Claude Shannon: information theory, computer vision, Computing Machinery and Intelligence, data science, deep learning, DeepMind, Donald Knuth, Douglas Hofstadter, effective altruism, Elaine Herzberg, Elon Musk, Frances Oldham Kelsey, game design, gamification, Geoffrey Hinton, Goodhart's law, Google Chrome, Google Glasses, Google X / Alphabet X, Gödel, Escher, Bach, Hans Moravec, hedonic treadmill, ImageNet competition, industrial robot, Internet Archive, John von Neumann, Joi Ito, Kenneth Arrow, language acquisition, longitudinal study, machine translation, mandatory minimum, mass incarceration, multi-armed bandit, natural language processing, Nick Bostrom, Norbert Wiener, Northpointe / Correctional Offender Management Profiling for Alternative Sanctions, OpenAI, Panopticon Jeremy Bentham, pattern recognition, Peter Singer: altruism, Peter Thiel, precautionary principle, premature optimization, RAND corporation, recommendation engine, Richard Feynman, Rodney Brooks, Saturday Night Live, selection bias, self-driving car, seminal paper, side project, Silicon Valley, Skinner box, sparse data, speech recognition, Stanislav Petrov, statistical model, Steve Jobs, strong AI, the map is not the territory, theory of mind, Tim Cook: Apple, W. E. B. Du Bois, Wayback Machine, zero-sum game

For simplicity, we focus our discussion on the former, but both approaches have advantages, though they tend to result ultimately in fairly similar models. 55. Shannon, “A Mathematical Theory of Communication.” 56. See Jelinek and Mercer, “Interpolated Estimation of Markov Source Parameters from Sparse Data,” and Katz, “Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recognizer”; for an overview, see Manning and Schütze, Foundations of Statistical Natural Language Processing. 57. This famous phrase originated in Bellman, Dynamic Programming. 58. See Hinton, “Learning Distributed Representations of Concepts,” and “Connectionist Learning Procedures,” and Rumelhart and McClelland, Parallel Distributed Processing. 59.

New York: Macmillan, 1892. Jaynes, Edwin T. “Information Theory and Statistical Mechanics.” Physical Review 106, no. 4 (1957): 620–30. Jefferson, Thomas. Notes on the State of Virginia. Paris, 1785. Jelinek, Fred, and Robert L. Mercer. “Interpolated Estimation of Markov Source Parameters from Sparse Data.” In Proceedings, Workshop on Pattern Recognition in Practice, edited by Edzard S. Gelsema and Laveen N. Kanal, 381–97. 1980. Jeon, Hong Jun, Smitha Milli, and Anca D. Drăgan. “Reward-Rational (Implicit) Choice: A Unifying Formalism for Reward Learning.” arXiv Preprint arXiv:2002.04833, 2020. Joffe-Walt, Chana.

“Curiosity and Interest: The Benefits of Thriving on Novelty and Challenge.” Oxford Handbook of Positive Psychology 2 (2009): 367–74. Kasparov, Garry. How Life Imitates Chess: Making the Right Moves, from the Board to the Boardroom. Bloomsbury USA, 2007. Katz, Slava. “Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recognizer.” IEEE Transactions on Acoustics, Speech, and Signal Processing 35, no. 3 (1987): 400–01. Kellogg, Winthrop Niles, and Luella Agger Kellogg. The Ape and the Child: A Comparative Study of the Environmental Influence upon Early Behavior. Whittlesey House, 1933.


pages: 62 words: 14,996

SciPy and NumPy by Eli Bressert

Debian, Guido van Rossum, p-value, sparse data

This means that the sparse matrix was 100 times more memory efficient and the Eigen operation was roughly 150 times faster than the non-sparse cases. Tip If you’re unfamiliar with sparse matrices, I suggest reading http://www.scipy.org/SciPyPackages/Sparse, where the basics on sparse matrices and operations are discussed. In 2D and 3D geometry, there are many sparse data structures used in fields like engineering, computational fluid dynamics, electromagnetism, thermodynamics, and acoustics. Non-geometric instances of sparse matrices are applicable to optimization, economic modeling, mathematics and statistics, and network/graph theories. Using scipy.io, you can read and write common sparse matrix file formats such as Matrix Market and Harwell-Boeing, or load MatLab files.
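A minimal sketch (not from the book) of the memory point above, including a round-trip through the Matrix Market format that scipy.io supports:

import numpy as np
from scipy import sparse, io

dense = np.zeros((1000, 1000))
dense[0, :100] = np.arange(1, 101)   # only 100 of 1,000,000 cells are non-zero

csr = sparse.csr_matrix(dense)       # compressed sparse row storage
print(dense.nbytes, csr.data.nbytes) # 8000000 vs 800 bytes for the values

io.mmwrite("demo.mtx", csr)          # Matrix Market file, via scipy.io
loaded = io.mmread("demo.mtx").tocsr()
print((loaded != csr).nnz == 0)      # True: identical after the round-trip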


pages: 589 words: 69,193

Mastering Pandas by Femi Anthony

Amazon Web Services, Bayesian statistics, correlation coefficient, correlation does not imply causation, data science, Debian, en.wikipedia.org, Internet of things, Large Hadron Collider, natural language processing, p-value, power law, random walk, side project, sparse data, statistical model, Thomas Bayes

It is not a public API. panel.py, panel4d.py, and panelnd.py: These provide the functionality for the pandas Panel object. series.py: This defines the pandas Series class and the various methods that Series inherits from NDFrame and IndexOpsMixin. sparse.py: This defines imports for handling sparse data structures. Sparse data structures are compressed, whereby data points matching NaN or missing values are omitted. For more information on this, go to http://pandas.pydata.org/pandas-docs/stable/sparse.html. strings.py: This has various functions for handling strings. pandas/io This module contains various modules for data I/O.
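A minimal sketch of the sparse structures described above, using the current pandas API (pd.SparseDtype; the sparse.py/SparseSeries machinery the book describes has since been deprecated):

import numpy as np
import pandas as pd

s = pd.Series([0.0, np.nan, np.nan, 1.5, np.nan] * 1000)
sparse_s = s.astype(pd.SparseDtype("float64", fill_value=np.nan))

print(s.memory_usage())         # full storage: every element kept
print(sparse_s.memory_usage())  # only the non-NaN points are stored
print(sparse_s.sparse.density)  # fraction of cells actually stored: 0.4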


Statistics in a Nutshell by Sarah Boslaugh

Antoine Gombaud: Chevalier de Méré, Bayesian statistics, business climate, computer age, confounding variable, correlation coefficient, experimental subject, Florence Nightingale: pie chart, income per capita, iterative process, job satisfaction, labor-force participation, linear programming, longitudinal study, meta-analysis, p-value, pattern recognition, placebo effect, probability theory / Blaise Pascal / Pierre de Fermat, publication bias, purchasing power parity, randomized controlled trial, selection bias, six sigma, sparse data, statistical model, systematic bias, The Design of Experiments, the scientific method, Thomas Bayes, Two Sigma, Vilfredo Pareto

A researcher might collect exact information on the number of children per household (0 children, 1 child, 2 children, 3 children, etc.) but choose to group this data into categories for the purpose of analysis, such as 0 children, 1–2 children, and 3 or more children. This type of grouping is often used if there are large numbers of categories and some of them contain sparse data. In the case of the number of children in a household, for instance, a data set might include a relatively few households with large numbers of children, and the low frequencies in those categories can adversely affect the power of the study or make it impossible to use certain analytical techniques.

The Pearson’s chi-square test is suitable for data in which all observations are independent (the same person is not measured twice, for instance) and the categories are mutually exclusive and exhaustive (so that no case may be classified into more than one cell, and all potential cases can be classified into one of the cells). It is also assumed that no cell has an expected value less than 1, and no more than 20% of the cells have an expected value less than 5. The reason for the last two requirements is that the chi-square is an asymptotic test and might not be valid for sparse data (data in which one or more cells have a low expected frequency). Yates’s correction for continuity is a procedure developed by the British statistician Frank Yates for the chi-square test of independence when applied to 2×2 tables. The chi-square distribution is continuous, whereas the data used in a chi-square test is discrete, and Yates’s correction is meant to correct for this discrepancy.

Use of Yates’s correction is not universally endorsed, however; some researchers feel that it might be an overcorrection leading to a loss of power and increased probability of a Type II error (wrongly failing to reject the null hypothesis). Some statisticians reject the use of Yates’s correction entirely, although some find it useful with sparse data, particularly when at least one cell in the table has an expected cell frequency of less than 5. A less controversial remedy for sparse categorical data is to use Fisher’s exact test, discussed later, instead of the chi-square test, when the distributional assumptions previously named (no more than 20% of cells with an expected value less than 5 and no cell with an expected value of less than 1) are not met.
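A minimal sketch (illustrative counts) of the remedies discussed above, applied to a sparse 2×2 table with small expected frequencies:

from scipy.stats import chi2_contingency, fisher_exact

table = [[2, 9],   # some expected cell counts fall below 5
         [8, 4]]

chi2, p, dof, expected = chi2_contingency(table, correction=True)  # Yates
print(expected)           # shows which expected counts are < 5
print(round(p, 3))        # chi-square p-value (may be unreliable here)

odds, p_exact = fisher_exact(table)
print(round(p_exact, 3))  # Fisher's exact test: valid even for sparse tables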


pages: 504 words: 89,238

Natural language processing with Python by Steven Bird, Ewan Klein, Edward Loper

bioinformatics, business intelligence, business logic, Computing Machinery and Intelligence, conceptual framework, Donald Knuth, duck typing, elephant in my pajamas, en.wikipedia.org, finite state, Firefox, functional programming, Guido van Rossum, higher-order functions, information retrieval, language acquisition, lolcat, machine translation, Menlo Park, natural language processing, P = NP, search inside the book, sparse data, speech recognition, statistical model, text mining, Turing test, W. E. B. Du Bois

Its overall accuracy score is very low:

>>> bigram_tagger.evaluate(test_sents)
0.10276088906608193

As n gets larger, the specificity of the contexts increases, as does the chance that the data we wish to tag contains contexts that were not present in the training data. This is known as the sparse data problem, and is quite pervasive in NLP. As a consequence, there is a trade-off between the accuracy and the coverage of our results (and this is related to the precision/recall trade-off in information retrieval). Caution! N-gram taggers should not consider context that crosses a sentence boundary.
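A minimal sketch, in the book's NLTK idiom, of the standard remedy for the sparse data problem that the chapter develops next: chain taggers so the bigram tagger backs off to a unigram tagger, and that in turn to a default tag (assumes the Brown corpus has been fetched with nltk.download('brown')):

import nltk
from nltk.corpus import brown

tagged = brown.tagged_sents(categories='news')
train_sents, test_sents = tagged[:4000], tagged[4000:]

t0 = nltk.DefaultTagger('NN')
t1 = nltk.UnigramTagger(train_sents, backoff=t0)
t2 = nltk.BigramTagger(train_sents, backoff=t1)  # falls back on unseen contexts

print(t2.evaluate(test_sents))  # far better than the bigram tagger alone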

• A dictionary is used to map between arbitrary types of information, such as a string and a number: freq['cat'] = 12. We create dictionaries using the brace notation: pos = {}, pos = {'furiously': 'adv', 'ideas': 'n', 'colorless': 'adj'}.
• N-gram taggers can be defined for large values of n, but once n is larger than 3, we usually encounter the sparse data problem; even with a large quantity of training data, we see only a tiny fraction of possible contexts.
• Transformation-based tagging involves learning a series of repair rules of the form “change tag s to tag t in context c,” where each rule fixes mistakes and possibly introduces a (smaller) number of errors.

5.9 Further Reading. Extra materials for this chapter are posted at http://www.nltk.org/, including links to freely available resources on the Web.

◑ Consider the regular expression tagger developed in the exercises in the previous section. Evaluate the tagger using its accuracy() method, and try to come up with ways to improve its performance. Discuss your findings. How does objective evaluation help in the development process? ◑ How serious is the sparse data problem? Investigate the performance of n-gram taggers as n increases from 1 to 6. Tabulate the accuracy score. Estimate the training data required for these taggers, assuming a vocabulary size of 10^5 and a tagset size of 10^2. ◑ Obtain some tagged data for another language, and train and evaluate a variety of taggers on it.


RDF Database Systems: Triples Storage and SPARQL Query Processing by Olivier Cure, Guillaume Blin

Amazon Web Services, bioinformatics, business intelligence, cloud computing, database schema, fault tolerance, folksonomy, full text search, functional programming, information retrieval, Internet Archive, Internet of things, linked data, machine readable, NP-complete, peer-to-peer, performance metric, power law, random walk, recommendation engine, RFID, semantic web, Silicon Valley, social intelligence, software as a service, SPARQL, sparse data, web application

An important drawback is related to the fact that most useful queries rarely retrieve all the information from a given tuple, but rather retrieve only a subset of it. That implies that a large portion of the tuple’s data is unnecessarily transferred into the main memory. This has an impact on the I/O efficiency of row stores. In Abadi (2007) the author states that column stores are good candidates for extremely wide tables and for databases handling sparse data. The paper demonstrates the potential of column stores for the Semantic Web through the storage of RDF. Based on these remarks, it’s not a surprise that the current trend with database vendors emphasizes that column stores are getting more popular and can, in fact, compete with row stores in many use cases.

A typical architecture may require a NoSQL key value store for serving fast access to cached data, a standard RDBMS or NewSQL database to support high transaction rates, and an RDF store to serve as a data warehouse and to enable data integration of Linked Open Data. REFERENCES Abadi, D.J., 2007. Column stores for wide and sparse data. CIDR, 292–297. Abadi, D., Madden, S., Ferreira, M., 2006. Integrating compression and execution in column-oriented database systems. In: Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data. ACM Press, New York, pp. 671–682. Abadi, D.J., Marcus, A., Madden, S., Hollenbach, K.J., 2007a.


pages: 161 words: 39,526

Applied Artificial Intelligence: A Handbook for Business Leaders by Mariya Yao, Adelyn Zhou, Marlene Jia

Airbnb, algorithmic bias, AlphaGo, Amazon Web Services, artificial general intelligence, autonomous vehicles, backpropagation, business intelligence, business process, call centre, chief data officer, cognitive load, computer vision, conceptual framework, data science, deep learning, DeepMind, en.wikipedia.org, fake news, future of work, Geoffrey Hinton, industrial robot, information security, Internet of things, iterative process, Jeff Bezos, job automation, machine translation, Marc Andreessen, natural language processing, new economy, OpenAI, pattern recognition, performance metric, price discrimination, randomized controlled trial, recommendation engine, robotic process automation, Salesforce, self-driving car, sentiment analysis, Silicon Valley, single source of truth, skunkworks, software is eating the world, source of truth, sparse data, speech recognition, statistical model, strong AI, subscription business, technological singularity, The future is already here

Even then, neural networks trained on tiger photos do not reliably recognize abstractions or representations of tigers, such as cartoons or costumes. Because we are Systems That Master, humans have no trouble with this. A System That Masters is an intelligent agent capable of constructing abstract concepts and strategic plans from sparse data. By creating modular, conceptual representations of the world around us, we are able to transfer knowledge from one domain to another, a key feature of general intelligence. As we discussed earlier, no modern AI system is an AGI, or artificial general intelligence. While humans are Systems That Master, current AI programs are not.


pages: 397 words: 113,304

Spineless: The Science of Jellyfish and the Art of Growing a Backbone by Juli Berwald

clean water, complexity theory, crowdsourcing, Downton Abbey, Great Leap Forward, Gregor Mendel, Intergovernmental Panel on Climate Change (IPCC), Kickstarter, microplastics / micro fibres, ocean acidification, Panamax, rent control, Ronald Reagan, Skype, sparse data, stem cell, Suez canal 1869, TED Talk, the scientific method, Wilhelm Olbers

The NCEAS team collected thirty-seven datasets spanning the years 1970 to 2011. Before 1970 there were fewer than ten datasets published on jellyfish in any year. After that, the number of datasets increased into the twenties and thirties. Some NCEAS members believed that comparing the period before 1970 with the period after 1970 didn’t make sense. During the years of sparse data, each bit of information carried more weight than during years that were more data rich. This could skew the analysis by giving a disproportionate impact to earlier data relative to later data. Other members argued that if data existed, it needed to be included, otherwise deciding what data to include and what to exclude imposed biases.

Lucas explained that if all the data were included, the analysis showed that the abundances of jellyfish oscillate in a cycle that repeats roughly every twenty years. The most recent upswing started in 2004, and we’re still in it. Jellyfish have been noticed more, not because of some aberration, but because we are on the part of the normal cycle that’s tracking upward. But if you did the analysis excluding the sparse data before 1970, the conclusion was different. Over the past forty years, the data revealed an oscillation, but that up-and-down cycle was superimposed on an overall increase in jellyfish abundances. This difference in how to perform the analysis—with or without the data before 1970—caused the rift in the NCEAS group.


pages: 398 words: 31,161

Gnuplot in Action: Understanding Data With Graphs by Philipp Janert

bioinformatics, business intelligence, Debian, general-purpose programming language, iterative process, mandelbrot fractal, pattern recognition, power law, random walk, Richard Stallman, six sigma, sparse data, survivorship bias

You can always come back to this chapter when you need a specific plot type.

5.1 Choosing plot styles. Different types of data call for different display styles. For instance, it makes sense to plot a smooth function with one continuous line, but to use separate symbols for a sparse data set where each individual point counts. Experimental data often requires error bars together with the data, whereas counting statistics call for histograms. Choosing an appropriate style for the data leads to graphs that are both informative and aesthetically pleasing. There are two ways to choose a style for the data: inline, as part of the plot command, or globally, using the set style directive.

Since lines are such fundamental objects, I have collected all this material in a separate section at the end of this chapter for easier reference (section 5.3). The linespoints style is a combination of the previous two: each data point is marked with a symbol, and adjacent points are connected with straight lines. This style is mostly useful for sparse data sets. DOTS. The dots style prints a “minimal” dot (a single pixel for bitmap terminals) for each data point. This style is occasionally useful for very large, unsorted data sets (such as large scatter plots). Figure 1.2 in chapter 1 was drawn using dots. 5.2.2 Box styles. Box styles, which draw a box of finite width, are sometimes useful for counting statistics, or for other data sets where the x values cannot take on a continuous spectrum of values.


pages: 260 words: 78,229

Without Conscience: The Disturbing World of the Psychopaths Among Us by Robert D. Hare

classic study, delayed gratification, In Cold Blood by Truman Capote, junk bonds, longitudinal study, Norman Mailer, Savings and loan crisis, sparse data, systems thinking, twin studies

Even more frightening is the possibility that “cool” but vicious psychopaths will become twisted role models for children raised in dysfunctional families or disintegrating communities where little value is placed on honesty, fair play, and concern for the welfare of others. “WHAT HAVE I DONE?” It is hard to imagine any parent of a psychopath who has not asked the question, almost certainly with a sense of desperation, “What have I done wrong as a parent to bring this about in my child?” The answer is, possibly nothing. To summarize our sparse data, we do not know why people become psychopaths, but current evidence leads us away from the commonly held idea that the behavior of parents bears sole or even primary responsibility for the disorder. This does not mean that parents and the environment are completely off the hook. Parenting behavior may not be responsible for the essential ingredients of the disorder, but it may have a great deal to do with how the syndrome develops and is expressed.


pages: 294 words: 77,356

Automating Inequality by Virginia Eubanks

autonomous vehicles, basic income, Black Lives Matter, business process, call centre, cognitive dissonance, collective bargaining, correlation does not imply causation, data science, deindustrialization, digital divide, disruptive innovation, Donald Trump, driverless car, Elon Musk, ending welfare as we know it, experimental subject, fake news, gentrification, housing crisis, Housing First, IBM and the Holocaust, income inequality, job automation, mandatory minimum, Mark Zuckerberg, mass incarceration, minimum wage unemployment, mortgage tax deduction, new economy, New Urbanism, payday loans, performance metric, Ronald Reagan, San Francisco homelessness, self-driving car, sparse data, statistical model, strikebreaker, underbanked, universal basic income, urban renewal, W. E. B. Du Bois, War on Poverty, warehouse automation, working poor, Works Progress Administration, young professional, zero-sum game

In the case of the AFST, Allegheny County is concerned with child abuse, especially potential fatalities. But the number of child maltreatment–related fatalities and near fatalities in Allegheny County is very low—luckily, only a handful a year. A statistically meaningful model cannot be constructed with such sparse data. Failing that, it might seem logical to use child maltreatment as substantiated by CYF caseworkers to stand in for actual child maltreatment. But substantiation is an imprecise metric: it simply means that CYF believes there is enough evidence that a child may be harmed to accept a family for services.


pages: 250 words: 75,586

When the Air Hits Your Brain: Tales From Neurosurgery by Frank Vertosick

butterfly effect, double helix, Dr. Strangelove, index card, medical residency, planned obsolescence, random walk, sparse data, zero-sum game

The course of an illness when doctors don’t interfere with it is called its natural history. Ironically, for many diseases (including SAH), medicine has been fiddling with them for as long as they have been recognized as diseases. We are, therefore, totally clueless about the natural history of those diseases, except for what sparse data we can glean from patients who escape our clutches, either because they are too sick or have stubbornly refused our care. Given this lack of hard data, a surgeon is left to choose the option for each patient. If the surgeon is aggressive, then the patient will be steered toward surgery. Unlike the bunion patient, who alone knows how much it hurts and how much surgical risk she is willing to assume to alleviate her suffering, candidates for statistical surgery are completely at the surgeon’s mercy.


pages: 284 words: 84,169

Talk on the Wild Side by Lane Greene

Affordable Care Act / Obamacare, Albert Einstein, Boris Johnson, deep learning, Donald Trump, ending welfare as we know it, experimental subject, facts on the ground, fake news, framing effect, Google Chrome, Higgs boson, illegal immigration, invisible hand, language acquisition, Large Hadron Collider, machine translation, meta-analysis, Money creation, moral panic, natural language processing, obamacare, public intellectual, Ronald Reagan, Sapir-Whorf hypothesis, Snapchat, sparse data, speech recognition, Steven Pinker, TED Talk, Turing test, Wall-E

(Actual correction by parents – “it’s brought, sweetie…” – seems to play much less of a role.) * So human children seem neither to blindly follow the rules they learn, nor to merely infer from lots of data. The rules get them going quickly, making useful sentences of their own after learning from relatively sparse data. And the data-gathering approach lets them gradually store up and recall some of the more intricate and rarer bits of the language they need. This is an absolutely stunning ability, given that we’re talking about kids who cannot yet tie shoelaces. Yet since every cognitively typical child does it, it’s not a miracle, even though it looks like one, a testament to the power of the human language faculty.


Data Mining the Web: Uncovering Patterns in Web Content, Structure, and Usage by Zdravko Markov, Daniel T. Larose

Firefox, information retrieval, Internet Archive, iterative process, natural language processing, pattern recognition, random walk, recommendation engine, semantic web, sparse data, speech recognition, statistical model, William of Occam

The items are used as features to represent persons as vectors (rows in the person × item matrix). Then person vectors are clustered by using any clustering algorithm that we have discussed so far (e.g., k-means or EM). Finally, the missing values are taken from the cluster representation, where each person belongs. A problem in applying this approach involves the highly sparse data: In each person vector there are many missing values. The probabilistic algorithms can easily handle missing values; they are simply omitted from the computation of probabilities and the algorithm proceeds as usual. In similarity-based clustering such as k-means, a little adjustment is made for the missing feature values.
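A minimal sketch (toy ratings matrix; a simplification of what the chapter describes) of the adjustment for missing values: compute distances over only the co-rated items, and fill a person's missing values from a cluster mean that ignores missing cells.

import numpy as np

ratings = np.array([[5.0, np.nan, 3.0],
                    [4.0, 2.0, np.nan],
                    [np.nan, 1.0, 4.0]])

def distance(x, y):
    # Euclidean distance over the features both vectors actually have.
    mask = ~np.isnan(x) & ~np.isnan(y)
    return np.linalg.norm(x[mask] - y[mask]) if mask.any() else np.inf

print(distance(ratings[0], ratings[1]))  # 1.0: only item 0 is co-rated

centroid = np.nanmean(ratings, axis=0)   # cluster mean, ignoring missing cells
filled = np.where(np.isnan(ratings), centroid, ratings)
print(filled)                            # missing values taken from the centroid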


pages: 713 words: 93,944

Seven Databases in Seven Weeks: A Guide to Modern Databases and the NoSQL Movement by Eric Redmond, Jim Wilson, Jim R. Wilson

AGPL, Amazon Web Services, business logic, create, read, update, delete, data is the new oil, database schema, Debian, domain-specific language, en.wikipedia.org, fault tolerance, full text search, general-purpose programming language, Kickstarter, Large Hadron Collider, linked data, MVC pattern, natural language processing, node package manager, random walk, recommendation engine, Ruby on Rails, seminal paper, Skype, social graph, sparse data, web application

As you recall, each row has text:, revision:author, and revision:comment columns. The links table has no such regularity. Each row may have one column or hundreds. And the variety of column names is as diverse as the row keys themselves (titles of Wikipedia articles). That’s OK! HBase is a so-called sparse data store for exactly this reason. To find out just how many rows are now in your table, you can use the count command.

hbase> count 'wiki', INTERVAL => 100000, CACHE => 10000
Current count: 100000, row: Alexander wilson (vauxhall)
Current count: 200000, row: Bachelor of liberal studies
Current count: 300000, row: Brian donlevy
...
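
To make the sparseness concrete, here is a minimal sketch using the third-party happybase Python client, assuming a local HBase Thrift server and the 'wiki' table from the excerpt; the row keys and link columns are made up:

import happybase  # third-party client; needs a running HBase Thrift server

connection = happybase.Connection("localhost")  # assumed host
table = connection.table("wiki")

# Each row stores only the columns it actually has; nothing is reserved
# for the columns it lacks, which is what makes the store "sparse."
table.put(b"Stub Article", {b"links:Only Link": b"1"})
table.put(b"Hub Article", {
    b"links:First Link": b"1",
    b"links:Second Link": b"1",
    b"links:Third Link": b"1",
})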


pages: 442 words: 94,734

The Art of Statistics: Learning From Data by David Spiegelhalter

Abraham Wald, algorithmic bias, Anthropocene, Antoine Gombaud: Chevalier de Méré, Bayesian statistics, Brexit referendum, Carmen Reinhart, Charles Babbage, complexity theory, computer vision, confounding variable, correlation coefficient, correlation does not imply causation, dark matter, data science, deep learning, DeepMind, Edmond Halley, Estimating the Reproducibility of Psychological Science, government statistician, Gregor Mendel, Hans Rosling, Higgs boson, Kenneth Rogoff, meta-analysis, Nate Silver, Netflix Prize, Northpointe / Correctional Offender Management Profiling for Alternative Sanctions, p-value, placebo effect, probability theory / Blaise Pascal / Pierre de Fermat, publication bias, randomized controlled trial, recommendation engine, replication crisis, self-driving car, seminal paper, sparse data, speech recognition, statistical model, sugar pill, systematic bias, TED Talk, The Design of Experiments, The Signal and the Noise by Nate Silver, The Wisdom of Crowds, Thomas Bayes, Thomas Malthus, Two Sigma

MRP is no panacea – if a large number of respondents give systematically misleading answers and so do not represent their ‘cell’, then no amount of sophisticated statistical analyses will counter that bias. But it appears to be beneficial to use Bayesian modelling of every single voting area, and we shall see later that this has been spectacularly successful in exit polls conducted on the day of elections. Bayesian ‘smoothing’ can bring precision to very sparse data, and the techniques are being increasingly used for modelling, for example, how diseases spread over space and time. Bayesian learning is also now seen as a fundamental process of human awareness of the environment, in that we have prior expectations about what we will see in any context, and then only need to take notice of unexpected features in our vision which are then used to update our current perceptions.
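
A minimal sketch of one common form of such smoothing, beta-binomial shrinkage: a weak prior pulls a rate estimated from a tiny sample toward the overall rate, while a well-observed rate barely moves. The prior parameters and counts below are illustrative assumptions, not Spiegelhalter's:

# Beta(a, b) prior encoding an assumed overall rate of 0.10.
a, b = 2.0, 18.0

areas = {"sparse area": (1, 3), "well-observed area": (40, 400)}  # (events, trials)

for name, (events, trials) in areas.items():
    raw = events / trials
    smoothed = (a + events) / (a + b + trials)  # posterior mean
    print(f"{name}: raw {raw:.2f} -> smoothed {smoothed:.2f}")
# The sparse area's noisy 0.33 is pulled to about 0.13; the other stays at 0.10.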


pages: 356 words: 102,224

Pale Blue Dot: A Vision of the Human Future in Space by Carl Sagan

Albert Einstein, anthropic principle, Apollo 11, Apollo 13, cosmological principle, dark matter, Dava Sobel, Francis Fukuyama: the end of history, germ theory of disease, invention of the telescope, Isaac Newton, Johannes Kepler, Kuiper Belt, linked data, low earth orbit, military-industrial complex, Neil Armstrong, nuclear winter, planetary scale, power law, profit motive, remunicipalization, scientific worldview, Search for Extraterrestrial Intelligence, sparse data, Stephen Hawking, telepresence, time dilation

Early scientific speculation included fetid swamps crawling with monster amphibians, like the Earth in the Carboniferous Period; a world desert; a global petroleum sea; and a seltzer ocean dotted here and there with limestone-encrusted islands. While based on some scientific data, these “models” of Venus—the first dating from the beginnings of the century, the second from the 1930s, and the last two from the mid-1950s—were little more than scientific romances, hardly constrained by the sparse data available. Then, in 1956, a report was published in The Astrophysical Journal by Cornell H. Mayer and his colleagues. They had pointed a newly completed radio telescope, built in part for classified research, on the roof of the Naval Research Laboratory in Washington, D.C., at Venus and measured the flux of radio waves arriving at Earth.


pages: 372 words: 110,208

Who We Are and How We Got Here: Ancient DNA and the New Science of the Human Past by David Reich

23andMe, agricultural Revolution, Alfred Russel Wallace, carbon credits, Easter island, European colonialism, Google Earth, Great Leap Forward, invention of agriculture, invention of the wheel, invention of writing, mass immigration, meta-analysis, new economy, out of africa, phenotype, Scientific racism, sparse data, supervolcano, the scientific method, transatlantic slave trade

This is evident from the fact that our data include at least three distinct East African Forager groups within Africa—one spanning the ancient Ethiopian and ancient Kenyan, a second contributing large fractions of the ancestry of the ancient foragers from the Zanzibar Archipelago and Malawi, and a third represented in the present-day Hadza.38 Based on the sparse data we had, we were not able to determine the date when these groups separated from one another. But given the extended geographic span and the antiquity of human occupation in this region, it would not be surprising if some of the differences among these groups dated back tens of thousands of years.


pages: 502 words: 107,510

Natural Language Annotation for Machine Learning by James Pustejovsky, Amber Stubbs

Amazon Mechanical Turk, bioinformatics, cloud computing, computer vision, crowdsourcing, easy for humans, difficult for computers, finite state, Free Software Foundation, game design, information retrieval, iterative process, language acquisition, machine readable, machine translation, natural language processing, pattern recognition, performance metric, power law, sentiment analysis, social web, sparse data, speech recognition, statistical model, text mining

.(', 'Check'), ('10th', 'Sarajevo'), ('16100', 'Patranis'), ('1st', 'Avenue'),
('317', 'Riverside'), ('5000', 'Reward'), ('6310', 'Willoughby'), ('750hp', 'tire'),
('ALEX', 'MILLER'), ('Aasoo', 'Bane')]
>>> finder1.apply_freq_filter(10)  # look only at collocations that occur 10 times or more
>>> finder1.nbest(bigram_measures.pmi, 10)
[('United', 'States'), ('Los', 'Angeles'), ('Bhagwan', 'Shri'), ('martial', 'arts'),
('Lan', 'Yu'), ('Devi', 'Maa'), ('New', 'York'), ('qv', ')),'), ('qv', '))'), ('I', ")'")]
>>> finder1.apply_freq_filter(15)
>>> finder1.nbest(bigram_measures.pmi, 10)
[('Bhagwan', 'Shri'), ('Devi', 'Maa'), ('New', 'York'), ('qv', ')),'), ('qv', '))'),
('I', ")'"), ('no', 'longer'), ('years', 'ago'), ('none', 'other'), ('each', 'other')]

One issue with using this simple formula, however, involves the problem of sparse data. That is, the probabilities of observed rare events are overestimated, and the probabilities of unobserved rare events are underestimated. Researchers in computational linguistics have found ways to get around this problem to a certain extent, and we will return to this issue when we discuss ML algorithms in more detail in Chapter 7.
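
One standard workaround for the sparse-data problem (not necessarily the one the authors go on to describe) is Laplace "add-one" smoothing, which reserves some probability mass for unseen events; a minimal sketch with made-up bigram counts:

from collections import Counter

counts = Counter({"the cat": 3, "the dog": 1})  # made-up observed bigram counts
vocab_size = 6                                  # assumed number of possible bigrams
total = sum(counts.values())

def smoothed_prob(bigram):
    # Adding one to every count gives unseen bigrams a small nonzero
    # probability and deflates the estimates for rare observed ones.
    return (counts[bigram] + 1) / (total + vocab_size)

print(smoothed_prob("the cat"))   # 4/10 = 0.4, down from the raw 3/4
print(smoothed_prob("the fish"))  # 1/10 = 0.1, up from the raw 0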


pages: 428 words: 103,544

The Data Detective: Ten Easy Rules to Make Sense of Statistics by Tim Harford

Abraham Wald, access to a mobile phone, Ada Lovelace, affirmative action, algorithmic bias, Automated Insights, banking crisis, basic income, behavioural economics, Black Lives Matter, Black Swan, Bretton Woods, British Empire, business cycle, Cambridge Analytica, Capital in the Twenty-First Century by Thomas Piketty, Cass Sunstein, Charles Babbage, clean water, collapse of Lehman Brothers, contact tracing, coronavirus, correlation does not imply causation, COVID-19, cuban missile crisis, Daniel Kahneman / Amos Tversky, data science, David Attenborough, Diane Coyle, disinformation, Donald Trump, Estimating the Reproducibility of Psychological Science, experimental subject, fake news, financial innovation, Florence Nightingale: pie chart, Gini coefficient, Great Leap Forward, Hans Rosling, high-speed rail, income inequality, Isaac Newton, Jeremy Corbyn, job automation, Kickstarter, life extension, meta-analysis, microcredit, Milgram experiment, moral panic, Netflix Prize, Northpointe / Correctional Offender Management Profiling for Alternative Sanctions, opioid epidemic / opioid crisis, Paul Samuelson, Phillips curve, publication bias, publish or perish, random walk, randomized controlled trial, recommendation engine, replication crisis, Richard Feynman, Richard Thaler, rolodex, Ronald Reagan, selection bias, sentiment analysis, Silicon Valley, sorting algorithm, sparse data, statistical model, stem cell, Stephen Hawking, Steve Bannon, Steven Pinker, survivorship bias, systematic bias, TED Talk, universal basic income, W. E. B. Du Bois, When a measure becomes a target

.* Convergence continued throughout the 1950s and 1960s and sometimes into the 1970s.8 It’s a powerful demonstration of the way that even scientists measuring essential and unchanging facts filter the data to suit their preconceptions. This shouldn’t be entirely surprising. Our brains are always trying to make sense of the world around us based on incomplete information. The brain makes predictions about what it expects, and tends to fill in the gaps, often based on surprisingly sparse data. That is why we can understand a routine telephone conversation on a bad line—until the point at which genuinely novel information such as a phone number or street address is being spoken through the static. Our brains fill in the gaps—which is why we see what we expect to see and hear what we expect to hear, just as Millikan’s successors found what they expected to find.


pages: 363 words: 109,077

The Raging 2020s: Companies, Countries, People - and the Fight for Our Future by Alec Ross

"Friedman doctrine" OR "shareholder theory", "World Economic Forum" Davos, Affordable Care Act / Obamacare, air gap, air traffic controllers' union, Airbnb, Albert Einstein, An Inconvenient Truth, autonomous vehicles, barriers to entry, benefit corporation, Bernie Sanders, Big Tech, big-box store, British Empire, call centre, capital controls, clean water, collective bargaining, computer vision, coronavirus, corporate governance, corporate raider, COVID-19, deep learning, Deng Xiaoping, Didi Chuxing, disinformation, Dissolution of the Soviet Union, Donald Trump, Double Irish / Dutch Sandwich, drone strike, dumpster diving, employer provided health coverage, Francis Fukuyama: the end of history, future of work, general purpose technology, gig economy, Gini coefficient, global supply chain, Goldman Sachs: Vampire Squid, Gordon Gekko, greed is good, high-speed rail, hiring and firing, income inequality, independent contractor, information security, intangible asset, invisible hand, Jeff Bezos, knowledge worker, late capitalism, low skilled workers, Lyft, Marc Andreessen, Marc Benioff, mass immigration, megacity, military-industrial complex, minimum wage unemployment, mittelstand, mortgage tax deduction, natural language processing, Oculus Rift, off-the-grid, offshore financial centre, open economy, OpenAI, Parag Khanna, Paris climate accords, profit motive, race to the bottom, RAND corporation, ride hailing / ride sharing, Robert Bork, rolodex, Ronald Reagan, Salesforce, self-driving car, shareholder value, side hustle, side project, Silicon Valley, smart cities, Social Responsibility of Business Is to Increase Its Profits, sovereign wealth fund, sparse data, special economic zone, Steven Levy, stock buybacks, strikebreaker, TaskRabbit, tech bro, tech worker, transcontinental railway, transfer pricing, Travis Kalanick, trickle-down economics, Uber and Lyft, uber lyft, union organizing, Upton Sinclair, vertical integration, working poor

Dolber first learned of Rideshare Drivers United at an academic conference. He decided to attend one of its meetings in Los Angeles, and while there he crossed paths with Ivan Pardo. By that point, the organization had connected with only about five hundred of the estimated three hundred thousand rideshare drivers in California. Because Uber and Lyft publish only sparse data on their contractors, identifying and contacting new drivers was a laborious process. Up to that point, the group had recruited most of its members by canvassing parking lots at LAX. However, Dolber and Pardo developed a more scalable strategy for picking out rideshare drivers: Facebook. “Facebook is able to identify drivers better than anybody else,” Dolber said.


The Deepest Map by Laura Trethewey

9 dash line, airport security, Anthropocene, Apollo 11, circular economy, clean tech, COVID-19, crowdsourcing, digital map, Donald Trump, Elon Musk, en.wikipedia.org, Exxon Valdez, gentrification, global pandemic, high net worth, hive mind, Jeff Bezos, job automation, low earth orbit, Marc Benioff, microplastics / micro fibres, Neil Armstrong, Salesforce, Scramble for Africa, Silicon Valley, South China Sea, space junk, sparse data, TED Talk, UNCLOS

Neither of them realized that he had just given her a task that would consume the rest of her life. Marie began to draw in a looser physiographic style that showed the seafloor at an oblique angle, the way the Rocky Mountains look through an airplane window on a transcontinental flight. It took all her geographical and geological training to translate the sparse data points into more understandable terrain. On land, geologists climb a mountain, look around, take measurements, and make a map. Marie didn’t have the opportunity to survey the seafloor with her own eyes; she had to decide what features to emphasize and create the “feel” of the new frontier rather than a set of recorded data points.62 “It was a very demanding technique where you had data.


pages: 541 words: 109,698

Mining the Social Web: Finding Needles in the Social Haystack by Matthew A. Russell

Andy Rubin, business logic, Climategate, cloud computing, crowdsourcing, data science, en.wikipedia.org, fault tolerance, Firefox, folksonomy, full text search, Georg Cantor, Google Earth, information retrieval, machine readable, Mark Zuckerberg, natural language processing, NP-complete, power law, Saturday Night Live, semantic web, Silicon Valley, slashdot, social graph, social web, sparse data, statistical model, Steve Jobs, supply-chain management, text mining, traveling salesman, Turing test, web application

Ironically (in the context of the current discussion), the calculations involved in computing the PMI lead it to score high-frequency words lower than low-frequency words, which is the opposite of the desired effect. Therefore, it is a good measure of independence but not a good measure of dependence (i.e., a less than ideal choice for scoring collocations). It has also been shown that sparse data is a particular stumbling block for PMI scoring, and that other techniques such as the likelihood ratio tend to outperform it.

Tapping into Your Gmail

Google Buzz is a great source of clean textual data that you can mine, but it’s just one of many starting points. Since this chapter showcases Google technology, this section provides a brief overview of how to tap into your Gmail data so that you can mine the text of what may be many thousands of messages in your inbox.
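
The frequency bias is easy to see numerically. A minimal sketch with made-up counts, assuming a 10,000-token corpus: two word pairs are each perfectly associated (every occurrence of one word is with the other), yet the rarer pair gets the higher PMI:

import math

N = 10_000  # assumed corpus size in tokens

def pmi(joint, x_count, y_count):
    # PMI = log2( P(x, y) / (P(x) * P(y)) )
    return math.log2((joint / N) / ((x_count / N) * (y_count / N)))

print(pmi(100, 100, 100))  # frequent pair: log2(N/100), about 6.6
print(pmi(2, 2, 2))        # rare pair:     log2(N/2),   about 12.3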


pages: 397 words: 121,211

Coming Apart: The State of White America, 1960-2010 by Charles Murray

affirmative action, assortative mating, blue-collar work, classic study, Community Supported Agriculture, corporate governance, David Brooks, en.wikipedia.org, feminist movement, gentrification, George Gilder, Haight Ashbury, happiness index / gross national happiness, helicopter parent, illegal immigration, income inequality, job satisfaction, labor-force participation, longitudinal study, low skilled workers, Menlo Park, new economy, public intellectual, Ralph Nader, Richard Florida, Silicon Valley, sparse data, Steve Jobs, The Bell Curve by Richard Herrnstein and Charles Murray, Tipper Gore, Unsafe at Any Speed, War on Poverty, working-age population, young professional

This variable has three values, drawing on the categories used in chapter 11: (1) de facto seculars—those either with no religion or professing a religion but attending worship services no more than once a year; (2) believers who profess a religion and attend services at least several times a year but do not qualify for the third category; and (3) those who attend services at least nearly every week and say that they have a strong affiliation with their religion. Community. Because of the GSS’s sparse data on measures of social and civic engagement during the 1990s and 2000s, we are restricted to an index of social trust, which sums the optimistic responses to the helpfulness, fairness, and trustworthiness questions discussed in chapter 14. The three items were coded so that the negative answer (e.g., “most people try to take advantage of you”) is scored as 0, the “it depends” answer is scored as 1, and the positive answer is scored as 2.
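
A minimal sketch of that coding scheme; the answer labels are hypothetical stand-ins for the GSS's actual response wordings:

# Score one item: 0 = negative answer, 1 = "it depends", 2 = positive answer.
SCORES = {"negative": 0, "depends": 1, "positive": 2}

def social_trust_index(helpfulness, fairness, trustworthiness):
    # Sum the three items, so the index runs from 0 (distrustful) to 6 (trusting).
    return sum(SCORES[a] for a in (helpfulness, fairness, trustworthiness))

print(social_trust_index("positive", "depends", "negative"))  # 3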


pages: 377 words: 21,687

Digital Apollo: Human and Machine in Spaceflight by David A. Mindell

"Margaret Hamilton" Apollo, 1960s counterculture, Apollo 11, Apollo 13, Apollo Guidance Computer, Charles Lindbergh, computer age, deskilling, Dr. Strangelove, Fairchild Semiconductor, fault tolerance, Gene Kranz, interchangeable parts, Lewis Mumford, Mars Rover, more computing power than Apollo, Neil Armstrong, Norbert Wiener, Norman Mailer, orbital mechanics / astrodynamics, Silicon Valley, sparse data, Stewart Brand, systems thinking, tacit knowledge, telepresence, telerobotics, The Soul of a New Machine

But the astronauts and Grumman were accustomed to systems from aircraft that had four gimbals, including Gemini, and nervous about the prospect of ‘‘forbidden attitudes’’ and the danger of gimbal lock. The issue hinged on the knotty problem of reliability—what was the likelihood that a gyro would fail, and hence require the redundancy of four instead of three? The trouble was, reliability is notoriously difficult to predict, and conflicts arose over how to interpret the sparse data. The IL provided its own estimates of reliability, based on experience with Polaris, showing that three gimbals would be reliable enough. But Grumman too developed estimates, which NASA found ‘‘highly pessimistic.’’ Grumman extrapolated reliability numbers from earlier missile programs and used them to argue for a redundant, fourgimbal platform.


Beautiful Data: The Stories Behind Elegant Data Solutions by Toby Segaran, Jeff Hammerbacher

23andMe, airport security, Amazon Mechanical Turk, bioinformatics, Black Swan, business intelligence, card file, cloud computing, computer vision, correlation coefficient, correlation does not imply causation, crowdsourcing, Daniel Kahneman / Amos Tversky, DARPA: Urban Challenge, data acquisition, data science, database schema, double helix, en.wikipedia.org, epigenetics, fault tolerance, Firefox, Gregor Mendel, Hans Rosling, housing crisis, information retrieval, lake wobegon effect, Large Hadron Collider, longitudinal study, machine readable, machine translation, Mars Rover, natural language processing, openstreetmap, Paradox of Choice, power law, prediction markets, profit motive, semantic web, sentiment analysis, Simon Singh, social bookmarking, social graph, SPARQL, sparse data, speech recognition, statistical model, supply-chain management, systematic bias, TED Talk, text mining, the long tail, Vernor Vinge, web application

For example, Alice might change her status to “Busy on the phone,” and then later change it to “Off the phone, anybody wanna chat?” When Alice changes her status, we write it into her profile record so that her friends can see it. The profile table might look like Table 4-1. Notice that to support evolving web applications, we must allow for a flexible schema and sparse data; not every record will have a value for every field, and adding new fields must be cheap.

Table 4-1. User profile table

Username  FullName       Location            Status                              IM        BlogID  Photo   …
Alice     Alice Smith    Sunnyvale, CA       Off the phone, anybody wanna chat?  Alice345  3411    me.jpg  …
Bob       Bob Jones      Singapore           Eating dinner                       …         5539    …       …
Charles   Charles Adams  New York, New York  Sleeping                            …         …       …       …

How should we update her profile record?
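
In code, such a sparse, flexible-schema record is naturally a mapping that simply omits absent fields; a minimal sketch using the fields of Table 4-1 (the update function is an assumed illustration, not PNUTS's API):

# Sparse records: each row stores only the fields it actually has.
profiles = {
    "Alice":   {"FullName": "Alice Smith", "Location": "Sunnyvale, CA",
                "IM": "Alice345", "BlogID": 3411, "Photo": "me.jpg"},
    "Bob":     {"FullName": "Bob Jones", "Location": "Singapore", "BlogID": 5539},
    "Charles": {"FullName": "Charles Adams", "Location": "New York, New York"},
}

def set_status(username, status):
    # Adding a brand-new field is cheap: just set a key on that one record.
    profiles[username]["Status"] = status

set_status("Alice", "Off the phone, anybody wanna chat?")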


Amritsar 1919: An Empire of Fear and the Making of a Massacre by Kim Wagner

British Empire, colonial rule, European colonialism, Mahatma Gandhi, sparse data, trade route, Wall-E

There were just two women listed, namely Bibi Har Kaur and Masammat Bisso, which reflected the fact that women rarely joined such large gatherings. Of the overwhelmingly male list, fifteen were fifteen years or younger, while thirty-two were fifty or over, and the youngest was eight while the oldest was eighty years old.56 Combined with the sparse data from other supplementary records, this list provided the most comprehensive reflection of the composition of the crowd on 13 April. While the exact number of people who were killed at Jallianwala Bagh on 13 April would never be known, the figure of 379 (or 376) was certainly too low and reflected only those victims whose identity was confirmed.


pages: 626 words: 167,836

The Technology Trap: Capital, Labor, and Power in the Age of Automation by Carl Benedikt Frey

3D printing, AlphaGo, Alvin Toffler, autonomous vehicles, basic income, Bernie Sanders, Branko Milanovic, British Empire, business cycle, business process, call centre, Cambridge Analytica, Capital in the Twenty-First Century by Thomas Piketty, Charles Babbage, Clayton Christensen, collective bargaining, computer age, computer vision, Corn Laws, Cornelius Vanderbilt, creative destruction, data science, David Graeber, David Ricardo: comparative advantage, deep learning, DeepMind, deindustrialization, demographic transition, desegregation, deskilling, Donald Trump, driverless car, easy for humans, difficult for computers, Edward Glaeser, Elon Musk, Erik Brynjolfsson, everywhere but in the productivity statistics, factory automation, Fairchild Semiconductor, falling living standards, first square of the chessboard / second half of the chessboard, Ford Model T, Ford paid five dollars a day, Frank Levy and Richard Murnane: The New Division of Labor, full employment, future of work, game design, general purpose technology, Gini coefficient, Great Leap Forward, Hans Moravec, high-speed rail, Hyperloop, income inequality, income per capita, independent contractor, industrial cluster, industrial robot, intangible asset, interchangeable parts, Internet of things, invention of agriculture, invention of movable type, invention of the steam engine, invention of the wheel, Isaac Newton, James Hargreaves, James Watt: steam engine, Jeremy Corbyn, job automation, job satisfaction, job-hopping, John Maynard Keynes: Economic Possibilities for our Grandchildren, John Maynard Keynes: technological unemployment, Joseph Schumpeter, Kickstarter, Kiva Systems, knowledge economy, knowledge worker, labor-force participation, labour mobility, Lewis Mumford, Loebner Prize, low skilled workers, machine translation, Malcom McLean invented shipping containers, manufacturing employment, mass immigration, means of production, Menlo Park, minimum wage unemployment, natural language processing, new economy, New Urbanism, Nick Bostrom, Norbert Wiener, nowcasting, oil shock, On the Economy of Machinery and Manufactures, OpenAI, opioid epidemic / opioid crisis, Pareto efficiency, pattern recognition, pink-collar, Productivity paradox, profit maximization, Renaissance Technologies, rent-seeking, rising living standards, Robert Gordon, Robert Solow, robot derives from the Czech word robota Czech, meaning slave, safety bicycle, Second Machine Age, secular stagnation, self-driving car, seminal paper, Silicon Valley, Simon Kuznets, social intelligence, sparse data, speech recognition, spinning jenny, Stephen Hawking, tacit knowledge, The Future of Employment, The Rise and Fall of American Growth, The Wealth of Nations by Adam Smith, Thomas Malthus, total factor productivity, trade route, Triangle Shirtwaist Factory, Turing test, union organizing, universal basic income, warehouse automation, washing machines reduced drudgery, wealth creators, women in the workforce, working poor, zero-sum game

Trade emerged across continents, and new goods were discovered and consumed that had previously been unknown: colonial goods like sugar, spices, tea, tobacco, and rice—to name a few—were shipped distances that mankind had once not known existed. Though empirical evidence on the rise of international trade is sparse, data for the period 1622–1700 shows that British imports and exports doubled. The growing importance of trade is similarly suggested by the rapid expansion of shipping. Between 1470 and the early nineteenth century, the merchant fleet of Western Europe grew sevenfold.28 As many of the colonial goods and other imports became attainable for a growing share of the population, people started to drink more tea, often sweetened with sugar; bought more luxurious clothing; and discovered new spices for their meals.


pages: 635 words: 186,208

House of Suns by Alastair Reynolds

autonomous vehicles, cosmic microwave background, data acquisition, disinformation, gravity well, megastructure, planetary scale, space junk, sparse data, time dilation

Whatever it is is large, and it is moving towards us.’ Dalliance pushed her faculties to the limit, lowering her detection thresholds now that I had independent evidence that something else was lurking in the cloud. In a few moments, something appeared in the displayer - a hazy blob, framed in a box and accompanied by the exceedingly sparse data my ship had managed to extract. The object was well camouflaged but large - five or six kilometres wide - and Hesperus had been right about it coming nearer. ‘It could be a big ship, or a big ship carrying a Homunculus weapon, or just one of the weapons on its own,’ I said. ‘I see smaller signals grouped around it - other ships, perhaps.’


HBase: The Definitive Guide by Lars George

Alignment Problem, Amazon Web Services, bioinformatics, create, read, update, delete, Debian, distributed revision control, domain-specific language, en.wikipedia.org, fail fast, fault tolerance, Firefox, FOSDEM, functional programming, Google Earth, information security, Kickstarter, place-making, revision control, smart grid, sparse data, web application

This will improve the performance of the query significantly, since it uses a Scan internally, selecting only the mapped column families. If you have a sparsely set family, this will only scan the much smaller files on disk, as opposed to running a job that has to scan everything just to filter out the sparse data. Mapping an existing table requires the Hive EXTERNAL keyword, which is also used in other places to access data stored in unmanaged Hive tables, that is, those that are not under Hive’s control:

hive> CREATE EXTERNAL TABLE hbase_table_2(key int, value string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val")
TBLPROPERTIES("hbase.table.name" = "<existing-table-name>");

External tables are not deleted when the table is dropped within Hive.


Four Battlegrounds by Paul Scharre

2021 United States Capitol attack, 3D printing, active measures, activist lawyer, AI winter, AlphaGo, amateurs talk tactics, professionals talk logistics, artificial general intelligence, ASML, augmented reality, Automated Insights, autonomous vehicles, barriers to entry, Berlin Wall, Big Tech, bitcoin, Black Lives Matter, Boeing 737 MAX, Boris Johnson, Brexit referendum, business continuity plan, business process, carbon footprint, chief data officer, Citizen Lab, clean water, cloud computing, commoditize, computer vision, coronavirus, COVID-19, crisis actor, crowdsourcing, DALL-E, data is not the new oil, data is the new oil, data science, deep learning, deepfake, DeepMind, Demis Hassabis, Deng Xiaoping, digital map, digital rights, disinformation, Donald Trump, drone strike, dual-use technology, Elon Musk, en.wikipedia.org, endowment effect, fake news, Francis Fukuyama: the end of history, future of journalism, future of work, game design, general purpose technology, Geoffrey Hinton, geopolitical risk, George Floyd, global supply chain, GPT-3, Great Leap Forward, hive mind, hustle culture, ImageNet competition, immigration reform, income per capita, interchangeable parts, Internet Archive, Internet of things, iterative process, Jeff Bezos, job automation, Kevin Kelly, Kevin Roose, large language model, lockdown, Mark Zuckerberg, military-industrial complex, move fast and break things, Nate Silver, natural language processing, new economy, Nick Bostrom, one-China policy, Open Library, OpenAI, PalmPilot, Parler "social media", pattern recognition, phenotype, post-truth, purchasing power parity, QAnon, QR code, race to the bottom, RAND corporation, recommendation engine, reshoring, ride hailing / ride sharing, robotic process automation, Rodney Brooks, Rubik’s Cube, self-driving car, Shoshana Zuboff, side project, Silicon Valley, slashdot, smart cities, smart meter, Snapchat, social software, sorting algorithm, South China Sea, sparse data, speech recognition, Steve Bannon, Steven Levy, Stuxnet, supply-chain attack, surveillance capitalism, systems thinking, tech worker, techlash, telemarketer, The Brussels Effect, The Signal and the Noise by Nate Silver, TikTok, trade route, TSMC

Rather than train a model to identify the broad category “snake,” which would not be helpful for determining whether or not the snake was poisonous, the iNaturalist challenge identified specific species. From a technical standpoint, the iNaturalist challenge also pushed the boundaries in training models on sparse data. While the 2018 dataset included 450,000 images across 8,000 categories, the training images were not evenly distributed across categories. Some categories had hundreds of training images while rarer species had only a few dozen images. (For comparison, ImageNet’s goal is an average of 1,000 images per category.)


pages: 795 words: 215,529

Genius: The Life and Science of Richard Feynman by James Gleick

Albert Einstein, American ideology, Arthur Eddington, Brownian motion, Charles Babbage, disinformation, double helix, Douglas Hofstadter, Dr. Strangelove, Eddington experiment, Ernest Rutherford, gravity well, Gödel, Escher, Bach, Higgs boson, Isaac Newton, John von Neumann, Menlo Park, military-industrial complex, Murray Gell-Mann, mutually assured destruction, Neil Armstrong, Norbert Wiener, Norman Mailer, pattern recognition, Pepto Bismol, Richard Feynman, Richard Feynman: Challenger O-ring, Ronald Reagan, Rubik’s Cube, Sand Hill Road, Schrödinger's Cat, sexual politics, sparse data, Stephen Hawking, Steven Levy, the scientific method, Thomas Kuhn: the structure of scientific revolutions, uranium enrichment

Across the continent, where the Jet Propulsion Laboratory in Pasadena served as the army’s main collaborator in rocket research, a team was struggling with the task of tracking the satellite’s course. They used a room-size IBM 704 digital computer. It was temperamental. They entered the primitively sparse data available for tracking the metal can that the army’s rocket had hurled forward: the frequency of the radio signal, changing Doppler-fashion as the velocity in the line of flight changed; the time of disappearance from the observers at Cape Canaveral; observations from other tracking stations. The JPL team had learned that small variations in the computer’s input caused enormous variations in its output.


pages: 933 words: 205,691

Hadoop: The Definitive Guide by Tom White

Amazon Web Services, bioinformatics, business intelligence, business logic, combinatorial explosion, data science, database schema, Debian, domain-specific language, en.wikipedia.org, exponential backoff, fallacies of distributed computing, fault tolerance, full text search, functional programming, Grace Hopper, information retrieval, Internet Archive, Kickstarter, Large Hadron Collider, linked data, loose coupling, openstreetmap, recommendation engine, RFID, SETI@home, social graph, sparse data, web application

[126] On regionserver crash, when running on an older version of Hadoop, edits written to the commit log kept in HDFS were not recoverable, as files that had not been properly closed lost all edits no matter how much had been written to them at the time of the crash.
[127] Yes, this file is named for Hadoop, though it’s for setting up HBase metrics.
[128] “Column-Stores for Wide and Sparse Data” by Daniel J. Abadi.

Chapter 14. ZooKeeper

So far in this book, we have been studying large-scale data processing. This chapter is different: it is about building general distributed applications using Hadoop’s distributed coordination service, called ZooKeeper. Writing distributed applications is hard.


pages: 796 words: 223,275

The WEIRDest People in the World: How the West Became Psychologically Peculiar and Particularly Prosperous by Joseph Henrich

agricultural Revolution, Bartolomé de las Casas, behavioural economics, British Empire, charter city, cognitive dissonance, Columbian Exchange, correlation does not imply causation, cotton gin, Daniel Kahneman / Amos Tversky, dark matter, delayed gratification, discovery of the americas, Edward Glaeser, en.wikipedia.org, endowment effect, epigenetics, European colonialism, experimental economics, financial innovation, Flynn Effect, fundamental attribution error, glass ceiling, income inequality, invention of agriculture, Isaac Newton, James Hargreaves, James Watt: steam engine, Johannes Kepler, John Snow's cholera map, joint-stock company, knowledge economy, land reform, longitudinal study, Menlo Park, mental accounting, meta-analysis, New Urbanism, pattern recognition, Pearl River Delta, profit maximization, randomized controlled trial, Republic of Letters, rolodex, social contagion, social web, sparse data, spinning jenny, Spread Networks laid a new fibre optics cable between New York and Chicago, Stanford marshmallow experiment, tacit knowledge, The Wealth of Nations by Adam Smith, theory of mind, trade route, Tyler Cowen, ultimatum game, wikimedia commons, working-age population, World Values Survey, zero-sum game

The relationships between the prevalence of first cousin marriage in regions of Spain, Italy, France, and Turkey and four dimensions of psychology: (A) Individualism-Independence, (B) Conformity-Obedience, (C) Impersonal Trust, and (D) Impersonal Fairness. Although the analyses displayed in Figure 7.2 are based on sparse data from various corners of Europe that received lower dosages of the MFP for idiosyncratic historical reasons, they nevertheless illuminate a portion of the pathway that runs from the MFP through the historical dissolution of intensive kinship and into the minds of contemporary Europeans. Now let’s zoom in even closer to focus on an enduring puzzle in the social sciences: the Italian enigma.


pages: 1,034 words: 241,773

Enlightenment Now: The Case for Reason, Science, Humanism, and Progress by Steven Pinker

3D printing, Abraham Maslow, access to a mobile phone, affirmative action, Affordable Care Act / Obamacare, agricultural Revolution, Albert Einstein, Alfred Russel Wallace, Alignment Problem, An Inconvenient Truth, anti-communist, Anton Chekhov, Arthur Eddington, artificial general intelligence, availability heuristic, Ayatollah Khomeini, basic income, Berlin Wall, Bernie Sanders, biodiversity loss, Black Swan, Bonfire of the Vanities, Brexit referendum, business cycle, capital controls, Capital in the Twenty-First Century by Thomas Piketty, carbon footprint, carbon tax, Charlie Hebdo massacre, classic study, clean water, clockwork universe, cognitive bias, cognitive dissonance, Columbine, conceptual framework, confounding variable, correlation does not imply causation, creative destruction, CRISPR, crowdsourcing, cuban missile crisis, Daniel Kahneman / Amos Tversky, dark matter, data science, decarbonisation, degrowth, deindustrialization, dematerialisation, demographic transition, Deng Xiaoping, distributed generation, diversified portfolio, Donald Trump, Doomsday Clock, double helix, Eddington experiment, Edward Jenner, effective altruism, Elon Musk, en.wikipedia.org, end world poverty, endogenous growth, energy transition, European colonialism, experimental subject, Exxon Valdez, facts on the ground, fake news, Fall of the Berlin Wall, first-past-the-post, Flynn Effect, food miles, Francis Fukuyama: the end of history, frictionless, frictionless market, Garrett Hardin, germ theory of disease, Gini coefficient, Great Leap Forward, Hacker Conference 1984, Hans Rosling, hedonic treadmill, helicopter parent, Herbert Marcuse, Herman Kahn, Hobbesian trap, humanitarian revolution, Ignaz Semmelweis: hand washing, income inequality, income per capita, Indoor air pollution, Intergovernmental Panel on Climate Change (IPCC), invention of writing, Jaron Lanier, Joan Didion, job automation, Johannes Kepler, John Snow's cholera map, Kevin Kelly, Khan Academy, knowledge economy, l'esprit de l'escalier, Laplace demon, launch on warning, life extension, long peace, longitudinal study, Louis Pasteur, Mahbub ul Haq, Martin Wolf, mass incarceration, meta-analysis, Michael Shellenberger, microaggression, Mikhail Gorbachev, minimum wage unemployment, moral hazard, mutually assured destruction, Naomi Klein, Nate Silver, Nathan Meyer Rothschild: antibiotics, negative emissions, Nelson Mandela, New Journalism, Norman Mailer, nuclear taboo, nuclear winter, obamacare, ocean acidification, Oklahoma City bombing, open economy, opioid epidemic / opioid crisis, paperclip maximiser, Paris climate accords, Paul Graham, peak oil, Peter Singer: altruism, Peter Thiel, post-truth, power law, precautionary principle, precision agriculture, prediction markets, public intellectual, purchasing power parity, radical life extension, Ralph Nader, randomized controlled trial, Ray Kurzweil, rent control, Republic of Letters, Richard Feynman, road to serfdom, Robert Gordon, Rodney Brooks, rolodex, Ronald Reagan, Rory Sutherland, Saturday Night Live, science of happiness, Scientific racism, Second Machine Age, secular stagnation, self-driving car, sharing economy, Silicon Valley, Silicon Valley ideology, Simon Kuznets, Skype, smart grid, Social Justice Warrior, sovereign wealth fund, sparse data, stem cell, Stephen Hawking, Steve Bannon, Steven Pinker, Stewart Brand, Stuxnet, supervolcano, synthetic biology, tech billionaire, technological determinism, technological singularity, Ted Kaczynski, Ted Nordhaus, TED Talk, The Rise and Fall of American Growth, the scientific method, The Signal and the Noise by Nate Silver, The Spirit Level, The Wealth of Nations by Adam Smith, The Wisdom of Crowds, Thomas Kuhn: the structure of scientific revolutions, Thomas Malthus, total factor productivity, Tragedy of the Commons, union organizing, universal basic income, University of East Anglia, Unsafe at Any Speed, Upton Sinclair, uranium enrichment, urban renewal, W. E. B. Du Bois, War on Poverty, We wanted flying cars, instead we got 140 characters, women in the workforce, working poor, World Values Survey, Y2K

The peaks in the graph correspond to mass killings in the Indonesian anti-Communist “year of living dangerously” (1965–66, 700,000 deaths), the Chinese Cultural Revolution (1966–75, 600,000), Tutsis against Hutus in Burundi (1965–73, 140,000), the Bangladesh War of Independence (1971, 1.7 million), north-against-south violence in Sudan (1956–72, 500,000), Idi Amin’s regime in Uganda (1972–79, 150,000), Pol Pot’s regime in Cambodia (1975–79, 2.5 million), killings of political enemies in Vietnam (1965–75, 500,000), and more recent massacres in Bosnia (1992–95, 225,000), Rwanda (1994, 700,000), and Darfur (2003–8, 373,000).15 The barely perceptible swelling from 2014 to 2016 includes the atrocities that contribute to the impression that we are living in newly violent times: at least 4,500 Yazidis, Christians, and Shiite civilians killed by ISIS; 5,000 killed by Boko Haram in Nigeria, Cameroon, and Chad; and 1,750 killed by Muslim and Christian militias in the Central African Republic.16 One can never use the word “fortunately” in connection with the killing of innocents, but the numbers in the 21st century are a fraction of those in earlier decades. Of course, the numbers in a dataset cannot be interpreted as a direct readout of the underlying risk of war. The historical record is especially scanty when it comes to estimating any change in the likelihood of very rare but very destructive wars.17 To make sense of sparse data in a world whose history plays out only once, we need to supplement the numbers with knowledge about the generators of war, since, as the UNESCO motto notes, “Wars begin in the minds of men.” And indeed we find that the turn away from war consists in more than just a reduction in wars and war deaths; it also may be seen in nations’ preparations for war.


pages: 764 words: 261,694

The Elements of Statistical Learning (Springer Series in Statistics) by Trevor Hastie, Robert Tibshirani, Jerome Friedman

algorithmic bias, backpropagation, Bayesian statistics, bioinformatics, computer age, conceptual framework, correlation coefficient, data science, G4S, Geoffrey Hinton, greed is good, higher-order functions, linear programming, p-value, pattern recognition, random walk, selection bias, sparse data, speech recognition, statistical model, stochastic process, The Wisdom of Crowds

Linear models were largely developed in the precomputer age of statistics, but even in today’s computer era there are still good reasons to study and use them. They are simple and often provide an adequate and interpretable description of how the inputs affect the output. For prediction purposes they can sometimes outperform fancier nonlinear models, especially in situations with small numbers of training cases, low signal-to-noise ratio or sparse data. Finally, linear methods can be applied to transformations of the inputs and this considerably expands their scope. These generalizations are sometimes called basis-function methods, and are discussed in Chapter 5. In this chapter we describe linear methods for regression, while in the next chapter we discuss linear methods for classification.
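
A minimal sketch of the basis-function idea mentioned at the end: the model stays linear in its coefficients even though the inputs are transformed nonlinearly, so ordinary least squares still fits it. The data here is synthetic:

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=30)                       # small synthetic sample
y = 1.0 + 0.5 * x - 0.8 * x**2 + rng.normal(0, 0.3, size=30)

# Basis expansion [1, x, x^2]: nonlinear in x, linear in the coefficients.
X = np.column_stack([np.ones_like(x), x, x**2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef)  # roughly [1.0, 0.5, -0.8]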