Quantitative research in entrepreneurship using the R software for data analysis

Abstract
Purpose of the study: this text presents an overview of quantitative research in entrepreneurship in Brazil and describes possibilities for advancing this approach. Methodology and approach: the article draws on bibliographic surveys of the scientific literature on entrepreneurship and on theoretical discussions. Main results: most national research on entrepreneurship is qualitative. Despite the relevance of this approach, we believe that quantitative research has multiple potentialities, especially when associated with data from secondary sources. Main theoretical and methodological contributions: we present public databases that entrepreneurship researchers can use to advance theory. Some strategies for using these databases are exemplified in a brief tutorial in the R language. Finally, we discuss strategies for strengthening quantitative research in the area and offer a research agenda. Relevance/Originality: the text presents content that is still little explored in the national literature, such as the use of secondary data and machine learning. Social and managerial contributions: some of the databases presented in the study come from government sources and can be used to support the design of public policies for entrepreneurship. Moreover, the precepts on quantitative research presented in this editorial can help managers who work with data analysis to formulate more robust studies, regardless of whether their field is practical or academic.
Keywords: Quantitative methods. R software. Secondary data.


INTRODUCTION
The qualitative approach is the most used method in entrepreneurship research in Brazil. An analysis of papers published in Revista de Empreendedorismo e Gestão de Pequenas Empresas (REGEPE) between 2012 and 2022 found more articles following a qualitative approach than a quantitative one: 104 versus 62. Furthermore, as illustrated in Figure 1, only in the last year did quantitative studies outnumber qualitative ones.

Figure 1
Evolution of publications by approach
Note: Elaborated by the authors.
Several recent reviews and bibliometric studies published in Brazil reached the same conclusion. In a review of the publications of the Encontro de Estudos Sobre Empreendedorismo e Gestão de Pequenas Empresas (EGEPE) and the Encontro Nacional da Associação de Pós-Graduação e Pesquisa em Administração (Enanpad) between 2000 and 2008, Nassif et al. (2010) found a predominance of qualitative methods: 60.7% of the 219 theoretical-empirical articles were qualitative. Oliveira et al. (2018) investigated entrepreneurship articles published in six management journals between 2000 and 2014 and also noticed a prevalence of qualitative methodologies. Their search included 54 empirical studies, of which 51.9% were qualitative, 11.1% mixed, and 37% quantitative. Ferreira et al. (2020) reported that 44% of the 179 articles published between 2004 and 2020 were qualitative, 27% quantitative, and 25% theoretical.
This characteristic distinguishes national research from the international field, where quantitative studies predominate. McDonald et al. (2015) surveyed six of the major international journals on entrepreneurship from 1985 to 2013 and found that, in a sample of 3749 papers, the majority (55%) employed a quantitative approach. An updated analysis of the same journals used by McDonald et al. (2015) showed the same pattern. Except for Entrepreneurship & Regional Development Journal, all of them publish quantitative studies more frequently. Moreover, in a universe of 362 empirical articles, 69.06% were quantitative (see Figure 2).
The greater number of qualitative studies is not necessarily a problem. Qualitative research is essential for the development of scientific knowledge in applied social sciences (Cristi, 2018), and it is no different in the field of entrepreneurship (Gil & Silva, 2015; Neergaard & Ulhoi, 2007). The issue is that the quantitative approach has historically been little used in entrepreneurship research in Brazil. The lack of quantitative research may prevent researchers from leveraging the benefits that this approach can provide, such as the ability to cover representative samples to validate theories initially developed and explored through qualitative methods, and generalization through sample designs and appropriate analysis techniques (Cooper & Schindler, 2014). Failure to advance quantitative research may represent a barrier to the development of the entrepreneurship field in Brazil.

Figure 2
Proportion of studies by type and journal
Note: Elaborated by the authors.
The reasons for the low number of quantitative publications are multifaceted. One possible explanation, however, is the limited use of secondary data. In the REGEPE survey cited above, almost twice as many quantitative or mixed methods studies used primary data as secondary data: 49 used primary and 27 used secondary data. This finding is consistent with Oliveira Junior et al. (2018), who found that 79.6% of studies employed primary surveys in a universe of 54 national empirical articles on entrepreneurship. Secondary data can increase the quantity and quality of quantitative research by avoiding the costs and time associated with collecting primary data and by allowing studies with a greater number of observations (Hox & Boeije, 2005).
Knowing the available databases, as well as mastering the tools and methodologies for accessing and using them, could expand the use of secondary data sources in research. In this regard, this article makes two contributions to the advancement of entrepreneurship research using a quantitative approach and secondary data: first, it presents public national and international databases for entrepreneurship research; second, it introduces a tutorial on the use of the R software for entrepreneurship data.

SECONDARY DATA SOURCES FOR ENTREPRENEURSHIP RESEARCH
Obtaining data for entrepreneurship research is difficult, as it is in many other areas of Social Sciences. On the one hand, there is a scarcity of secondary sources on the early stages of business creation. On the other hand, from the standpoint of primary surveys, accessing entrepreneurs is tricky because they are busy individuals whose businesses are constantly changing, making it difficult to capture all the phenomena (Maula & Stam, 2020). However, in recent years, there has been an increase in the amount of data collected from different sources, such as database records, web scraping techniques, and videos (Maula & Stam, 2020; Obschonka & Audretsch, 2020). The last two sources, web scraping and videos, rely on techniques for extracting, interpreting, and analyzing unstructured data (e.g., texts, images, videos) and therefore have great potential for both qualitative and quantitative research. The first source, database records, is characterized by the structured nature of its data. Each will be discussed in greater detail below.
Quantitative research in entrepreneurship using the R software for data analysis Pagotto, D. do P. & Borges, C. REGEPE Entrep. and Small Bus. J., v.12, n.2, May/Aug., 2023
Structured data are the most used secondary sources in entrepreneurship research. This universe includes governmental and institutional databases, company records, and surveys devoted primarily to the study of the entrepreneurial phenomenon, such as the Panel Study of Entrepreneurial Dynamics (PSED) and the Global Entrepreneurship Monitor (GEM). Benatti et al. (2021), for example, used data on the Microempreendedor Individual (MEI), extracted from the Data Sebrae repository¹, to assess the effect of this category of business on the economic development of municipalities in the state of São Paulo. Audretsch et al. (2021) used a variety of data sources, including the GEM, to investigate the impact of institutional variables on opportunity and necessity entrepreneurship at the national level. Section 3 will cover a few of these databases in greater detail.
Exploration of unstructured sources, such as texts and videos, is still unusual in entrepreneurship research, but it is becoming more common as data analysis tools, personal computer processing power, and cloud processing technologies advance. Web scraping is a method of collecting and extracting information from web pages (Prüfer & Prüfer, 2020). Obschonka et al. (2017), for example, investigated personality traits of "superstar" entrepreneurs and managers using data from Twitter user publications. Pagotto, Barbosa, et al. (2022) used Twitter to analyze the sentiment associated with tweets from entrepreneurs during the early stages of the Covid-19 pandemic. Other examples include extracting and analyzing texts from mainstream media such as The New York Times and Financial Times to assess differences in the content of publications about entrepreneurs (Suarez et al., 2020), as well as audio and video analysis of crowdfunding platforms to predict the success of fundraising campaigns (Kaminski & Hopp, 2020).
Considering the two sources mentioned, structured data investigations have been a reality in entrepreneurship research for some time. Unstructured data analysis, on the other hand, is already possible because there are programs with user-friendly interfaces that perform unstructured data processing, as well as libraries in programming languages such as R and Python. Thus, unstructured data analysis is viewed as a novel approach to measuring and comprehending phenomena in the field of entrepreneurship (Maula & Stam, 2020;von Bloh et al., 2020).
This overview shows that the analysis of unstructured data is a promising path due to a number of factors, including increased data availability, greater processing power of local and remote machines, traditional qualitative research software with more advanced textual analysis features (e.g., Nvivo, Atlas.ti), and packages in programming languages dedicated to this purpose. Data analysis tools and software that were previously used primarily in quantitative research can now be added to qualitative approaches, particularly when working with unstructured data, which helps narrow the gap between both perspectives and thus strengthens investigation in the field. Despite the innovative nature of unstructured data, as demonstrated in this section, there are still many opportunities to study entrepreneurship phenomena using structured secondary data.

STRUCTURED DATABASES IN ENTREPRENEURSHIP
The objective of this section is to present a few secondary databases on entrepreneurship, demonstrating their relevance and potential, and providing examples of their use in scientific studies. Table 1 lists some of the entrepreneurship databases. The PSED, the GEM, and several national databases of governmental organizations are briefly described in the following paragraphs.
The Panel Study of Entrepreneurial Dynamics (PSED) was a survey led by Paul Reynolds to collect panel data from a representative sample of American entrepreneurs. Two editions of the PSED were conducted in the United States, and the most recent, PSED 2, monitored emerging businesses through interview rounds conducted between 2006 and 2011. The PSED's main distinguishing feature is its longitudinal design, which makes it possible to map various activities along the entrepreneurial process, such as identifying the opportunity, legalizing the company, making the first sale, and reaching the financial break-even point (Reynolds & Curtin, 2008). PSED data and supporting materials are freely available at http://www.psed.isr.umich.edu/psed. Other countries, such as Australia, China, and Sweden, have undertaken initiatives similar to the PSED, allowing the creation of a single harmonized database containing observations from all of these surveys (Arenius et al., 2017; Reynolds et al., 2016; Warhuus et al., 2021). The PSED and the surveys derived from it have great potential for further research for reasons such as: 1) researchers have encouraged longitudinal studies of entrepreneurship because company formation is a dynamic process (Maula & Stam, 2020); 2) the databases contain a wide range of data on topics such as entrepreneur characteristics, the entrepreneurial process, the nascent company, financing, business strategies, social capital, community support, and motivations. Because of this breadth, the PSED has been widely used in entrepreneurship studies on topics such as family entrepreneurship (Dyer et al., 2013), social capital (Semrau & Hopp, 2016), female entrepreneurship (Kwapisz & Hechavarria, 2018), and predicting the emergence and abandonment of new ventures (Koumbarakis & Volery, 2022), among others.
The Global Entrepreneurship Monitor (GEM) is a multi-country survey launched in 1999 to track the population's entrepreneurial attitudes and behavior, as well as the perception of contextual conditions for entrepreneurship. These two dimensions of GEM analysis are translated into two annual surveys: 1) the Adult Population Survey, which asks the adult population about, among other things, their perception of business opportunities in their locality, their perceived capabilities to start a business, and the rate of early-stage entrepreneurship; and 2) the National Expert Survey, which is aimed at specialists and captures their perception of variables in the entrepreneurial context, such as funding for entrepreneurship, government support, taxation, and bureaucracy, among others. The GEM also publishes reports and studies on a variety of topics, including social entrepreneurship, family entrepreneurship, and female entrepreneurship.
The GEM bases, like the PSED, are widely used in entrepreneurship studies and are frequently combined with other surveys, which broadens the investigation of entrepreneurship in relation to other phenomena. Here are two examples of combined databases: Hechavarría and Ingram (2019) connected the GEM to World Bank databases to investigate the impact of the ecosystem on the prevalence of male and female entrepreneurship; Audretsch et al. (2021) combined multiple data sources (Worldwide Governance Indicators (WGI), International Monetary Fund government spending, and GEM) to assess the impact of national institutions on the rates of necessity and opportunity entrepreneurship. Section 5 will include an exercise on connecting the GEM to another base.
Although they are not solely dedicated to entrepreneurship research, certain government databases in Brazil hold great promise for the country's researchers. Some of them are listed below; they are publicly available and can be used to build quantitative studies using secondary data.
The Pesquisa Nacional por Amostra de Domicílio (PNADc), the Pesquisa Nacional de Saúde (PNS), the Censo Agropecuário, and the Pesquisa de Informações Básicas Municipais (MUNIC) are all conducted by the Brazilian Institute of Geography and Statistics (IBGE). Moreover, the Federal Revenue Service of Brazil (RFB) publishes data on the National Register of Legal Entities (CNPJ). Through the Department of Informatics of the Unified Health System (DATASUS), the Ministry of Health consolidates the compulsory notifications of diseases that affect the population in the Sistema de Informações de Agravos e Notificações (SINAN), as well as variables on morbidity and mortality, which are recorded, respectively, in the Sistema de Informações Hospitalares (SIH) and the Sistema de Informações sobre Mortalidade (SIM). The Ministry of Education compiles datasets on education at the municipal level. In collaboration with Endeavor, the National School of Public Administration (ENAP) recently published the latest Entrepreneurial Cities Index (ICE 2020), which includes data on the 100 largest Brazilian municipalities. Some of these databases disaggregate data at the municipality level (e.g., MUNIC and ICE), while others disaggregate data at the individual level (e.g., SINAN).
When dealing with individual bases, an observation must be made before presenting examples of their application: in some of these surveys, entrepreneurs can be identified as employers or self-employed workers. The former is frequently associated with opportunity entrepreneurs, whereas the latter with necessity entrepreneurs (Naudé, 2010). However, such an association must be made with caution, because we find businesses started by opportunity or necessity in both groups. The use of self-employed workers as equivalents to entrepreneurs is a point of contention in the literature, and it requires theoretical and methodological advances that allow for the improvement of analyses.
Previous studies have been carried out using these bases. Examples include the application of SINAN to identify the profile of diseases that affect entrepreneurs (Barbosa & Borges, 2021), the use of RFB data to map female entrepreneurship in the state of Goiás (Pagotto et al., 2020), the use of multiple national databases to assess the association between socioeconomic factors and the proportion of MEIs in Minas Gerais municipalities (Morais et al., 2022), and the use of PNADc to investigate the characteristics of self-employed workers (Rossi, 2018), informality (Santiago & Vasconcelos, 2017), and the relationship between entrepreneurship and economic growth (Barros & Pereira, 2008). In addition to these examples, Brazilian authors have used other secondary data sources in research published in high-impact journals, such as Fischer et al. (2018), who employed data from FAPESP and CNPq in a study on academic entrepreneurship.
Given the above-mentioned possibilities, the following two sections will be devoted to an introduction to the R software, including a brief presentation of the program and, in sequence, the application of an exploratory analysis resulting from the combination of two databases, the GEM and the WGI.

THE R SOFTWARE
Statistical packages have always been associated with quantitative entrepreneurship research. The R software is one of the tools that has gained traction in recent years. R is a programming environment and language that focuses on statistical analysis (Hornik, 2020). Unlike traditional software used in applied social sciences, such as SPSS and Stata, R is free of charge. It is also more versatile, because it is operated through a programming language and because of the functions included in the many packages that can be installed. Moreover, if the researcher keeps a script, the analyses become reproducible, which contributes to greater research transparency, a condition that is increasingly valued by the entrepreneurship research community (Anderson et al., 2019; Maula & Stam, 2020).
Packages added to the software provide various functions², such as data reading and processing, visualization, and quantitative analysis. Such packages are frequently created and improved by R users, allowing the community to contribute to the tool's continuous advancement. Table 2 lists a few R packages and the functionalities they provide. This is by no means an exhaustive list. The RProject³ website contains a complete list of packages as well as their documentation. Researchers can use R alone. However, the language is typically used through the RStudio® software, an integrated development environment with a more intuitive interface that provides a better user experience.

A BRIEF TUTORIAL ON USING R FOR DATA IN ENTREPRENEURSHIP
For this tutorial, we perform a few data reading, data processing, and exploratory data analysis operations. Although the example is simple, some of the lessons presented here, such as the use of joins, should help expand the possibilities for entrepreneurship research by allowing the combination of multiple databases. As shown in Section 3, studies involving the GEM frequently connect it to other bases. To perform the analyses, R and RStudio® must be installed on the computer or accessed via the RStudio Cloud tool⁴. The spreadsheets containing the bases used in this case study, as well as the data dictionary in the annex of this document, are also required. The procedures are described below.

Performing exploratory data analysis and data visualization
Let's begin by loading the packages and reading the bases that will be used in the example (see Box 1). Two databases will be used for this, which were originally combined in a previous study (Audretsch et al., 2021): the Adult Population Survey (APS) from the Global Entrepreneurship Monitor in aggregate format and the Worldwide Governance Indicator (WGI). The aggregated GEM APS base, as previously stated, includes the results of a survey on perceptions of entrepreneurial behavior and attitudes by country.
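As a rough illustration of this first step, the sketch below loads a package and reads two bases. The actual spreadsheets are not reproduced here, so the file names, the columns other than abrev and code, and all values are hypothetical stand-ins written to a temporary folder; only the object names gem_aps and wgi follow the text.

```r
# Load a package used later in the tutorial (assumed to be installed).
library(dplyr)

# Hypothetical stand-ins for the two spreadsheets: tiny CSV files with
# made-up values, written to a temporary folder so the sketch can run.
gem_path <- file.path(tempdir(), "gem_aps.csv")
wgi_path <- file.path(tempdir(), "wgi.csv")

write.csv(data.frame(
  economy   = c("Brazil", "Chile", "Germany"),
  abrev     = c("BRA", "CHL", "DEU"),
  continent = c("America", "America", "Europe"),
  entrepreneurship_as_good_career_choice = c(70.1, 75.3, 52.4)
), gem_path, row.names = FALSE)

write.csv(data.frame(
  code                = c("BRA", "CHL", "DEU"),
  political_stability = c(-0.4, 0.3, 0.6)
), wgi_path, row.names = FALSE)

# Read each base into its own object.
gem_aps <- read.csv(gem_path, stringsAsFactors = FALSE)
wgi     <- read.csv(wgi_path, stringsAsFactors = FALSE)
```

With real data, only the two read.csv() calls would be needed, pointing at the downloaded files; spreadsheets in other formats would require a different reading function.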

Rule of Law (WGI): The extent to which agents trust and follow society's rules, as well as the quality of contract enforcement, property rights, police, and the judiciary.
Regulatory Quality (WGI): Perception of the government's ability to develop and implement sound policies and regulations that promote private-sector development.
Political Stability (WGI): Perception of the likelihood that the government will be destabilized or overthrown by unconstitutional or violent means, including politically motivated violence and terrorism.
Voice and Accountability (WGI): Perception of the extent to which citizens in the country can participate in selecting their government, exercise free expression and assembly, and access free media.
Note: Elaborated by the authors.
It is not the scope of the tutorial to delve into aspects of inferential statistics or machine learning, which would require greater theoretical depth to propose a model, as well as the leveling of knowledge in quantitative methods and assumption tests of statistical models.
If you're using RStudio®, the bases will be loaded in the Environment tab, which is usually located in the upper right corner of the program. Let's take a look at the variables using the dplyr package's glimpse() function. The glimpse function output shows that the WGI base has nine columns (variables) and 202 rows (observations). The first observations for each variable are displayed in front of it (see Box 2).
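The inspection step can be sketched as follows. The miniature wgi base below is hypothetical (three made-up rows and three columns), so its dimensions differ from the nine columns and 202 rows reported for the real base.

```r
library(dplyr)

# Hypothetical miniature of the WGI base (made-up values).
wgi <- data.frame(
  code                = c("BRA", "CHL", "DEU"),
  rule_of_law         = c(-0.2, 1.1, 1.6),
  political_stability = c(-0.4, 0.3, 0.6)
)

# glimpse() prints one line per variable: its name, its type,
# and the first observations.
glimpse(wgi)

# dim() returns the number of rows and columns.
dim(wgi)
```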
The goal now is to join both datasets. It is critical that both have a corresponding variable. The data dictionary and the initial inspection with the glimpse() function show that the code and abrev variables in the wgi and gem_aps databases are equivalent. The left_join function will then be used. We are telling R in the code at Box 3 to join the gem_aps and wgi datasets according to the abrev and code columns. Afterwards, the result will be saved in an object called gem_wgid.
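A minimal sketch of such a join is shown below, assuming the key is called abrev in gem_aps and code in wgi, as described above. The rows are made up, and the economy "Exampleland" is a hypothetical case included only to show what happens when a country has no match in the second base.

```r
library(dplyr)

# Hypothetical miniature bases (made-up values).
gem_aps <- data.frame(
  economy = c("Brazil", "Chile", "Germany", "Exampleland"),
  abrev   = c("BRA", "CHL", "DEU", "XXX"),
  entrepreneurship_as_good_career_choice = c(70.1, 75.3, 52.4, 60.0)
)
wgi <- data.frame(
  code                = c("BRA", "CHL", "DEU"),
  political_stability = c(-0.4, 0.3, 0.6)
)

# Keep every row of gem_aps and attach the wgi columns wherever
# gem_aps$abrev equals wgi$code.
gem_wgid <- left_join(gem_aps, wgi, by = c("abrev" = "code"))

# Rows of gem_aps without a match in wgi are kept, with NA in the wgi columns.
filter(gem_wgid, is.na(political_stability))
```

A left join preserves all observations of the first base, so it is worth checking how many rows were left unmatched before moving on to the analysis.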

Figure 3
Results of the skim() function
Note: Elaborated by the authors.
Finally, let's use the GGally package's ggpairs() function to generate a correlation matrix of the variables (see Box 6). Because the identification variables for country (economy) and continent (continent) are categorical, it was decided to remove them from the analysis using the function select (-variable name). The results are shown in Figure 4.
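The step described above might look like the sketch below. The data are simulated (20 hypothetical economies with random values), so the resulting correlations carry no meaning; only the variable names follow the text, and rule_of_law is an assumed column name.

```r
library(dplyr)
library(GGally)

# Simulated stand-in for the joined base.
set.seed(1)
gem_wgid <- data.frame(
  economy   = paste("Economy", 1:20),
  continent = rep(c("America", "Europe"), each = 10),
  political_stability = rnorm(20),
  rule_of_law         = rnorm(20),
  entrepreneurship_as_good_career_choice = rnorm(20, mean = 65, sd = 8)
)

# Drop the categorical identification variables before plotting.
numeric_vars <- gem_wgid %>%
  select(-economy, -continent)

# Scatterplots, densities, and correlation coefficients for every pair.
pairs_plot <- ggpairs(numeric_vars)
pairs_plot
```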

Figure 4
Result of the ggpairs() function
Note: Elaborated by the authors.
It is possible to determine that the variable entrepreneurship as a good career choice had a negative and significant correlation with the institutional variables. These, in turn, were highly correlated with one another, which is understandable given the phenomena they measure. Again, this case study is limited to analyzing data using R language functions and does not intend to delve into theoretical aspects.
Finally, let's dig a little deeper into the relationship between two variables: political stability and entrepreneurship as a good career choice (see Box 7). The ggplot() function from the ggplot2 package was used for this. In the first call, the aes() argument links the x and y coordinates to the variables political stability and entrepreneurship as a good career choice, respectively.

Box 7
library(dplyr)    # provides the %>% pipe
library(ggplot2)  # ggplot(), geoms, facets, and themes
library(ggrepel)  # geom_text_repel()

gem_wgid %>%
  ggplot(aes(x = political_stability, y = entrepreneurship_as_good_career_choice)) +
  geom_point(aes(col = continent), size = 1.5) +   # fixed size goes outside aes()
  geom_smooth(method = "lm", se = FALSE) +         # linear trend without confidence band
  geom_text_repel(aes(label = economy)) +          # non-overlapping country labels
  facet_grid(~continent) +                         # one panel per continent
  theme_minimal() +
  ylab("Entrepreneurship as a good career of choice") +
  xlab("Political Stability")

Next, we must specify the data layout format: points (geom_point()) or a smooth line describing the relationship (geom_smooth()). We can add parameters to both functions (e.g., color the points according to the continents and increase the size of the points for better visualization). The geom_text_repel() function adds a text label to each point based on the economy variable, while facet_grid() divides the data into multiple panels based on the continent variable. Finally, the theme_minimal() function applies a minimalist design. The xlab() and ylab() functions set the axis titles to the text we entered. The result is shown in Figure 5.
This tutorial is also available in video format on the YouTube channel of the Laboratório de Pesquisa em Empreendedorismo e Inovação of Universidade Federal de Goiás (LAPEI-UFG). In 2021, LAPEI-UFG promoted an R course applied to entrepreneurship research in collaboration with the Associação Nacional de Estudos em Empreendedorismo e Gestão de Pequenas Empresas (ANEGEPE) and the Divisão Inovação, Tecnologia e Empreendedorismo of the Associação Nacional de Pós-Graduação e Pesquisa em Administração (ITE-ANPAD). The course consisted of three modules that were delivered synchronously and had 161 participants from various education and research institutions throughout Brazil. As of October 2022, the recordings had over 1500 views on YouTube®. An assessment applied at the end of the training showed that the modules were rated between "satisfactory" (35%) and "very satisfactory" (65%). Participants highlighted the didactics and the quality of the materials as strong points. Suggested improvements included dividing the content into shorter modules, holding more meetings, and occasionally scheduling meetings outside business hours.

ANALYTICAL APPROACHES TO ENTREPRENEURSHIP RESEARCH
The data analysis process must be tailored to the research question. According to data analysis manuals, the techniques can be divided into interdependence and dependence techniques, depending on the type of relationship studied (Hair et al., 2009). The goal of interdependence analyses is to reduce, categorize, and group observations and/or variables. Techniques such as cluster analysis, principal component analysis, and factor analysis fall into this category. Dependence analysis refers to a class of techniques that estimate models expressing the relationship between variables. In this regard, various regression techniques (e.g., linear, logistic, multinomial, negative binomial, quantile) and structural equation modeling are available (Favero & Belfiore, 2017). Some studies that used techniques from both perspectives are presented below.
Canestrino et al. (2020) used a cluster analysis in one of the early stages of their research on cultural values and the prevalence of social entrepreneurship to identify countries with similar cultural characteristics. The researchers used data from the Global Leadership and Organizational Behavior Effectiveness (GLOBE) project, which collects managers' perceptions of Hofstede's cultural dimensions across multiple countries. This allowed three groups with relatively similar characteristics to be identified. The first cluster was dominated by northern European countries and was labeled friendly, the second by Asian and African countries and was labeled pragmatic, and the third by countries from southern Europe and Latin America and was labeled progressive.
Benatti et al. (2021) used dependence techniques to assess the relationship between MEI registration in municipalities in São Paulo and two economic indicators: the Municipal Gross Domestic Product (GDP-M) and the Firjan Municipal Development Index (IFDM). The authors collected data from various secondary sources and used quantile regression in two models, both with the MEI record as the independent variable but with different dependent variables (GDP-M and IFDM). According to the study, the MEI has a greater impact on smaller municipalities as well as on the IFDM's low and medium growth ranges. Pagotto, Borges, et al. (2022), another example of research that used dependence techniques, employed PSED 2 data to determine the association of different forms of capital (human, financial, and social) with the development of innovative capabilities in start-ups. Among the variables studied, personal financial resources, education, and social capital employed to access physical infrastructure were determinants of the development of innovation capabilities in emerging companies over time (Pagotto, Borges, et al., 2022).
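To make the distinction between the two families of techniques concrete, the sketch below runs one interdependence technique (k-means clustering) and one dependence technique (linear regression) in base R. It does not reproduce any of the studies cited: the data are simulated, the variable names are hypothetical, and ordinary least squares is used in place of quantile regression only to keep the example free of additional packages.

```r
set.seed(42)

# Interdependence: group observations with similar profiles.
# Simulated scores for 30 hypothetical countries, built around three profiles.
countries <- data.frame(
  score_a = c(rnorm(10, -2, 0.3), rnorm(10, 0, 0.3), rnorm(10, 2, 0.3)),
  score_b = c(rnorm(10, 2, 0.3), rnorm(10, 0, 0.3), rnorm(10, -2, 0.3))
)
clusters <- kmeans(scale(countries), centers = 3, nstart = 25)
table(clusters$cluster)   # number of countries assigned to each group

# Dependence: model an outcome as a function of a predictor.
# Simulated municipalities with a positive relationship built in.
municipalities <- data.frame(mei_per_capita = runif(50, 0, 10))
municipalities$development_index <-
  0.5 + 0.03 * municipalities$mei_per_capita + rnorm(50, sd = 0.05)

fit <- lm(development_index ~ mei_per_capita, data = municipalities)
summary(fit)              # coefficients, standard errors, and R-squared
```

In the clustering step no variable is treated as an outcome, whereas the regression step explicitly separates a dependent variable from an independent one.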
For some time, these techniques have been consolidated and developed in the context of statistics. They are typically covered in Multivariate Analysis courses at the undergraduate and graduate levels. However, given the increasing availability of machine learning tools, entrepreneurship researchers have encouraged the use of this approach in their research (Chalmers et al., 2021;Maula & Stam, 2020;Prüfer & Prüfer, 2020).
Although statistics and machine learning are both based on data and use similar techniques, the two approaches have distinct goals, methods, and tools. Statistics is primarily concerned with inference, whereas machine learning is more concerned with prediction (Bzdok et al., 2018). Other distinguishing features of both approaches follow from this preliminary classification and will be discussed after the following example.
Prediction is the ability to forecast a future outcome based on current characteristics (James et al., 2013). As an example, consider the case depicted in Figure 6 of a public policy manager who wants to develop a predictive model to determine whether companies that access a credit line will repay the loan after three years.

Figure 6
Creation of a predictive model
Note: Elaborated by the authors.
To achieve this goal, the manager can train and validate a machine learning model that identifies patterns in past data from companies that faced similar challenges. The algorithm maps a large set of variables (e.g., number of entrepreneurs, gender of entrepreneurs, sector of activity, social capital, legal nature, location, family character, and so on) and looks for patterns to form a function that describes the relationship. After training, it is common practice to validate the algorithm's predictive capacity on a partition of the original database that did not take part in the training stage, known as the test dataset, to determine whether the model performs well on data it has not seen.
Once the function has been built from past patterns and carefully validated, the manager can feed it a new set of data describing the projects currently operating in their territory and thus predict the chance that each loan will be repaid after the desired period. It should be noted that the goal here is to perform prediction. Under these conditions, the function created may be difficult to interpret, depending on the algorithm used. As a result, while the model may accurately predict new observations, what lies behind its predictions is not always clear.
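The train, validate, and predict steps described above can be sketched in base R. This is only an illustration under stated assumptions: the data are simulated, the variable names (`n_partners`, `sector`, `family_owned`, `repaid`) are hypothetical, and a logistic regression stands in for whatever machine learning algorithm the manager might actually choose.

```r
# Hypothetical, simulated data: none of these values come from a real credit program.
set.seed(123)
n <- 1000
firms <- data.frame(
  n_partners   = sample(1:5, n, replace = TRUE),   # number of entrepreneurs
  sector       = factor(sample(c("commerce", "industry", "services"),
                               n, replace = TRUE)),
  family_owned = rbinom(n, 1, 0.4)                 # 1 = family business
)
# Simulated outcome: 1 = loan repaid after three years (assumed relationship)
log_odds <- -0.5 + 0.4 * firms$n_partners - 0.6 * firms$family_owned
firms$repaid <- rbinom(n, 1, plogis(log_odds))

# Partition the original database: 70% for training, 30% held out for testing
train_idx <- sample(seq_len(n), size = round(0.7 * n))
train <- firms[train_idx, ]
test  <- firms[-train_idx, ]

# Training stage: fit the model on the training partition only
model <- glm(repaid ~ n_partners + sector + family_owned,
             data = train, family = binomial)

# Validation stage: predict on data the model has never seen
test_prob <- predict(model, newdata = test, type = "response")
test_pred <- as.integer(test_prob > 0.5)
accuracy  <- mean(test_pred == test$repaid)   # share of correct predictions
print(accuracy)

# Prediction stage: score new (hypothetical) firms from the manager's territory
new_firms <- data.frame(
  n_partners   = c(1, 4),
  sector       = factor(c("services", "industry"), levels = levels(firms$sector)),
  family_owned = c(1, 0)
)
new_prob <- predict(model, newdata = new_firms, type = "response")
print(new_prob)   # estimated probability of repayment for each new firm
```

Accuracy on the test partition is only one of several possible validation metrics; the 0.5 cutoff used to turn probabilities into classes is likewise an arbitrary choice made for the sketch.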
Consider the following scenario: a researcher wishes to improve interpretability and comprehend how certain variables affect loan repayment. In this case, the investigator will approach the problem from an inference standpoint, which is traditionally associated with statistics (Bzdok et al., 2018).
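To illustrate the inferential standpoint, the sketch below fits a logistic regression to simulated data and reads the coefficients rather than the predictions. The data, the variable names, and the effect sizes built into the simulation are all assumptions made for the example.

```r
# Hypothetical, simulated data; the "true" effects are set by the simulation itself.
set.seed(42)
n <- 2000
firms <- data.frame(
  n_partners   = sample(1:5, n, replace = TRUE),  # number of entrepreneurs
  family_owned = rbinom(n, 1, 0.4)                # 1 = family business
)
firms$repaid <- rbinom(
  n, 1, plogis(-0.5 + 0.4 * firms$n_partners - 0.6 * firms$family_owned)
)

# Fit an interpretable model: each coefficient describes one variable's association
fit <- glm(repaid ~ n_partners + family_owned, data = firms, family = binomial)

# Coefficient table: estimate, standard error, z value, and p-value
coef_table <- summary(fit)$coefficients

# Odds ratios with Wald 95% confidence intervals
odds_ratios <- exp(coef(fit))
conf_int    <- exp(confint.default(fit))

print(round(coef_table, 3))
print(round(cbind(odds_ratio = odds_ratios, conf_int), 3))
```

Here the interest lies in the sign, size, and uncertainty of each coefficient (for example, whether the odds ratio for `family_owned` falls below 1), not in how well the model classifies new firms.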
The example highlights some consequences of the distinction between prediction and inference. Machine learning methods are better at identifying patterns in large databases with many variables, whereas statistics typically focuses on a smaller set of variables observed across a larger number of observations. Furthermore, because of the flexibility with which they fit patterns, some machine learning algorithms achieve good predictive power by building sophisticated functions that describe the investigated relationships; however, these functions can offer low interpretability, and interpretability is required to draw inferences (Bzdok et al., 2018).

HOW TO ADVANCE IN THE LITERATURE ON ENTREPRENEURSHIP WITH THE SUPPORT OF TOOLS SUCH AS R
This section presents practices that can be used to advance quantitative research in entrepreneurship with the help of tools such as R. The points highlighted here draw on editorial discussions of quantitative methods in entrepreneurship research and cover the use of exploratory data analysis, actions to improve quantitative studies, and making analyses public.
First, researchers should rely on exploratory data analysis techniques more frequently. Such techniques are typically recommended prior to the use of advanced multivariate modeling because they allow the discovery of patterns in variable distributions as well as the identification of missing data or outliers. They are also particularly useful for elucidating poorly understood phenomena. Descriptive analyses (including measures beyond the mean and standard deviation, such as minimum and maximum), cluster analysis, principal component analysis, and pattern identification in data visualization tools are examples of exploratory analysis techniques. The use of exploratory data techniques such as topic modeling, clustering, and network analysis can provide valuable research insights (Wennberg & Anderson, 2020).
Anderson et al. (2019) identify three important factors for the advancement of theoretical-empirical articles in the field of entrepreneurship: 1) the research question that drives the study; 2) the conditions that help improve causal inferences; and 3) the procedures used to reduce researcher bias. Concerning the first point, it is critical that the research question is well specified and that the method used to answer it is adequate (Maula & Stam, 2020). Regarding the improvement of causal inferences, experimental research designs are encouraged, regardless of their applicability, because of their rigor in dealing with endogeneity problems and their ability to demonstrate causal relationships (Anderson et al., 2019; Maula & Stam, 2020).
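The exploratory techniques recommended at the start of this section (descriptive measures, checks for missing data and outliers, principal component analysis, and cluster analysis) can be sketched with base R alone. The data frame below is simulated and its variable names are hypothetical; the choice of three clusters and of the 1.5 × IQR outlier rule are assumptions made only for illustration.

```r
# Hypothetical, simulated firm-level data
set.seed(1)
n <- 200
firms <- data.frame(
  employees = rpois(n, 8),
  revenue   = rlnorm(n, meanlog = 11, sdlog = 1),
  age_years = round(runif(n, 1, 30))
)
firms$revenue[sample(n, 10)] <- NA   # inject missing values to mimic real data

# Descriptive analysis beyond mean and SD: min, quartiles, median, max, NA count
print(summary(firms))

# Missing data per variable
missing_per_var <- colSums(is.na(firms))
print(missing_per_var)

# Outlier screening with the 1.5 * IQR rule (one common convention among several)
revenue <- firms$revenue
q   <- quantile(revenue, c(0.25, 0.75), na.rm = TRUE)
iqr <- q[[2]] - q[[1]]
is_outlier <- !is.na(revenue) &
  (revenue < q[[1]] - 1.5 * iqr | revenue > q[[2]] + 1.5 * iqr)
n_outliers <- sum(is_outlier)
print(n_outliers)

# Principal component analysis on complete, standardized cases
complete <- na.omit(firms)
pca <- prcomp(complete, scale. = TRUE)
print(summary(pca))   # proportion of variance explained by each component

# Cluster analysis: k-means with an assumed k = 3 on standardized variables
clusters <- kmeans(scale(complete), centers = 3, nstart = 20)
print(table(clusters$cluster))   # number of firms in each cluster
```

In practice the number of clusters and the handling of missing values would need to be justified by the research question rather than fixed in advance as they are here.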
"Hunting for asterisks" is one thing to avoid. This is a researcher behavior in which the need to obtain significant results biases the analyses, leading to practices such as p-hacking and HARKing⁵. As Anderson et al. (2019, p. 4) emphasize, "Researchers can publish good entrepreneurship studies, asking interesting questions and applying rigorous research designs regardless of identifying significant results." At the same time, researchers must pay attention to the magnitude of the effect identified in the model results. After all, a significant p-value does not imply that the predictor variable will have a practical effect on the variation of a dependent variable.
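The gap between statistical significance and practical magnitude can be illustrated with a short simulation. In this sketch, which uses invented data and an assumed true difference of 0.03 standard deviations, a very large sample makes a negligible effect statistically significant.

```r
# Simulated data: two groups whose true means differ by only 0.03 SD (assumed)
set.seed(2023)
n <- 100000
control <- rnorm(n, mean = 0,    sd = 1)
treated <- rnorm(n, mean = 0.03, sd = 1)

# Significance test: Welch two-sample t-test
result  <- t.test(treated, control)
p_value <- result$p.value

# Effect size: Cohen's d using the pooled standard deviation (equal group sizes)
pooled_sd <- sqrt((var(treated) + var(control)) / 2)
cohens_d  <- (mean(treated) - mean(control)) / pooled_sd

print(p_value)    # very small because of the large sample
print(cohens_d)   # far below the conventional 0.2 threshold for a "small" effect
```

Reporting the effect size and its confidence interval alongside the p-value lets readers judge whether a significant result matters in practice.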
Another useful practice that has been promoted is the public dissemination of data and code. Databases provided by researchers are assigned a Digital Object Identifier (DOI) by platforms such as ResearchGate and Mendeley Data. R Markdown (a file format for R that allows for the generation of reports), Google Colab, Jupyter Notebook, and GitHub can all be used to document the analysis performed.

AGENDA
Given what has been presented, entrepreneurship researchers can benefit from increasing data availability as well as from increasingly versatile and powerful software tools. Accordingly, this section suggests potential research directions based on the discussions raised in this study.
As demonstrated in Section 3, there is a large volume of unstructured data available. Researchers have already investigated the potential of this type of data, conducting studies using data from social networks (Obschonka et al., 2017; Pagotto, Barbosa, et al., 2022), large media outlets (Suarez et al., 2020), and crowdfunding platforms. Building on this type of data and on previous research (Suarez et al., 2020), future studies may attempt to answer questions such as: What representations of entrepreneurship does the mainstream media create? What are the major media outlets' discourses on entrepreneurs in Brazil?
Many national surveys, such as those conducted by IBGE, do not include the term "entrepreneur" among job categories. Self-employed workers and employers are the two groups most closely related to entrepreneurship. Hence, a second research avenue would be to delve deeper into studies of self-employed workers. Efforts to explore this occupational category in greater depth can already be seen in the international literature; one example is a special issue of Small Business Economics on the subject published in 2020 (Burke & Cowling, 2020). Although still in their early stages, documented experiences with the use of IBGE databases to conduct entrepreneurship research exist in Brazil (e.g., Almeida et al., 2017).
According to national and international discussions, self-employed workers are a growing profile (IBGE, 2021) that is diverse, ranging from garbage collectors to doctors in the words of Santiago and Vasconcelos (2017) (Burke & Cowling, 2020; Moortel & Vanroelen, 2017), and, on average, more vulnerable than employed workers. Therefore, more research into this profile, its context, and its entrepreneurial process is required. Furthermore, studies can be conducted using Brazilian databases that consider the occupational profile to better segment the Brazilian self-employed worker.
Another path for future research is to assess the potential of data to investigate the entrepreneurial phenomenon at various levels and from a multilevel perspective. Datasets of the RFB, ICE, and MUNIC can be linked to other databases at the municipal level to investigate the impact of contextual and institutional variables on entrepreneurship (Muñoz-Fernández et al., 2019). Morais et al. (Audretsch & Moog, 2020) used this strategy by combining data from various sources (e.g., FIRJAN, RAIS, CAGED, DATASUS, INEP, IBGE) to assess the relationship between socioeconomic variables (e.g., income, education, health) and the proportion of MEIs at the municipal level. At the country level, datasets such as the GEM can be used to better understand the relationship between entrepreneurship and other contextual conditions such as democracy (Audretsch & Moog, 2020).
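Linking databases at the municipal level usually comes down to joining tables on a shared municipality identifier. The sketch below shows one way to do this with base R's `merge()`; the municipality codes, column names, and values are placeholders invented for the example and do not reproduce any of the sources cited above.

```r
# Hypothetical table 1: entrepreneurship indicator by municipality
mei <- data.frame(
  mun_code     = c("1000001", "1000002", "1000003", "1000004"),  # placeholder codes
  mei_per_1000 = c(61.2, 70.5, 66.1, 48.9),                      # invented values
  stringsAsFactors = FALSE
)

# Hypothetical table 2: contextual variables from another source
context <- data.frame(
  mun_code        = c("1000001", "1000002", "1000003", "1000005"),
  income_pc       = c(1850, 2400, 2100, 1300),                   # invented values
  schooling_index = c(0.74, 0.81, 0.78, 0.66),
  stringsAsFactors = FALSE
)

# Left join: keep every municipality from `mei`, attach context where available
linked <- merge(mei, context, by = "mun_code", all.x = TRUE)
print(linked)

# Always check which units failed to match before modeling
unmatched <- linked$mun_code[is.na(linked$income_pc)]
print(unmatched)
```

Keeping the identifier as a character string avoids losing leading zeros, and inspecting unmatched municipalities guards against silently dropping cases when sources use different coverage or coding schemes.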

Conflict of interest statement
The authors declare that there is no conflict of interest.