
Involvement of Domain Experts in the AI Training Does not Affect Adherence: An AutoML Study

Conference paper
Advances in Information and Communication (FICC 2024)

Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 919)


Abstract

AutoML is a promising field of Machine Learning (ML) that is supposed to bring the advantages of artificial intelligence to a wide range of organizations across many domains by automating the process of ML-model creation without requiring prior knowledge in data science or programming. However, AutoML often appears to users as a black-box model created in a black-box process, negatively impacting users’ trust. Additionally, AutoML users are often experts in their respective domains (physicians, engineers, etc.), who are commonly observed to exhibit stronger algorithm aversion than lay people, i.e., to have more difficulty trusting and relying on AI recommendations. User non-adherence to AutoML may have high-cost consequences, resulting in inefficient decisions and hampering overall progress in the AutoML field. Therefore, we investigate how domain experts’ adherence to AutoML recommendations can be fostered. As involvement of users in product creation processes has been shown to positively affect their attitudes towards the product in multiple contexts, we argue that involving domain experts in AutoML-model creation processes may increase their trust in and adherence to AutoML. We conduct an experimental laboratory study in which subjects act as expert engineers who need to foresee machine malfunctions while being advised by an AutoML model. We apply three treatments – zero, passive, and active involvement – to investigate our hypothesis. We observe that higher involvement leads to a higher perceived influence on the AutoML model and a higher perceived understanding of its functionality. However, these perceptions are not reflected in actual behavior – subjects across all groups demonstrate similar AI adherence.


Notes

  1. These correspond to underlying probabilities of 5%, 35%, 65% and 95% for a malfunction occurring in a given round (unknown to subjects).

  2. Subjects are instructed to set their personal acceptable ranges so as to contain as many points without past malfunctions (green) and as few points with past malfunctions (red) as possible. This is not trivial, as there is no obvious correct solution, only tendencies towards more or less efficient intervals.

  3. From the training stage onward, acceptable ranges are displayed as grey bars while the historical data points are hidden for the sake of conciseness.

  4. The difference in AI accuracy is introduced to encourage subjects to take the training seriously and to exert the necessary effort. Subjects who do not succeed during the training stage and receive a low-accuracy AI advisor are excluded from the subsequent data analysis.

  5. During the experiment, all amounts are denoted in the fictitious experimental currency “Taler”, which is exchanged for Euro at €0.10 per Taler at the end of the experiment.

  6. Stage four was the only stage to feature direct monetary incentives. However, at stages two and three subjects were informed that their decisions would have implications for their performance in stage four, thereby providing an indirect incentive to exert effort in the earlier stages as well.

  7. Observations are pooled at the subject level.

  8. Values are rounded for readability; Table 5 contains precise values.

  9. A full list of items as well as between-treatment comparisons can be obtained from Table 9 in Appendix A.


Author information

Correspondence to Marius Protte.

Appendices

Appendix A: Additional Tables and Graphics

Table 8. Distribution of study majors among participants
Fig. 6. Illustration of decision process for each round of the maintenance stage

Table 9. Summary statistics and between-group comparison of questionnaire items on treatment perception

This table reports summary statistics, by treatment, of questionnaire items on treatment perception measured on a 5-point scale. Standard deviations are reported in parentheses. The Kruskal-Wallis-H column reports p-values for Kruskal-Wallis H-tests (KWH) with ties between experimental groups. Includes only subjects with a high-accuracy AI.
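For reference, the reported comparison is a standard Kruskal-Wallis H-test across the three treatment groups. The following minimal Python sketch, using hypothetical 5-point-scale responses, illustrates the computation; scipy’s implementation applies the tie correction automatically:

```python
from scipy.stats import kruskal

# Hypothetical 5-point-scale responses, one list per treatment group
zero_involvement    = [3, 4, 2, 5, 3, 4]
passive_involvement = [4, 4, 3, 5, 4, 2]
active_involvement  = [5, 4, 4, 5, 3, 4]

# scipy.stats.kruskal corrects for ties automatically
h_stat, p_value = kruskal(zero_involvement, passive_involvement, active_involvement)
print(f"KWH: H = {h_stat:.3f}, p = {p_value:.3f}")
```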

Please Note.

The following Appendices, B and C, contain the experimental instructions and the experimental questionnaire exactly in the form in which they were presented to the participants. We therefore abstain from numbering the contained figures and tables and from referring to them individually in the paper. Instead, Appendices B and C are referenced in their entirety.

Appendix B: Experimental Instructions (Active Involvement Group Exemplary)

Scenario

  • Over the course of the experiment, you assume the role of a skilled worker in an industrial company. You will be responsible for the operation of a production facility.

  • There is a certain probability that a malfunction may occur in the production facility. To avoid potential malfunctions, maintenance can be performed.

  • Your task is to evaluate the probability of malfunctions in multiple rounds and then decide whether the production plant should be maintained in a given round. You will be supported in your task by an artificial intelligence (AI).

Malfunction Probability

  • The probability of a malfunction is unknown but can be estimated using three indicators.

  • These indicators are: Temperature, Speed and Voltage. Each of the indicators can take values between 0 and 100.

  • Each of the indicators has its own optimal range. If an indicator is in its optimal range, this is particularly good for the production plant and corresponds to a malfunction being less likely.

  • The more indicators’ values are located outside their respective optimum ranges, the more likely a malfunction becomes.

  • If all three indicators are within their optimal ranges and none are outside, a malfunction is very UNlikely.

  • If two indicators are within their optimal ranges and one is outside, a malfunction is UNlikely.

  • If one indicator is within its optimal range and two are outside, a malfunction is likely.

  • If all three indicators are outside their optimal ranges and none inside, a malfunction is very likely.

    figure a

Example: The graphic above displays an example for the optimal ranges (green bars) for the three indicators (orange dots). In this example, the “Temperature” indicator is located outside its optimal range and the “Speed” and “Voltage” indicators are located inside their respective optimal ranges. Accordingly, a malfunction would be considered unlikely in this case.

  • Important: In the experiment, you do not know the optimal ranges. Instead, you must estimate them as accurately as possible, based on the data points of past malfunctions. This estimation is called “acceptable ranges” (see “Procedure”).
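To make the mechanism above concrete, the following minimal Python sketch draws a malfunction from the probabilities stated in footnote 1 (5%, 35%, 65%, 95%); the optimal ranges and the function name are hypothetical and not taken from the experimental software:

```python
import random

# Hypothetical optimal ranges (unknown to subjects); values are illustrative
OPTIMAL_RANGES = {
    "temperature": (30, 70),
    "speed":       (20, 60),
    "voltage":     (40, 80),
}

# Malfunction probability by number of indicators outside their
# optimal range (0..3), following footnote 1
P_MALFUNCTION = {0: 0.05, 1: 0.35, 2: 0.65, 3: 0.95}

def malfunction_occurs(indicators):
    """Draw whether a malfunction occurs for the given indicator values."""
    outside = sum(
        not (low <= indicators[name] <= high)
        for name, (low, high) in OPTIMAL_RANGES.items()
    )
    return random.random() < P_MALFUNCTION[outside]

# Example: "temperature" (85) is outside its range, the other two are inside,
# so a malfunction is drawn with probability 0.35 ("unlikely")
print(malfunction_occurs({"temperature": 85, "speed": 50, "voltage": 60}))
```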

Support by an AI

  • In each round, the AI predicts the probability of a malfunction and, based on this prediction, gives you a non-binding recommendation as to whether maintenance should be performed.

  • The accuracy of the AI’s predictions can vary. It depends on how the AI has been trained. Training the AI is part of the experiment (see below). You will be informed about the AI’s achieved accuracy (in percent) at the end of Stage 3.

Procedure

The experiment consists of four stages that build upon each other.

figure b

Stage 1: Comprehension Checks

  • In this stage, comprehension checks on the instructions are conducted. Only once you have answered all control questions correctly can the experiment begin. You have an unlimited number of attempts to answer the questions correctly.

Stage 2: Selection of Acceptable Ranges

  • As mentioned, the optimal ranges of the individual indicators are unknown to you. Instead, you must define an acceptable range for each indicator.

  • An acceptable range is an approximation of the actual (unknown) optimal range. The closer the acceptable ranges you set are to the optimal ranges, the better the AI’s advice will be.

  • Data about past malfunctions is available for you to set your acceptable ranges:

    • For each indicator individually, you can see at which values there were malfunctions in the past (red dots) and at which there were not (green dots).

    • You are now asked to define a lower limit (minimum) and an upper limit (maximum) of your acceptable range (green dashes). In general, an acceptable range should contain as many points without malfunctions (green) and as few points with malfunctions (red) as possible.

      figure c
    • At the beginning, you will be given an example that you can use to practice setting the limits (technical note: the limit that is closer to your mouse pointer moves in each case).

    • After you have set and confirmed the acceptable ranges, they will be displayed as gray bars in the further course of the experiment for the sake of conciseness (see figure). The individual data points are hidden.

      figure d
  • You define a total of three acceptable ranges (one for each indicator), which you will need throughout the further course of the experiment.

  • Your acceptable ranges will be displayed for all further decisions, so you do not have to memorize or note them.
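The instructions do not prescribe a formal quality measure for acceptable ranges; one natural, count-based score, sketched below in Python with hypothetical data, captures the stated goal of enclosing as many green and as few red points as possible:

```python
def range_score(low, high, green, red):
    """Score an acceptable range: green points enclosed minus red points enclosed."""
    green_in = sum(low <= v <= high for v in green)  # past values without malfunction
    red_in   = sum(low <= v <= high for v in red)    # past values with malfunction
    return green_in - red_in                         # higher = more efficient interval

# Hypothetical historical data for one indicator (values between 0 and 100)
green_points = [35, 42, 50, 55, 61]  # no malfunction occurred
red_points   = [10, 15, 78, 90]      # malfunction occurred
print(range_score(30, 65, green_points, red_points))  # 5: all green, no red enclosed
```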

Stage 3: Training the AI

  • In this stage, the AI is trained based on your acceptable ranges defined in Stage 2. The AI thus learns how to evaluate the probability of malfunctions for different indicator combinations.

  • The training of the AI happens through ten training situations as follows:

    • Each training situation represents a combination of the three indicators’ values. These values are shown together with their acceptable ranges.

    • The following figure provides an example of a training situation. Orange bars represent the indicators’ values. Gray bars represent the acceptable ranges.

      figure e
    • For each training situation, you can see which indicators are within and which are outside your defined acceptable ranges.

    • Your task is to tell the AI how each training situation is to be evaluated regarding the likelihood of a malfunction. In doing so, you help the AI learn.

    • Use your acceptable ranges and your knowledge about the probability of malfunctions for the evaluation:

      • If three indicators are within your acceptable ranges and zero are outside, a malfunction is very UNlikely.

      • If two indicators are within your acceptable ranges and one is outside, a malfunction is UNlikely.

      • If one indicator is within your acceptable ranges and two are outside, a malfunction is likely.

      • If there are zero indicators inside your acceptable ranges and three outside, a malfunction is very likely.

    • Each of your ten malfunction probability assessments is then checked for correctness:

      • If a malfunction has been classified as “very unlikely” or “unlikely” and no malfunction has actually occurred, the assessment is considered correct and otherwise incorrect.

      • If a malfunction was classified as “very likely” or “likely” and a malfunction actually occurred, the assessment is considered correct and otherwise incorrect.

      • Whether a malfunction actually occurs or not depends on the actual optimal ranges, which remain unknown.

  • The result of the training, and thus the quality of the AI, depends on how many training situations have been correctly assessed. You will be informed about the result at the end of the training stage. Two results are possible:

    • If at least seven training situations were evaluated correctly, you will receive an AI with an accuracy of 90% (on average, it is correct in 9 out of 10 cases and wrong in 1 out of 10 cases).

    • If fewer than seven training situations were evaluated correctly, you will receive an AI with an accuracy of 50% (on average, it is correct in 5 out of 10 cases and wrong in 5 out of 10 cases).

  • After completing this stage, the AI has learned, through the training situations, to evaluate malfunction probabilities in comparable situations.

  • In Stage 4, you can use the AI for decision support.
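The training-stage logic can be summarized in a few lines. The sketch below assumes the stated thresholds (at least seven correct assessments yield a 90%-accuracy AI, otherwise 50%); function names and sample data are illustrative:

```python
def assessment_correct(label, malfunction_occurred):
    """An assessment is correct if its direction matches the realized outcome."""
    predicted_malfunction = label in ("likely", "very likely")
    return predicted_malfunction == malfunction_occurred

def ai_accuracy(labels, outcomes):
    """Accuracy of the resulting AI after the ten training assessments."""
    correct = sum(assessment_correct(l, o) for l, o in zip(labels, outcomes))
    return 0.9 if correct >= 7 else 0.5  # 90% AI for at least 7 correct, else 50%

# Example: nine of ten hypothetical assessments are correct -> high-accuracy AI
labels   = ["unlikely"] * 5 + ["likely"] * 5
outcomes = [False] * 5 + [True, True, True, False, True]
print(ai_accuracy(labels, outcomes))  # 0.9
```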

Stage 4: Production Plant Surveillance

  • This stage consists of 25 rounds.

  • In each round you have to make the decision whether to maintain the production facility.

  • All rounds are independent of each other, i.e., the decision in one round does not affect other rounds.

  • In each round, you will receive a graphic showing the values of the three indicators and your self-defined acceptable ranges (see Stage 3).

  • In each round, you make your decision in two steps:

    • In the first step, you evaluate the given situation in terms of the probability of a malfunction and decide whether maintenance should be performed.

    • In the second step, the AI’s recommendation is displayed to you. Afterwards, you are asked again whether you want to perform maintenance.

  • Only the decision in the second step is relevant for your payoff in the respective round.

  • Whether a malfunction actually occurs or not depends on the optimal ranges, which remain unknown. You will only find out at the end of the experiment how often you were correct and how high your payoff will be.

Payoffs

  • During the experiment, all amounts are denoted in the fictitious currency “Taler”.

  • Per round, depending on your maintenance decision and the occurrence/non-occurrence of a malfunction, you will receive the following payoffs:

  • You decide that maintenance should be performed.

    • Maintenance limits your production capacities. Therefore, your payoff this round is 5 Taler.

  • You decide that no maintenance should be performed.

    • If no malfunction occurs and you can therefore produce fully, your payoff from this round is 10 Taler.

    • If a malfunction occurs and therefore you cannot produce, your payoff from this round is 0 Taler.

  • The payoffs from all rounds are cumulated.

  • At the end of the experiment, you will receive your payoffs at an exchange rate of 1€ per 10 Taler. In addition, you will receive a show-up fee of 2.50 €.
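The payoff rule and the final conversion can likewise be expressed compactly. This Python sketch assumes the figures stated above (5/10/0 Taler per round, 1€ per 10 Taler, 2.50 € show-up fee); names are illustrative:

```python
def round_payoff(maintain, malfunction):
    """Per-round payoff in Taler under the stated rules."""
    if maintain:
        return 5                  # maintenance limits production capacity
    return 0 if malfunction else 10

def final_payment_eur(rounds):
    """Cumulated Taler over all rounds, converted to Euro plus show-up fee."""
    taler = sum(round_payoff(m, f) for m, f in rounds)
    return taler * 0.10 + 2.50    # EUR 0.10 per Taler + EUR 2.50 show-up fee

# Example: 25 hypothetical rounds without maintenance and 7 malfunctions
rounds = [(False, i < 7) for i in range(25)]
print(final_payment_eur(rounds))  # 18 rounds x 10 Taler = 180 Taler -> 20.50
```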

Additional Remarks

  • All communication is prohibited for the duration of the experiment except for communication explicitly permitted by the instructions.

  • Mobile phones must be turned off for the duration of the experiment.

  • All decisions within the scope of the experiment will remain completely anonymous.

  • After completing the main part of the experiment, we kindly ask you to answer some additional questions. Answering the questions honestly and in full is very important for the subsequent analysis of the experiment. The answers to the questions remain anonymous and will only be evaluated for scientific purposes. Your answers in this questionnaire have no impact on your payoff achieved in the experiment.

Appendix C: Questionnaire

Please answer the following questions.

What is your age?

What is your gender?

  • Male

  • Female

  • Non-Binary

What is your highest level of education?

  • High school/GED

  • Undergraduate degree

  • Graduate degree

  • Else/Prefer not to say

What is your current study major?

Please answer the following questions

Please indicate your consent with the following statement on a scale from 1 (= completely disagree) to 7 (= completely agree).

figure f

Please indicate your consent with the following statement on a scale from 1 (= Do not consent at all) to 5 (= Fully consent).

figure g

Please indicate your consent with the following statement on a scale from 1 (= Do not consent at all) to 5 (= Fully consent).

figure h

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Lebedeva, A., Protte, M., van Straaten, D., Fahr, R. (2024). Involvement of Domain Experts in the AI Training Does not Affect Adherence: An AutoML Study. In: Arai, K. (eds) Advances in Information and Communication. FICC 2024. Lecture Notes in Networks and Systems, vol 919. Springer, Cham. https://doi.org/10.1007/978-3-031-53960-2_13
