
Involvement of Domain Experts in the AI Training Does not Affect Adherence: An AutoML Study

Conference paper
Advances in Information and Communication (FICC 2024)

Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 919)


Abstract

AutoML is a promising field of Machine Learning (ML) that is supposed to bring the advantages of artificial intelligence to a wide range of organizations across many domains by automating the process of ML-model creation without requiring prior knowledge in data science or programming. However, AutoML often appears to users as a black-box model created in a black-box process, negatively impacting users’ trust. Additionally, AutoML users are often experts in their respective domains (physicians, engineers, etc.), who are commonly observed to exhibit stronger algorithm aversion than lay people, i.e., to have more difficulty trusting and relying on AI recommendations. User non-adherence to AutoML may have high-cost consequences, resulting in inefficient decisions and hampering overall progress in the AutoML field. Therefore, we investigate how domain experts’ adherence to AutoML recommendations can be fostered. As involvement of users in product creation processes has been shown to positively affect their attitudes towards the product in multiple contexts, we argue that involving domain experts in AutoML-model creation processes may increase their trust in and adherence to AutoML. We conduct an experimental laboratory study in which subjects act as expert engineers who need to foresee machine malfunctions while being advised by an AutoML model. We apply three treatments – zero, passive, and active involvement – to investigate our hypothesis. We observe that higher involvement leads to a higher perceived influence on the AutoML model and a higher perceived understanding of its functionality. However, these perceptions are not reflected in actual behavior – subjects across all groups demonstrate similar AI adherence.


Notes

  1. These correspond to underlying probabilities of 5%, 35%, 65% and 95% for a malfunction occurring in a given round (unknown to subjects).

  2. Subjects are instructed to set their personal acceptable ranges so as to contain as many points without past malfunctions (green) and as few points with past malfunctions (red) as possible. This is not trivial, as there is no obvious correct solution, only tendencies towards more or less efficient intervals.

  3. From the training stage onward, acceptable ranges are displayed as grey bars while the historical data points are hidden for the sake of conciseness.

  4. The difference in AI accuracy is introduced to encourage subjects to take the training seriously and to exert the necessary effort. Subjects who do not succeed during the training stage and receive a low-accuracy AI advisor are excluded from the subsequent data analysis.

  5. During the experiment, all amounts are denoted in the fictitious experimental currency “Taler”, which is exchanged for Euro at €0.10 per Taler at the end of the experiment.

  6. Stage four was the only stage to feature direct monetary incentives. However, at stages two and three subjects were informed that their decisions would have implications for their performance in stage four, thereby providing an indirect incentive to exert effort in the earlier stages as well.

  7. Observations are pooled at the subject level.

  8. Values are rounded for readability; Table 5 contains precise values.

  9. A full list of items as well as between-treatment comparisons can be obtained from Table 9 in Appendix A.


Author information

Correspondence to Marius Protte.

Appendices

Appendix A: Additional Tables and Graphics

Table 8. Distribution of study majors among participants
Fig. 6. Illustration of decision process for each round of the maintenance stage

Table 9. Summary statistics and between-group comparison of questionnaire items on treatment perception

This table reports summary statistics, by treatment, of questionnaire items on treatment perception measured on a 5-point scale. Standard deviations are reported in parentheses. The Kruskal-Wallis-H column reports p-values for Kruskal-Wallis H-tests (KWH) with ties between experimental groups. Includes only subjects with a high-accuracy AI.
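For reference, the reported comparison is a standard Kruskal-Wallis H-test across the three treatment groups. The following minimal Python sketch, using hypothetical 5-point-scale responses, illustrates the computation; scipy’s implementation applies the tie correction automatically:

```python
from scipy.stats import kruskal

# Hypothetical 5-point-scale responses, one list per treatment group
zero_involvement    = [3, 4, 2, 5, 3, 4]
passive_involvement = [4, 4, 3, 5, 4, 2]
active_involvement  = [5, 4, 4, 5, 3, 4]

# scipy.stats.kruskal corrects for ties automatically
h_stat, p_value = kruskal(zero_involvement, passive_involvement, active_involvement)
print(f"KWH: H = {h_stat:.3f}, p = {p_value:.3f}")
```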

Please Note.

The following Appendices, B and C, contain the experimental instructions and the experimental questionnaire exactly in the form in which they were presented to the participants. We therefore abstain from numbering the contained figures and tables and from referring to them individually in the paper. Instead, Appendices B and C are referenced in their entirety.

Appendix B: Experimental Instructions (Active Involvement Group Exemplary)

Scenario

  • Over the course of the experiment, you assume the role of a skilled worker in an industrial company. You will be responsible for the operation of a production facility.

  • There is a certain probability that a malfunction may occur in the production facility. To avoid potential malfunctions, maintenance can be performed.

  • Your task is to evaluate the probability of malfunctions in multiple rounds and then decide whether the production plant should be maintained in a given round. You will be supported in your task by an artificial intelligence (AI).

Malfunction Probability

  • The probability of a malfunction is unknown but can be estimated using three indicators.

  • These indicators are: Temperature, Speed and Voltage. Each of the indicators can take values between 0 and 100.

  • Each of the indicators has its own optimal range. If an indicator is in its optimal range, this is particularly good for the production plant and corresponds to a malfunction being less likely.

  • The more indicators’ values are located outside their respective optimum ranges, the more likely a malfunction becomes.

  • If all three indicators are within their optimal ranges and none are outside, a malfunction is very UNlikely.

  • If two indicators are within their optimal ranges and one is outside, a malfunction is UNlikely.

  • If one indicator is within its optimal range and two are outside, a malfunction is likely.

  • If all three indicators are outside their optimal ranges and none inside, a malfunction is very likely.

    figure a

Example: The graphic above displays an example for the optimal ranges (green bars) for the three indicators (orange dots). In this example, the “Temperature” indicator is located outside its optimal range and the “Speed” and “Voltage” indicators are located inside their respective optimal ranges. Accordingly, a malfunction would be considered unlikely in this case.

  • Important: In the experiment, you do not know the optimal ranges. Instead, you must estimate them as accurately as possible, based on the data points of past malfunctions. This estimation is called “acceptable ranges” (see “Procedure”).
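To make the mechanism above concrete, the following minimal Python sketch draws a malfunction from the probabilities stated in footnote 1 (5%, 35%, 65%, 95%); the optimal ranges and the function name are hypothetical and not taken from the experimental software:

```python
import random

# Hypothetical optimal ranges (unknown to subjects); values are illustrative
OPTIMAL_RANGES = {
    "temperature": (30, 70),
    "speed":       (20, 60),
    "voltage":     (40, 80),
}

# Malfunction probability by number of indicators outside their
# optimal range (0..3), following footnote 1
P_MALFUNCTION = {0: 0.05, 1: 0.35, 2: 0.65, 3: 0.95}

def malfunction_occurs(indicators):
    """Draw whether a malfunction occurs for the given indicator values."""
    outside = sum(
        not (low <= indicators[name] <= high)
        for name, (low, high) in OPTIMAL_RANGES.items()
    )
    return random.random() < P_MALFUNCTION[outside]

# Example: "temperature" (85) is outside its range, the other two are inside,
# so a malfunction is drawn with probability 0.35 ("unlikely")
print(malfunction_occurs({"temperature": 85, "speed": 50, "voltage": 60}))
```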

Support by an AI

  • In each round, the AI predicts the probability of a malfunction and, based on this prediction, gives you a non-binding recommendation as to whether maintenance should be performed.

  • The accuracy of the AI’s predictions can vary. It depends on how the AI has been trained. Training the AI is part of the experiment (see below). You will be informed about the AI’s achieved accuracy (in percent) at the end of Stage 3.

Procedure

The experiment consists of four stages that build upon each other.

figure b

Stage 1: Comprehension Checks

  • In this stage, comprehension checks on the instructions are conducted. Only once you have answered all control questions correctly can the experiment begin. You have an unlimited number of attempts to answer the questions correctly.

Stage 2: Selection of Acceptable Ranges

  • As mentioned, the optimal ranges of the individual indicators are unknown to you. Instead, you must define an acceptable range for each indicator.

  • An acceptable range is an approximation of the actual (unknown) optimal range. The closer the acceptable ranges you set are to the optimal ranges, the better the AI’s advice will be.

  • Data about past malfunctions is available for you to set your acceptable ranges:

    • For each indicator individually, you can see at which values there were malfunctions in the past (red dots) and at which there were not (green dots).

    • You are now asked to define a lower limit (minimum) and an upper limit (maximum) of your acceptable range (green dashes). In general, an acceptable range should contain as many points without malfunctions (green) and as few points with malfunctions (red) as possible.

      figure c
    • At the beginning, you will be given an example that you can use to practice setting the limits (technical note: the limit that is closer to your mouse pointer moves in each case).

    • After you have set and confirmed the acceptable ranges, they will be displayed as gray bars in the further course of the experiment for the sake of conciseness (see figure). The individual data points are hidden.

      figure d
  • You define a total of three acceptable ranges (one for each indicator), which you will need throughout the further course of the experiment.

  • Your acceptable ranges will be displayed for all further decisions, so you do not have to memorize or note them.
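The instructions do not prescribe a formal quality measure for acceptable ranges; one natural, count-based score, sketched below in Python with hypothetical data, captures the stated goal of enclosing as many green and as few red points as possible:

```python
def range_score(low, high, green, red):
    """Score an acceptable range: green points enclosed minus red points enclosed."""
    green_in = sum(low <= v <= high for v in green)  # past values without malfunction
    red_in   = sum(low <= v <= high for v in red)    # past values with malfunction
    return green_in - red_in                         # higher = more efficient interval

# Hypothetical historical data for one indicator (values between 0 and 100)
green_points = [35, 42, 50, 55, 61]  # no malfunction occurred
red_points   = [10, 15, 78, 90]      # malfunction occurred
print(range_score(30, 65, green_points, red_points))  # 5: all green, no red enclosed
```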

Stage 3: Training the AI

  • In this stage, the AI is trained based on your acceptable ranges defined in Stage 2. The AI thus learns how to evaluate the probability of malfunctions for different indicator combinations.

  • The training of the AI happens through ten training situations as follows:

    • Each training situation represents a combination of the three indicators’ values. These values are shown together with their acceptable ranges.

    • The following figure provides an example of a training situation. Orange bars represent the indicators’ values. Gray bars represent the acceptable ranges.

      figure e
    • For each training situation, you can see which indicators are within and which are outside your defined acceptable ranges.

    • Your task is to tell the AI how each training situation is to be evaluated regarding the likelihood of a malfunction. In doing so, you help the AI learn.

    • Use your acceptable ranges and your knowledge about the probability of malfunctions for the evaluation:

      • If three indicators are within your acceptable ranges and zero are outside, a malfunction is very UNlikely.

      • If two indicators are within your acceptable ranges and one is outside, a malfunction is UNlikely.

      • If one indicator is within your acceptable ranges and two are outside, a malfunction is likely.

      • If there are zero indicators inside your acceptable ranges and three outside, a malfunction is very likely.

    • Each of your ten malfunction probability assessments is then checked for correctness:

      • If a malfunction has been classified as “very unlikely” or “unlikely” and no malfunction has actually occurred, the assessment is considered correct and otherwise incorrect.

      • If a malfunction was classified as “very likely” or “likely” and a malfunction actually occurred, the assessment is considered correct and otherwise incorrect.

      • Whether a malfunction actually occurs or not depends on the actual optimal ranges, which remain unknown.

  • The result of the training, and thus the quality of the AI, depends on how many training situations have been correctly assessed. You will be informed about the result at the end of the training stage. Two results are possible:

    • If at least seven training situations were evaluated correctly, you will receive an AI with an accuracy of 90% (on average, it is correct in 9 out of 10 cases and wrong in 1 out of 10 cases).

    • If fewer than seven training situations were evaluated correctly, you will receive an AI with an accuracy of 50% (on average, it is correct in 5 out of 10 cases and wrong in 5 out of 10 cases).

  • After completing this stage, the AI has learned, through the training situations, to evaluate malfunction probabilities in comparable situations.

  • In Stage 4, you can use the AI for decision support.
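The training-stage logic can be summarized in a few lines. The sketch below assumes the stated thresholds (at least seven correct assessments yield a 90%-accuracy AI, otherwise 50%); function names and sample data are illustrative:

```python
def assessment_correct(label, malfunction_occurred):
    """An assessment is correct if its direction matches the realized outcome."""
    predicted_malfunction = label in ("likely", "very likely")
    return predicted_malfunction == malfunction_occurred

def ai_accuracy(labels, outcomes):
    """Accuracy of the resulting AI after the ten training assessments."""
    correct = sum(assessment_correct(l, o) for l, o in zip(labels, outcomes))
    return 0.9 if correct >= 7 else 0.5  # 90% AI for at least 7 correct, else 50%

# Example: nine of ten hypothetical assessments are correct -> high-accuracy AI
labels   = ["unlikely"] * 5 + ["likely"] * 5
outcomes = [False] * 5 + [True, True, True, False, True]
print(ai_accuracy(labels, outcomes))  # 0.9
```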

Stage 4: Production Plant Surveillance

  • This stage consists of 25 rounds.

  • In each round you have to make the decision whether to maintain the production facility.

  • All rounds are independent of each other, i.e., the decision in one round does not affect other rounds.

  • In each round, you will receive a graphic showing the values of the three indicators and your self-defined acceptable ranges (see Stage 3).

  • In each round, you make your decision in two steps:

    • In the first step, you evaluate the given situation in terms of the probability of a malfunction and decide whether maintenance should be performed.

    • In the second step, the AI’s recommendation is displayed to you. Afterwards, you are asked again whether you want to perform maintenance.

  • Only the decision in the second step is relevant for your payoff in the respective round.

  • Whether a malfunction actually occurs or not depends on the optimal ranges, which remain unknown. You will only find out at the end of the experiment how often you were correct and how high your payoff will be.

Payoffs

  • During the experiment, all amounts are denoted in the fictitious currency “Taler”.

  • Per round, depending on your maintenance decision and the occurrence/non-occurrence of a malfunction, you will receive the following payoffs:

  • You decide that maintenance should be performed.

    • Maintenance limits your production capacities. Therefore, your payoff this round is 5 Taler.

  • You decide that no maintenance should be performed.

    • If no malfunction occurs and you can therefore produce fully, your payoff from this round is 10 Taler.

    • If a malfunction occurs and therefore you cannot produce, your payoff from this round is 0 Taler.

  • The payoffs from all rounds are cumulated.

  • At the end of the experiment, you will receive your payoffs at an exchange rate of 1€ per 10 Taler. In addition, you will receive a show-up fee of 2.50 €.
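The payoff rule and the final conversion can likewise be expressed compactly. This Python sketch assumes the figures stated above (5/10/0 Taler per round, 1€ per 10 Taler, 2.50 € show-up fee); names are illustrative:

```python
def round_payoff(maintain, malfunction):
    """Per-round payoff in Taler under the stated rules."""
    if maintain:
        return 5                  # maintenance limits production capacity
    return 0 if malfunction else 10

def final_payment_eur(rounds):
    """Cumulated Taler over all rounds, converted to Euro plus show-up fee."""
    taler = sum(round_payoff(m, f) for m, f in rounds)
    return taler * 0.10 + 2.50    # EUR 0.10 per Taler + EUR 2.50 show-up fee

# Example: 25 hypothetical rounds without maintenance and 7 malfunctions
rounds = [(False, i < 7) for i in range(25)]
print(final_payment_eur(rounds))  # 18 rounds x 10 Taler = 180 Taler -> 20.50
```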

Additional Remarks

  • All communication is prohibited for the duration of the experiment except for communication explicitly permitted by the instructions.

  • Mobile phones must be turned off for the duration of the experiment.

  • All decisions within the scope of the experiment will remain completely anonymous.

  • After completing the main part of the experiment, we kindly ask you to answer some additional questions. Answering the questions honestly and in full is very important for the subsequent analysis of the experiment. The answers to the questions remain anonymous and will only be evaluated for scientific purposes. Your answers in this questionnaire have no impact on your payoff achieved in the experiment.

Appendix C: Questionnaire

Please answer the following questions.

What is your age?

What is your gender?

  • Male

  • Female

  • Non-Binary

What is your highest level of education?

  • High school/GED

  • Undergraduate degree

  • Graduate degree

  • Else/Prefer not to say

What is your current study major?

Please answer the following questions

Please indicate your consent with the following statement on a scale from 1 (= completely disagree) to 7 (= completely agree).

figure f

Please indicate your consent with the following statement on a scale from 1 (= Do not consent at all) to 5 (= Fully consent).

figure g

Please indicate your consent with the following statement on a scale from 1 (= Do not consent at all) to 5 (= Fully consent).

figure h

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Lebedeva, A., Protte, M., van Straaten, D., Fahr, R. (2024). Involvement of Domain Experts in the AI Training Does not Affect Adherence: An AutoML Study. In: Arai, K. (eds) Advances in Information and Communication. FICC 2024. Lecture Notes in Networks and Systems, vol 919. Springer, Cham. https://doi.org/10.1007/978-3-031-53960-2_13
