Attention is all you need. A Vaswani, N Shazeer, N Parmar, J Uszkoreit, L Jones, AN Gomez, et al. Advances in Neural Information Processing Systems 30, 2017. Cited by 98974.
Exploring the limits of transfer learning with a unified text-to-text transformer. C Raffel, N Shazeer, A Roberts, K Lee, S Narang, M Matena, Y Zhou, W Li, et al. Journal of Machine Learning Research 21 (1), 5485-5551, 2020. Cited by 11482.
PaLM: Scaling language modeling with pathways. A Chowdhery, S Narang, J Devlin, M Bosma, G Mishra, A Roberts, et al. arXiv preprint arXiv:2204.02311, 2022. Cited by 2112.
Scheduled sampling for sequence prediction with recurrent neural networks. S Bengio, O Vinyals, N Jaitly, N Shazeer. Advances in Neural Information Processing Systems 28, 2015. Cited by 2075.
Image transformer. N Parmar, A Vaswani, J Uszkoreit, L Kaiser, N Shazeer, A Ku, D Tran. International Conference on Machine Learning, 4055-4064, 2018. Cited by 1618.
Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. N Shazeer, A Mirhoseini, K Maziarz, A Davis, Q Le, G Hinton, J Dean. arXiv preprint arXiv:1701.06538, 2017. Cited by 1509.
Exploring the limits of language modeling. R Jozefowicz, O Vinyals, M Schuster, N Shazeer, Y Wu. arXiv preprint arXiv:1602.02410, 2016. Cited by 1291.
Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. W Fedus, B Zoph, N Shazeer. Journal of Machine Learning Research 23 (1), 5232-5270, 2022. Cited by 1002.
Generating Wikipedia by summarizing long sequences. PJ Liu, M Saleh, E Pot, B Goodrich, R Sepassi, L Kaiser, N Shazeer. arXiv preprint arXiv:1801.10198, 2018. Cited by 823.
LaMDA: Language models for dialog applications. R Thoppilan, D De Freitas, J Hall, N Shazeer, A Kulshreshtha, HT Cheng, et al. arXiv preprint arXiv:2201.08239, 2022. Cited by 754.
End-to-end text-dependent speaker verification. G Heigold, I Moreno, S Bengio, N Shazeer. 2016 IEEE International Conference on Acoustics, Speech and Signal …, 2016. Cited by 725.
Music transformer. CZA Huang, A Vaswani, J Uszkoreit, N Shazeer, I Simon, C Hawthorne, et al. arXiv preprint arXiv:1809.04281, 2018. Cited by 715.
Adafactor: Adaptive learning rates with sublinear memory cost. N Shazeer, M Stern. International Conference on Machine Learning, 4596-4604, 2018. Cited by 663.
How much knowledge can you pack into the parameters of a language model? A Roberts, C Raffel, N Shazeer. arXiv preprint arXiv:2002.08910, 2020. Cited by 585.
Tensor2Tensor for neural machine translation. A Vaswani, S Bengio, E Brevdo, F Chollet, AN Gomez, S Gouws, L Jones, et al. arXiv preprint arXiv:1803.07416, 2018. Cited by 581.
GShard: Scaling giant models with conditional computation and automatic sharding. D Lepikhin, HJ Lee, Y Xu, D Chen, O Firat, Y Huang, M Krikun, N Shazeer, et al. arXiv preprint arXiv:2006.16668, 2020. Cited by 532.