Understanding sentiment of national park visitors from social media data

1. National parks are key for conserving biodiversity and supporting people's well-being. However, anthropogenic pressures challenge the existence of national parks and their conservation effectiveness. Therefore, it is crucial to assess how people perceive national parks in order to enhance socio-political support for conservation. 2. User-generated data shared by visitors on social media provide opportunities to understand how people perceive (e.g. preferences, feelings, opinions) national parks during nature-based recreational experiences. In this study, we applied methods from automated natural language processing to assess visitors' sentiment when describing experiences in Instagram posts geolocated inside four national parks in South Africa. 3. We found that


| INTRODUCTION
In the Anthropocene, human activities are dramatically transforming the biosphere, causing an unprecedented loss of biodiversity (Dirzo et al., 2014). Together with a broader system of designated protected areas, national parks are key policy instruments for protecting the wide ecological structures and processes necessary to conserve biodiversity (Watson, Dudley, Segan, & Hockings, 2014). However, the needs of a growing human population pose financial, economic and political pressures on national parks, challenging their existence and effective conservation outcomes (Chan et al., 2007). These include, for example, high management costs and lack of resources, pressure to explore other land use options, poor governance (Eklund & Cabeza, 2017) and pressure from the downgrading, downsizing and degazettement of protected areas (Golden Kroner et al., 2019).
National parks also play a key societal role (Dudley, 2008).
Historically, they were first designated to conserve a country's natural heritage (Gissibl, Höhler, & Kupper, 2012). Today, this scope has been broadened to engage a diverse range of political, cultural, economic and ecological values, including the biogeographic representativeness and resource management that national parks entail (Gissibl et al., 2012; Watson et al., 2014). In particular, access to recreation, education and other non-material benefits people obtain from cultural ecosystem services (Ament, Moore, Herbst, & Cumming, 2016; Millennium Ecosystem Assessment, 2005) are key aspects defining the primary socio-ecological objectives of national parks (Dudley, 2008; Eagles & McCool, 2002). Therefore, understanding how people perceive national parks during nature-based recreational experiences is important in order to help enhance physical and psychological benefits to visitors, and foster long-term socio-political support for conservation (McCool, 2006).
Interaction with nature elicits positive emotions (Maller et al., 2010), for example, by improving physical, mental and psychological health, reducing stress, increasing physical recovery from illnesses (Velarde, Fry, & Tveit, 2007) and promoting social integration (Abraham, Sommerhalder, & Abel, 2010) and well-being (Puhakka, Pitkänen, & Siikamäki, 2017). These feelings help develop a positive attachment to nature, which can elicit benefits from sense of place, while promoting people's pro-environmental behaviour and support for conservation (Hausmann, Slotow, Burns, & Di Minin, 2016). In contrast, dissatisfaction with experiences or services, or unmet expectations, may bring out adverse sentiment towards national parks, disrupting attachment and intention for future visits (Kil, Holland, Stein, & Ko, 2012). For example, controversial management, such as species population control (Gusset et al., 2008) and anti-poaching activities (Lubbe, du Preez, Douglas, & Fairer-Wessels, 2019), may potentially result in lack of support for management, opposition to conservation initiatives and conflict (Chan et al., 2007; Gusset et al., 2008).
As a consequence of lack of interest or alienation from nature (e.g. less emotional affinity due to the extinction of experience, Soga et al., 2016), people may also have no emotional response towards national parks, which may result in the lack of support and engagement for biodiversity conservation (Zhang, Goodale, & Chen, 2014).
In order to assess visitors' attitudes towards national parks, managers have traditionally used surveys, such as feedback questionnaires (e.g. Boshoff, Landman, Kerley, & Bradfield, 2007; Puhakka et al., 2017). However, surveys are generally costly to implement, time-consuming and limited in space and time, as they provide only a snapshot of the situation, while resources available to managers are limited (McCarthy et al., 2012). On the other hand, social media platforms have become a popular means, among national park visitors, for sharing experiences through photos, videos and text (Hausmann et al., 2018). Social media data may provide a cost-effective, widespread and real-time source of information which can be used to understand human-nature interactions (Di Minin, Tenkanen, & Toivonen, 2015; Toivonen et al., 2019), including landscape values (van Zanten et al., 2016), preferences for biodiversity (Hausmann et al., 2018; Willemen, Cottam, Drakou, & Burgess, 2015), and visitation to nature-based destinations (Tenkanen et al., 2017). In addition, compared to surveys, data from social media may reveal how the destination image (i.e. what people know, and how they feel and act in relation to a place; Tasci, Gartner, & Cavusgil, 2007) is represented and constructed in the virtual social environment, which may reflect a different or richer view from what is projected by traditional marketing productions (Hunter, 2016). This is a topic of growing importance in tourism research (Govers & Go, 2008), as digital platforms, including social media, are increasingly playing a significant role in shaping public reputations of places, travellers' behaviour and choices (Zeng & Gerritsen, 2014), and visitors' expectations for a satisfactory experience (Hunter, 2016). However, it is still an unexplored aspect of tourists' visitation in national parks.
Novel methods from natural language processing make it possible to systematically assess the emotional tone and content of digital written language (Barnes, Klinger, Schulte, & Walde, 2017; Ribeiro, Araújo, Gonçalves, André Gonçalves, & Benevenuto, 2016). Among these methods, sentiment analysis has been advocated as a novel way to assess public opinions towards conservation issues (Drijfhout, Kendal, Vohl, & Green, 2016; Ladle et al., 2016), yet its application in conservation science is still largely unexplored. Existing case studies have used Twitter data to investigate human sentiment on the Great Barrier Reef (Becken, Stantic, Chen, Alaei, & Connolly, 2017) or to track public opinion towards conservation-related topics over time (Fink, Hausmann, & Di Minin, 2020). However, to our knowledge, no previous study has assessed social media users' perceptions and emotions when visiting national parks.
In this study, we assessed the sentiment and the discourse shared by visitors on social media to understand how they perceive national parks, and what they value during recreational experiences. In order to do this, we analysed the content of picture captions in Instagram posts which were geolocated inside the borders of Kruger, Addo Elephant, Table Mountain and Garden Route National Parks (NPs), in South Africa, between 2013 and 2016. In particular, we used natural language processing and sentiment analysis approaches to assess (a) what the sentiment and the main emotional components attached to social media posts are, and (b) how visitors describe experiences in social media posts across and within parks.

| Study areas
South Africa is well known for its environmental and wildlife conservation efforts, and for hosting some of the most iconic national parks world-wide (Carruthers, 2017). However, many national parks share the controversial socio-political history of the country, having originated during colonial times and having experienced alienation from local people (Carruthers, 2017). Today, the management plans recognize the vision for 'a sustainable national parks system, connecting society' in the way that 'national parks will be the pride and joy of all South Africans' (SANParks, 2006, 2017).
Understanding how visitors perceive national parks, and what they value during their visit, is therefore a key aspect needed to fulfil this management vision and promote the role of national parks in society.
The study focuses on Kruger NP, Addo Elephant NP, Table Mountain NP and Garden Route NP (Figure S1, Appendix A). Table Mountain NP and Kruger NP are the most popular parks in the country, having received almost 3.5 and 2 million visitors, respectively, between 2016 and 2017, while Garden Route NP and Addo Elephant NP received almost 500,000 and 270,000 visitors respectively (SANParks, 2017). These parks were chosen as they are the most visited in South Africa and because social media data were found to match official visitation statistics. In addition, according to previous survey conducted in Kruger NP and [...] The opportunity to spot these species in the wild makes these parks popular destinations for wildlife watching activities, including self or guided game drives, which are part of a 'safari' experience (Boshoff et al., 2007; Grünewald, Schleuning, & Böhning-Gaese, 2016). In addition, the parks are located in different biomes offering a variety of landscape experiences and aesthetic cultural services. While Kruger NP is located in the Savanna and Thickets biomes, Addo Elephant NP is located in the Fynbos, Forest, Nama-Karoo and the Indian Ocean Coastal Belt biomes (Mucina & Rutherford, 2006). Access to both parks is regulated by official gates and borders are fenced. On the other hand, Table Mountain NP (covering 243 km²) and Garden Route NP (covering 1,570 km²) are popular destinations for broader nature-based experiences, including for less-charismatic biodiversity and other outdoor activities, such as hiking, nature walks, mountain biking and water activities (Barendse et al., 2016; Hausmann, Slotow, Fraser, & Di Minin, 2017). They are both located in the Fynbos biome (Mucina & Rutherford, 2006) and access to the parks is open as they are mostly unfenced.

| Social media data collection and processing
Instagram, with up to 1 billion active users (Chaffey, 2018), is among the most popular platforms world-wide, and one of the most used to share nature-based experiences in South Africa's national parks (Hausmann et al., 2018). We used Instagram's public Application Programming Interface (API) to access a sample of posts covering 1 week per month, which were geolocated within the national parks' borders, between June 2013 and February 2016. Only publicly available posts were accessed and users were de-identified.
In order to extract relevant information from the large volume of social media posts, we used methods from natural language processing, which allowed us to perform automated text mining and turn unstructured textual data into structured data that can be used for quantitative analysis (Nadkarni, Ohno-Machado, & Chapman, 2011).
In particular, in R (R Development Core Team, 2013), we used the packages tm (Feinerer & Hornik, 2018) and tidytext (Silge & Robinson, 2016) for cleaning the text by keeping only alpha-numeric characters and removing irrelevant features, such as links and usernames embedded in the text, all special characters and English stop words based on a pre-defined list of most common words in English, for example, pronouns. An additional set of words (Table S1, Appendix A), which carried obvious information, such as name of the country, regions and parks or that were not related to the content shared, such as highly used hashtags (#instagood, #nofilter), were also removed. Hashtags, a popular way of communicating in short text language of social media, were included in the analysis as they may carry sentiment and emotional meaning (e.g. #love, #happy; Mohammad, Kiritchenko, & Zhu, 2013). Hashtags created from combined words (e.g. #wildlifephotography) were kept as single words. Then, we used the fastText framework in Python 3.7.2 (Joulin, Grave, Bojanowski, & Mikolov, 2017) for identifying posts in English language, as this is among the main official languages used in South Africa and the most popular among internet users world-wide (https://www.statista.com/statistics/262946/share-of-the-most-common-languages-on-the-internet/), and discarded the rest from the analysis.
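The cleaning steps described above can be illustrated with a minimal Python sketch. It is not the code used in the study, which relied on the R packages tm and tidytext. The stop-word list, the extra removed terms and the function name `clean_caption` are hypothetical placeholders chosen for illustration only.

```python
import re

# Hypothetical, heavily abbreviated stop-word list (the study used a
# pre-defined list of common English words).
STOP_WORDS = {"the", "a", "an", "and", "i", "we", "my", "at", "in",
              "of", "to", "is", "this", "with"}
# Hypothetical examples of additional removed terms, in the spirit of
# Table S1: place names and generic, highly used hashtags.
EXTRA_REMOVED = {"southafrica", "kruger", "instagood", "nofilter"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")   # embedded links
USER_RE = re.compile(r"@\w+")                   # embedded usernames
NON_ALNUM_RE = re.compile(r"[^a-z0-9\s]")       # special characters


def clean_caption(caption):
    """Return lower-cased tokens after removing links, usernames,
    special characters, stop words and the extra removed terms."""
    text = caption.lower()
    text = URL_RE.sub(" ", text)
    text = USER_RE.sub(" ", text)
    # Dropping '#' keeps a combined hashtag such as
    # '#wildlifephotography' as the single token 'wildlifephotography'.
    text = NON_ALNUM_RE.sub(" ", text)
    return [t for t in text.split()
            if t not in STOP_WORDS and t not in EXTRA_REMOVED]


example = ("Sunset at the river with @friend! #love "
           "#wildlifephotography #nofilter http://example.com/x")
print(clean_caption(example))
```

Language filtering (done with fastText in the study) is left out of this sketch, as it depends on a pre-trained language identification model.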

| Sentiment analysis
Sentiment analysis is a natural language processing method, increasingly popular in several fields of research (see e.g. Mäntylä, Graziotin, & Kuutila, 2018), which makes it possible to automatically analyse opinions and subjectivity expressed in online speech, including personal feelings, beliefs and judgement. Several methods, such as unsupervised lexicon-based classification or supervised machine learning, have been developed to identify positive, negative or, in case no subjectivity is expressed, neutral sentiment polarity in the text. These methods showed varying prediction performances across different research domains, such as computer science, marketing and psychology (see e.g. Alaei, Becken, & Stantic, 2019; Ribeiro et al., 2016). As no specific method has been identified as best for nature-based tourism yet (Alaei et al., 2019), we chose to apply a general lexicon-based approach where the text is classified based on a pre-defined dictionary (lexicon) of words associated with sentiment polarity. In particular, we used the NRC Word-Emotion Lexicon (Mohammad & Turney, 2013), which is openly available from the NRC-Canada sentiment system, and accessible in the syuzhet package in R (Jockers, 2015). The lexicon contains a list of unigrams (i.e. words) in English language that have been manually annotated and validated through an online crowdsourcing system (see more details in Mohammad & Turney, 2013). We chose this dictionary because the NRC Word-Emotion Lexicon includes words annotated both for sentiment polarity (−1, negative, and +1, positive) and in association with eight classes of basic emotions, including anger, fear, anticipation, trust, surprise, sadness, joy and disgust. These classes allow for a more in-depth understanding of the emotional components driving the sentiment in the text (Naldi, 2019; see Table S2, Appendix A).
Specifically, the 'sentiment value' of each post was calculated as the number of positive words minus the number of negative words, as annotated in the dictionary. Accordingly, the sentiment polarity of a single post was determined as follows: if the score is >0, the post has an overall 'positive' sentiment; if the score is <0, the post has an overall 'negative' sentiment; and if the score is 0, the post is considered to be 'neutral'. Scores in each emotion class were calculated as the sum of words assigned to each class according to the dictionary. Words that were not present in the lexicon, such as hashtags from combined words (e.g. #hikingDay, #BigFive), or that carried no sentiment or emotional tone, were annotated as 0. We then calculated average and variance values of sentiment and emotion classes overall across parks and within each park. We used ANOVA (α = 0.05) and Tukey post hoc tests to assess statistical differences in average sentiment values, and in combined emotional values, across parks. We did this in order to assess whether visitors perceived parks differently and were more likely to express stronger emotional tone (i.e. higher average values of emotion classes) in social media posts.
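The scoring rule described above can be sketched in a few lines of Python. The toy lexicon below is a hypothetical stand-in for the NRC Word-Emotion Lexicon: its entries and emotion labels are invented for illustration and are not taken from the real lexicon. The study itself used the syuzhet package in R.

```python
from collections import Counter

# Hypothetical toy lexicon: word -> (polarity, emotion classes).
# Entries are illustrative only, not real NRC annotations.
TOY_LEXICON = {
    "love": (1, {"joy"}),
    "beautiful": (1, {"joy"}),
    "happy": (1, {"joy", "anticipation", "trust"}),
    "scary": (-1, {"fear"}),
    "lost": (-1, {"sadness"}),
}


def score_post(tokens, lexicon=TOY_LEXICON):
    """Return (sentiment value, polarity label, emotion counts) for a post.

    The sentiment value is the number of positive words minus the number
    of negative words; words absent from the lexicon contribute 0.
    """
    sentiment_value = 0
    emotions = Counter()
    for token in tokens:
        polarity, classes = lexicon.get(token, (0, set()))
        sentiment_value += polarity
        for emotion in classes:
            emotions[emotion] += 1
    if sentiment_value > 0:
        label = "positive"
    elif sentiment_value < 0:
        label = "negative"
    else:
        label = "neutral"
    return sentiment_value, label, dict(emotions)


print(score_post(["beautiful", "lion", "scary", "love"]))
print(score_post(["bigfive"]))
```

Because the value is a simple count difference, a post with one positive and one negative word is labelled neutral, exactly like a post with no lexicon words at all.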
In order to assess the accuracy of the chosen automated classification, we manually and independently annotated a random sample of 4,500 posts into three classes of sentiment polarity.
Then, we compared manual annotation with predicted classes (divided into positive > 0, neutral = 0 and negative < 0), by calculating the F1-score, which is the harmonic mean of 'precision' and 'recall' (Ribeiro et al., 2016). Precision is the ratio of the number of correctly predicted posts in a class to the total number of posts predicted in the same class, including false positives. Recall is the ratio of the number of correctly predicted posts in a class to the total number of posts which should have been predicted in the same class, including false negatives.
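As a minimal sketch of these definitions, the function below computes per-class precision, recall and F1 from two label lists. The function name `per_class_f1` and the six example labels are hypothetical and serve only to illustrate the calculation.

```python
def per_class_f1(manual, predicted,
                 classes=("positive", "neutral", "negative")):
    """Compute precision, recall and F1 for each sentiment class."""
    results = {}
    for cls in classes:
        pairs = list(zip(manual, predicted))
        tp = sum(1 for m, p in pairs if m == cls and p == cls)  # correct
        fp = sum(1 for m, p in pairs if m != cls and p == cls)  # false positives
        fn = sum(1 for m, p in pairs if m == cls and p != cls)  # false negatives
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        # F1 is the harmonic mean of precision and recall.
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        results[cls] = {"precision": precision, "recall": recall, "f1": f1}
    return results


# Hypothetical example: manual annotation vs. lexicon-based prediction.
manual = ["positive", "positive", "neutral", "negative", "positive", "neutral"]
predicted = ["positive", "neutral", "neutral", "negative", "positive", "positive"]
scores = per_class_f1(manual, predicted)
for cls, values in scores.items():
    print(cls, values)
```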
Finally, we used a Chi-square (χ²) test on a balanced sample, which was randomly selected within each park (n = 1,000 posts), in order to assess whether proportions of sentiment polarity classes, and emotion classes, in each park, differed from average values across all parks.
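One possible reading of this test is a goodness-of-fit comparison between the class counts observed in one park and the counts expected from the average proportions across parks. The sketch below assumes that reading. The counts and proportions are hypothetical, not results from the study, and the function name `chi_square_gof` is illustrative.

```python
def chi_square_gof(observed, expected_props):
    """Chi-square goodness-of-fit statistic: sum of (O - E)^2 / E."""
    n = sum(observed.values())
    statistic = 0.0
    for cls, obs in observed.items():
        expected = expected_props[cls] * n
        statistic += (obs - expected) ** 2 / expected
    return statistic


# Hypothetical counts for one park (n = 1,000 posts) and hypothetical
# average proportions across all parks.
observed = {"positive": 700, "neutral": 250, "negative": 50}
average_props = {"positive": 0.6, "neutral": 0.3, "negative": 0.1}

stat = chi_square_gof(observed, average_props)
# Critical value of the chi-square distribution for 2 degrees of
# freedom (three classes minus one) at alpha = 0.05.
CRITICAL_DF2 = 5.991
print(stat, stat > CRITICAL_DF2)
```

With these made-up numbers the statistic exceeds the critical value, so the park's class proportions would be judged different from the cross-park average at α = 0.05.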

| Content analysis
To assess the content of the discourse shared on social media text, we first extracted the most frequent unigrams, including single words or hashtags, across all parks and within each park. Frequencies were normalized according to the total amount of words in each park in order to be compared. To measure the diversity of the vocabulary used in each park to describe experiences, we used Simpson's λ Index. To calculate the index we considered unigrams as features and their frequencies as abundance measures. In addition, in order to assess whether visitors describe experiences by using a similar language across different types of parks, we assessed the correlation of frequencies of the same unigrams between parks by using Spearman's rank correlation test. To do this, we only considered those unigrams used at least once within all parks. Moreover, we further explored content shared within each park, by extracting the most frequent compounds of two words or hashtags (bi-grams) in order to identify how words are used in combination. By extracting bi-grams, we were also able to detect names generated from two words, such as the iconic location of 'Nature's Valley'. In addition, in order to assess which aspects of the nature experience were frequently mentioned in relation to positive sentiment, we extracted the most frequent words occurring in posts classified with positive sentiment polarity for each park.
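The frequency-based measures above can be sketched as follows. The sketch assumes the common definition λ = Σ pᵢ², where pᵢ is the relative frequency of unigram i, so that lower values indicate a more diverse vocabulary. The source does not state which variant of the index was used (some authors report 1 − λ). The example posts and function names are hypothetical.

```python
from collections import Counter


def normalized_frequencies(tokens):
    """Relative frequency of each unigram among all tokens of a park."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def simpson_lambda(tokens):
    """Simpson's lambda: sum of squared relative unigram frequencies."""
    return sum(p * p for p in normalized_frequencies(tokens).values())


def bigram_counts(posts):
    """Count adjacent word pairs within each post (never across posts)."""
    counts = Counter()
    for tokens in posts:
        counts.update(zip(tokens, tokens[1:]))
    return counts


# Hypothetical cleaned posts from one park.
posts = [
    ["nature", "valley", "hiking"],
    ["nature", "valley", "beach"],
    ["nature", "love"],
]
all_tokens = [token for post in posts for token in post]

print(normalized_frequencies(all_tokens)["nature"])
print(simpson_lambda(all_tokens))
print(bigram_counts(posts).most_common(1))
```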
Thereafter, in each park, we further explored main topics of discussion on social media by using topic modelling (Hong & Davison, 2010). Specifically, we used Latent Dirichlet Allocation, an unsupervised algorithm which identifies a potential underlying structure in the data, by detecting 'bags' or groups of words without any specific order probabilistically associated with each other.
Compared to word frequency analysis, where words stand alone with a single, often literal, meaning, topic modelling captures the broad meaning, or concept, described by the combination of words. For example, the combination of words 'beautiful', 'lion', and 'safari' may capture aspects used to describe a broader concept of 'wildlife experience'. In this sense, results from the Latent Dirichlet Allocation were used to identify the 'type of experience' described by visitors on social media. To do this, topics were labelled through interpreting the non-predefined themes emerging from the data (McAbee, Landis, & Burke, 2017). As Latent Dirichlet Allocation requires a pre-defined number of topics, we identified the optimal number of topics by assessing the rate of perplexity change as a function of the number of topics (Zhao et al., 2015).
Perplexity is a common measurement used to evaluate how well a statistical model describes a dataset, such as the appropriate number of topics that can describe a text (Zhao et al., 2015). The optimal number is defined by the least number of topics maximizing the information covered as close to the original text as possible.
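For reference, perplexity for topic models is commonly defined as below. The source does not give the exact formula used, so this is the standard definition rather than a statement of the authors' implementation. Here $D$ is a set of $M$ held-out documents, $\mathbf{w}_d$ the words of document $d$, $N_d$ its length, and $p(\mathbf{w}_d)$ the likelihood the fitted model assigns to it. Lower values indicate a better fit.

```latex
\mathrm{perplexity}(D) = \exp\left( - \frac{\sum_{d=1}^{M} \log p(\mathbf{w}_d)}{\sum_{d=1}^{M} N_d} \right)
```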
We identified this optimum by assessing the perplexity against an increasing number of assumed topics in the model (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30 and 40). Perplexity was lowest when the model was trained on a single topic, as can happen in certain practical scenarios when using very small documents (Bian et al., 2015). While a measure of perplexity can be a useful guiding principle, the best judgement is often achieved by human interpretations of resultant topics (Chang, Boyd-Graber, Gerrish, Wang, & Blei, 2009).
[...] Table Mountain NP compared to the other parks (Figure 1; Table S3 in Appendix A).

| Content analysis
Overall, 33,658 different unigrams, including 17,943 words and 15,715 hashtags, were used in social media to describe the parks.
Hashtags accounted for 64% of words per post on average. After processing the dataset, the size of the vocabulary for each of the parks was 11,188 in Table Mountain [...] However, specific unigrams were used differently between parks (Figure 4). Some unigrams occurred in only one park, such as the word 'penguins' which occurred only in Table Mountain [...] Across all parks, the best number of modelled latent topics of discussion was one (Table S6, Appendix A), showing that 'nature' and its meanings appear to be the predominant topic to describe experiences.
Topic modelling revealed a combination of words which were either common to all parks or unique to each park (Table 2). Words common to all parks were related to nature experiences, such as 'nature', 'wilderness' or 'sunset'; activities, including 'travel', 'hiking' or 'safari'; and positive emotions, such as 'beautiful' or 'love'. Topics also included words referring to park-specific features. These were related to species, such as 'penguins', 'lion' or 'elephant'; other nature attractions, including 'Storms river', 'ocean' or 'beach'; and iconic places, such as 'Cape point' or 'Tsitsikamma'. Specifically, Addo Elephant NP and Kruger NP topics were labelled as 'wildlife-experience parks', as words referred mostly to species common names (e.g. elephant, lion) and wildlife-watching

| DISCUSSION
This study provides an assessment of visitors' attitudes and perceptions of national parks, by using automatic natural language processing of textual content shared on social media during visitation.
Overall, we found that the polarity of visitors' sentiment on social media was positive, mostly expressing emotions such as joy, anticipation, trust and surprise, with only a small occurrence of posts with negative feelings. In particular, the most frequent aspect used to describe experiences across all parks was appreciation of nature and its components, including species, landscapes, beach, ocean and images and ideas of nature. These findings support and highlight the societal role of national parks in providing visitors with opportunities to develop positive connections with nature (Russell et al., 2013), which can then generate physical, psychological and social benefits (Ament et al., 2016; Hausmann et al., 2016; Puhakka et al., 2017). In addition, our study reveals that user-generated content shared on social media may help [...] of high-quality tourism experiences, which could foster socio-political support for national parks and their long-term conservation effectiveness (McCool, 2006). Analysing visitors' posts may also help detect potential threats to biodiversity, which might be represented and self-reinforced on social media as symbols of a positive experience. These may include taking close-up selfies with wildlife, or identifying emerging tourist hotspots in potentially sensitive areas. An early detection of these threats can help inform the design of targeted interventions, such as awareness campaigns, which may promote positive visitor experiences in line with biodiversity conservation objectives.
The type of nature-based experience described by visitors varies according to park-specific context. We found that, although the language used to describe experiences was highly diverse across all parks, similar words were used in parks with similar characteristics [...] We found that charismatic species (e.g. elephant, lion, zebra), in relation to wildlife watching activities (e.g. game driving, safari), were indeed part of the most frequent topics shared on social media in parks where the species occur. However, the presence of potentially dangerous animals, and restrictions on some activities for security reasons, such as in the case of independent walking, may explain higher negative emotions, such as fear. On the other hand, parks offering broader nature-based experiences, where outdoor activities such as hiking are allowed, elicit more positive feelings in social media users. Nature-based tourism markets in sub-Saharan Africa are not limited to charismatic species (Hausmann, Toivonen, et al., 2017) and these conservation areas can be promoted for the feelings of well-being they generate.
While social media data may provide important insights in understanding human-nature interactions (Di Minin et al., 2015), the nature of the data involves limitations, including biases related to geographical, population and visitation representativeness. These include, for instance, that posts may be incorrectly geotagged, that social media is mostly used among younger people (Hausmann et al., 2018) and that social media data are a better proxy of tourists' visitation in more popular parks. [...] In conclusion, our study complements previous research on stakeholders' attitudes towards protected areas (Bragagnolo, Malhado, Jepson, & Ladle, 2016) by revealing that social media data may provide opportunities for understanding how people perceive national parks, including identifying the context-specific aspects sought by visitors.
The approach and methods used in this study can be used elsewhere by conservation scientists and managers to understand the online image of national parks constructed by visitors. This could help inform decision-making for enhancing the social value of the parks and to build political support to justify their existence (Chan et al., 2007). However, social media data do not always cover all stakeholders' views, such as those of people without internet access or not using the platforms. In order to generate a comprehensive understanding of the social impact of national parks, it is important to integrate attitudes of different stakeholders, both in the virtual and the real environment, into assessments of management effectiveness. Future studies may help to further investigate the role of social media data in understanding the views of various stakeholders, such as both visitors and people living within and in the surroundings of national parks, and which factors may be driving positive sentiment. Moreover, new analytical methods, and the development of computationally efficient algorithms, will provide emerging opportunities to synthesize the growing amount of digital data into relevant information (Gandomi & Haider, 2015) useful for conservation decision-making. Sentiment analysis, and content of social media, can be further explored to inform conservation science and practice (Drijfhout et al., 2016; Toivonen et al., 2019), including understanding and monitoring people's reactions towards events related to biodiversity conservation (Fink et al., 2020), or controversial topics (e.g. culling, recreational hunting).
[...] Finally, we would like to thank the Editors, P. Jepson and A. Schwartz, for comments that helped us improve our manuscript.

CONFLICT OF INTEREST
Nothing to declare.

AUTHORS' CONTRIBUTIONS
A

DATA AVAILABILITY STATEMENT
R packages and code used for natural language processing (https://www.tidytextmining.com/index.html) and sentiment analysis