
June 27, 2024

Instructions:
identify a topic you are interested in, and develop a problem/question or a set of related problems or questions.
Conduct a literature review to find previous studies, their dataset, text-mining techniques, and findings relevant to your problem domain.
Find the dataset to solve the problem(s) or answer the question(s). Make sure your dataset has at least one text column or attribute, as demonstrated in class, preferably in the context of social media. It is okay to have only the text column. Avoid having too much or too little data. Too much data (e.g., more than 100,000 rows) takes a long time to process and optimize. Too little data (e.g., fewer than 100 rows) may result in model overfitting or leave too little data to support statistically significant insights. You can find the links to various data sources on the syllabus.
Clean the dataset. Handle missing/null, erroneous, and outlier data, apply any needed transformations, and remove records or columns (attributes) you don’t need. After cleaning, your dataset should have at least 1,000 rows.
Provide descriptive statistics and visualization in the notebook. There is no limit on the number of statistical measurements or visualizations, but they should help to describe your data and find potential relationships, trends, and insights that can guide your next steps. Consider average token counts, sentence counts, and part-of-speech (POS) distributions, and how they correlate with your target variable if there is one.
Choose at least two modules (e.g. classification, clustering, topic modeling, sentiment analysis, advanced approaches) that you learned from the course and apply them in your project. If you are performing classification, then identify potential dependent and independent variables from your cleaned dataset, develop a hypothesis, and evaluate model performance. Try different algorithms and compare their performances with each other and with the baseline. Make sure to choose appropriate methods or algorithms that fit your data.
Discuss and interpret your key results in the context of your problem domain. Add valuable and unique insights and recommendations based on your analysis.
You can borrow or customize code sections from the demo in class or online or use AI, but do not copy the entire solution—unlike in the individual case study, where you implement others’ code.
Provide adequate annotations in the notebook explaining your code.
Section 1: Introduction. Introduce the topic and background, describe the problem(s)/question(s), and why they are important and valuable, reasonably complex and new yet practical and realistic. Use references to academic articles if needed.
Section 2: Literature review. Use references from at least five conference or journal papers in text mining published within the last ten years. Write a literature review by synthesizing the articles you’ve found to provide the current state of text-mining research in your problem domain. Based on the collective findings, it should highlight two to three common topics/themes, divergent viewpoints, and gaps in the existing literature. Use in-text citations.
Here are some reliable sources where you can find high-quality academic articles:
ResearchGate.net – you can find many open-access full papers available, and in the references section at the bottom of a paper, you can find many other papers on similar topics.
Google Scholar – You can filter results to the last ten years and only journal or conference papers.
Section 3: Dataset
Describe the original dataset and list the data source/links.
Describe the data collection and data cleaning process, including how you handled missing, erroneous, and outlier records, any transformation, what is removed from the original dataset, and the reason.
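The cleaning steps asked for in Section 3 could be sketched roughly as follows. This is a minimal illustration using pandas, not a required approach: the toy data, the column names `text` and `likes`, and the helper `clean_posts` are all hypothetical, and outlier handling is left out because it depends on your dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for a social media dataset.
# The column names "text" and "likes" are assumptions for illustration.
raw = pd.DataFrame({
    "text": ["Great phone!", None, "  ", "Great phone!", "Battery died fast", "ok"],
    "likes": [10, 3, 5, 10, -4, 2],
})


def clean_posts(df: pd.DataFrame, text_col: str = "text") -> pd.DataFrame:
    """Drop missing, blank, erroneous, and duplicate rows (hypothetical helper)."""
    out = df.copy()
    # Missing values: drop rows that have no text at all.
    out = out.dropna(subset=[text_col])
    # Trim whitespace, then drop rows whose text is empty after trimming.
    out[text_col] = out[text_col].str.strip()
    out = out[out[text_col] != ""]
    # Erroneous values: a like count cannot be negative.
    out = out[out["likes"] >= 0]
    # Duplicates: keep the first copy of each repeated post.
    out = out.drop_duplicates(subset=[text_col])
    return out.reset_index(drop=True)


cleaned = clean_posts(raw)
removed = len(raw) - len(cleaned)
print(f"kept {len(cleaned)} rows, removed {removed}")
```

Recording how many rows each step removes, and why, gives you the numbers you need when describing the cleaning process in the report.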
Section 4: Descriptive statistics and visualization
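As a starting point for the token and sentence counts mentioned in the instructions, a rough sketch using only the Python standard library might look like this. The example posts and the helpers `token_count` and `sentence_count` are hypothetical, and the regular expressions are a simplification: a real tokenizer (and a POS tagger, which is not shown here) would handle edge cases such as abbreviations and emoji better.

```python
import re
from statistics import mean

# Hypothetical example posts; a real project would use the cleaned text column.
posts = [
    "I love this phone. The camera is great!",
    "Battery died fast.",
    "Would not buy again. Too slow. Too expensive.",
]


def token_count(text: str) -> int:
    """Count word-like tokens with a simple regex (not a full tokenizer)."""
    return len(re.findall(r"[A-Za-z0-9']+", text))


def sentence_count(text: str) -> int:
    """Approximate the sentence count by splitting on runs of ., !, or ?."""
    parts = re.split(r"[.!?]+", text)
    return sum(1 for part in parts if part.strip())


tokens = [token_count(p) for p in posts]
sentences = [sentence_count(p) for p in posts]

print("average tokens per post:", round(mean(tokens), 2))
print("average sentences per post:", round(mean(sentences), 2))
```

The per-post lists can then be plotted as histograms or compared across classes of the target variable, if you have one.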
Section 5: Data analysis and Discussion, including text preprocessing, model construction, and performance evaluation. Compare results from different algorithms.
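If you choose classification, comparing algorithms against a baseline could be sketched along these lines. This assumes scikit-learn is available (it is preinstalled in Google Colab); the toy texts, labels, and the particular choice of models are hypothetical placeholders, and with so few samples the scores themselves mean nothing.

```python
from sklearn.dummy import DummyClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: 6 positive and 6 negative posts.
texts = [
    "love this phone", "great camera and screen", "amazing battery life",
    "works great, very happy", "best purchase this year", "love the design",
    "terrible battery", "screen broke in a week", "worst phone ever",
    "very slow and buggy", "hate the camera", "waste of money",
]
labels = ["pos"] * 6 + ["neg"] * 6

# The baseline always predicts the most frequent class seen in training.
models = {
    "baseline (most frequent)": DummyClassifier(strategy="most_frequent"),
    "naive bayes": MultinomialNB(),
    "logistic regression": LogisticRegression(max_iter=1000),
}

scores = {}
for name, clf in models.items():
    # Fit TF-IDF inside the pipeline so each fold only sees its own training text.
    pipe = make_pipeline(TfidfVectorizer(), clf)
    fold_scores = cross_val_score(pipe, texts, labels, cv=3, scoring="accuracy")
    scores[name] = fold_scores.mean()

for name, score in scores.items():
    print(f"{name}: mean accuracy = {score:.2f}")
```

Putting the vectorizer inside the pipeline is a deliberate choice: it avoids leaking vocabulary statistics from the held-out fold into training. A model is only worth discussing if it clearly beats the baseline row.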
Section 6: Insights and recommendations based on the above analysis
Section 7: Conclusion, including summary, limitations, and future study
Section 8: References used in the introduction and literature review
Please do the first 3 sections; once they are approved, then continue. Please use Google Colab to do the work.
