The dataset contains information about various models of mobile phones from different brands.
Below are the available variables:
1. Import the data frame into R and, using dplyr, filter it to keep only rows with a price >= 22 and a brand among the following: (“NOKIA”,“APPLE”,“SAMSUNG”,“LENOVO”,“HUAWEI”,“ACER”,“ALCATEL”,“AMAZON”,“ASUS”,“BLACKBERRY”,“SONY”,“GOOGLE”,“HTC”,“LG”,“MOTOROLA”). Before filtering, use data cleaning tools to make the Brand Name variable match the spelling and case of the names listed above.
2. Using dplyr, count the number of cellphones for each brand and sort the result in descending order.
3. Create 2 bar plots showing 1) the average rating and 2) the average price for each brand. Arrange the bars in both plots in ascending order.
4. Using dplyr and ggplot2’s faceting, plot the distribution of ratings (with bars) for the brands Apple, Samsung, HTC and Lenovo.
5. Show the distribution of review length (number of characters in Reviews). How long are the reviews that users write about the mobile phones? (Zoom in with xlim() and adjust binwidth so that the plot is clearly readable.)
6. Using the tm package, create a Term Document Matrix for the Reviews of 2 popular brands (Apple and Samsung). Because the data is huge, subset it to include only the first 10,000 reviews for each brand. Make sure to remove numbers, punctuation and stopwords, and to stem the reviews. Construct 2 bar plots showing the top 10 words people used in their reviews of these phones.
7. Construct a word cloud of Apple’s reviews.
8. Repeat Q7 using TF-IDF weighting. Compare the 2 methods and comment on how their results differ.
9. Create a comparison cloud for Apple’s reviews vs. Samsung’s and identify the words that are dissimilar.
10. Perform sentiment analysis on the cell-phone reviews and find the average polarity of each brand (not only Samsung and Apple, but all the brands included in Q1). Run it on a subset of 1000 reviews and show the brands with the highest polarity scores. Make sure to convert non-ASCII characters to ASCII (or remove them) prior to the analysis.
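
As a starting point for Q1 to Q3, the cleaning, filtering, counting and averaging steps could be sketched roughly as follows. This is only an illustration under stated assumptions: the column names `Brand Name`, `Price` and `Rating` are assumed rather than taken from the variable list, and the small `phones` table is invented toy data standing in for the imported file.

```r
library(dplyr)

# Toy stand-in for the imported data; column names and values are assumptions.
phones <- tibble(
  `Brand Name` = c("Samsung", " samsung ", "Apple", "apple", "Nokia", "UnknownCo", NA),
  Price        = c(199.99, 250, 649, 15, 45, 300, 80),
  Rating       = c(5, 4, 5, 3, 2, 4, 1)
)

# Brands to keep, written in the upper-case form required by Q1.
brands <- c("NOKIA", "APPLE", "SAMSUNG", "LENOVO", "HUAWEI", "ACER", "ALCATEL",
            "AMAZON", "ASUS", "BLACKBERRY", "SONY", "GOOGLE", "HTC", "LG",
            "MOTOROLA")

# Q1: normalise the brand names (trim whitespace, upper-case), then filter.
cleaned <- phones %>%
  mutate(Brand = toupper(trimws(`Brand Name`))) %>%
  filter(Price >= 22, Brand %in% brands)

# Q2: number of phones per brand, largest first.
brand_counts <- cleaned %>%
  count(Brand, sort = TRUE)

# Q3 (data only): average rating and price per brand, ascending by rating.
brand_means <- cleaned %>%
  group_by(Brand) %>%
  summarise(avg_rating = mean(Rating), avg_price = mean(Price)) %>%
  arrange(avg_rating)

print(brand_counts)
print(brand_means)
```

Real brand names may need more cleaning than trimming and upper-casing (for example, misspellings or model names stored in the brand field), so the `mutate()` step is a minimum rather than a complete solution. The `brand_means` table can be passed to ggplot2 for the bar plots, reordering the brand axis by the plotted value.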