This dataset comes from Kaggle: https://www.kaggle.com/datasets/ehtishamsadiq/uncleaned-laptop-price-dataset/data
First we import it from GitHub and remove the blank rows.
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(stringr)
library(ggplot2)
Laptops = read.csv('https://raw.githubusercontent.com/bwolin99/TestRepo/refs/heads/main/Project2/laptopData.csv')
Laptops = Laptops %>% na.omit()
Next we reduce the Cpu column to just the clock speed, in gigahertz (GHz), of each CPU.
Laptops$Cpu = str_extract(Laptops$Cpu, "(.)(.)(.)(.)(.)(GHz)" )
Laptops$Cpu = str_extract(Laptops$Cpu, "(?= )(.*)(GHz)" )
Laptops$Cpu = gsub("V6","",Laptops$Cpu)
Laptops$Cpu = as.numeric(gsub("[^0-9.<>]", "", Laptops$Cpu))
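The extraction steps above can also be sketched as a single pattern that captures the number sitting directly before "GHz". This is only an illustration under assumptions: the sample strings are made up in the style of the Cpu column rather than read from the dataset, the names `cpu`, `pattern`, and `ghz` are hypothetical, and base R is used in place of stringr.

```r
# Illustrative CPU descriptions (made up for this sketch, not read from the dataset)
cpu <- c("Intel Core i5 2.3GHz",
         "AMD A9-Series 9420 3GHz",
         "Intel Core i7 7700HQ 2.8GHz",
         NA)

# Capture the digits (with an optional decimal part) directly before "GHz"
pattern <- "^.*?([0-9]+(?:\\.[0-9]+)?)GHz.*$"

# Only substitute where "GHz" is present; everything else becomes NA
ghz <- ifelse(grepl("GHz", cpu), sub(pattern, "\\1", cpu, perl = TRUE), NA)
ghz <- as.numeric(ghz)
ghz
```

Anchoring the capture group to "GHz" keeps model numbers such as 9420 or 7700HQ from being mistaken for the clock speed.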
Now we can use boxplots to compare the different processing speeds for different types of laptops.
ggplot(Laptops, aes(x=TypeName, y=Cpu)) +
geom_boxplot()
From this we can see that gaming laptops and workstations have the highest clock speeds on average, while netbooks fall well below the norm. If you want a computer with a faster processor, a gaming or workstation laptop is the way to go.
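To put numbers behind a boxplot reading like this one, the median clock speed per laptop type can be tabulated. The sketch below assumes a data frame with the same two columns as the cleaned data; the `toy` values are invented for illustration and are not taken from the dataset.

```r
# Hypothetical stand-in for the cleaned Laptops data (values invented for illustration)
toy <- data.frame(
  TypeName = c("Gaming", "Gaming", "Netbook", "Netbook", "Workstation", "Notebook"),
  Cpu      = c(2.8, 2.6, 1.1, 1.6, 2.9, 2.3)
)

# Median clock speed per type, sorted from fastest to slowest
med <- aggregate(Cpu ~ TypeName, data = toy, FUN = median)
med <- med[order(-med$Cpu), ]
med
```

On the real data, replacing `toy` with `Laptops` should give the same kind of table, assuming the Cpu column has already been converted to numeric.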
Dataset Link: https://databank.worldbank.org/source/world-development-indicators/preview/on#
This dataset contains many indicators for growth in a country, specifically looking at the G20.
Indications = read.csv('https://raw.githubusercontent.com/bwolin99/TestRepo/refs/heads/main/Project2/DevelopmentIndicators.csv')
head(Indications)
## Country.Name Country.Code
## 1 Argentina ARG
## 2 Argentina ARG
## 3 Argentina ARG
## 4 Argentina ARG
## 5 Argentina ARG
## 6 Argentina ARG
## Series.Name
## 1 Adolescent fertility rate (births per 1,000 women ages 15-19)
## 2 Agriculture, forestry, and fishing, value added (% of GDP)
## 3 Annual freshwater withdrawals, total (% of internal resources)
## 4 Births attended by skilled health staff (% of total)
## 5 CO2 emissions (metric tons per capita)
## 6 Contraceptive prevalence, any method (% of married women ages 15-49)
## Series.Code X2014..YR2014. X2015..YR2015. X2016..YR2016. X2017..YR2017.
## 1 SP.ADO.TFRT 67.791 65.395 61.852 57.783
## 2 NV.AGR.TOTL.ZS 6.712703514 5.156685902 6.26456582 5.231622377
## 3 ER.H2O.FWTL.ZS 12.90753425 12.90753425 12.90753425 12.90753425
## 4 SH.STA.BRTC.ZS 99.6 99.6 98.4 93.9
## 5 EN.ATM.CO2E.PC 4.209111895 4.301913806 4.201815869 4.070111687
## 6 SP.DYN.CONU.ZS .. .. .. ..
## X2018..YR2018. X2019..YR2019. X2020..YR2020. X2021..YR2021. X2022..YR2022.
## 1 51.029 46.153 39.866 39.065 37.932
## 2 4.537878897 5.318555997 6.357033676 7.306308855 6.639898278
## 3 12.90753425 12.90753425 12.90753425 .. ..
## 4 99.5 99.6 98.8 .. ..
## 5 3.975650744 3.742029812 3.40561754 .. ..
## 6 .. .. 70.1 .. ..
## X2023..YR2023.
## 1 ..
## 2 6.059508763
## 3 ..
## 4 ..
## 5 ..
## 6 ..
We will transform this data to compare countries in a specific year. To do this, we build a data frame showing each country's GDP, GDP growth, and population, and then add a column for GDP per capita.
## First select only the necessary columns
Ind_Clean = data.frame(Indications$Country.Name, Indications$Series.Name, Indications$X2023..YR2023.)
names(Ind_Clean) = c('Country','Questions','Amount')
## Then we spread the data so that each indicator becomes its own column. Next we select the necessary columns
Ind_Clean = spread(Ind_Clean,Questions,Amount)
Ind_Final = data.frame(Ind_Clean$Country,Ind_Clean$`GDP (current US$)`,Ind_Clean$`GDP growth (annual %)`,Ind_Clean$`Population, total`)
names(Ind_Final) = c('Country','GDP','GDP_Growth','Population')
## Finally we create our calculated column and rank the countries by it
Ind_Final$GDP = as.numeric(Ind_Final$GDP)
Ind_Final$Population = as.numeric(Ind_Final$Population)
Ind_Final = mutate(Ind_Final, GDP_per_cap = GDP / Population)
Ind_Final = Ind_Final %>% arrange(desc(GDP_per_cap))
Ind_Final
## Country GDP GDP_Growth Population GDP_per_cap
## 1 Switzerland 8.84940e+11 0.716066869 8849852 99994.893
## 2 United States 2.73609e+13 2.542700299 334914895 81695.083
## 3 Australia 1.72383e+12 3.016988104 26638544 64711.870
## 4 Netherlands 1.11812e+12 0.116008635 17879488 62536.466
## 5 Germany 4.45608e+12 -0.304934576 84482267 52745.744
## 6 United Kingdom 3.34003e+12 0.104017849 68350000 48866.569
## 7 France 3.03090e+12 0.703718547 68170228 44460.758
## 8 Italy 2.25485e+12 0.920692214 58761146 38373.145
## 9 Japan 4.21295e+12 1.923055531 124516650 33834.431
## 10 Korea, Rep. 1.71279e+12 1.356733243 51712619 33121.316
## 11 Spain 1.58069e+12 2.50329436 48373336 32676.886
## 12 Saudi Arabia 1.06758e+12 -0.754914811 36947025 28894.884
## 13 Russian Federation 2.02142e+12 3.6 143826130 14054.609
## 14 Mexico 1.78889e+12 3.228736756 128455567 13926.138
## 15 Argentina 6.40591e+11 -1.550501536 46654581 13730.506
## 16 Turkiye 1.10802e+12 4.516866508 85326000 12985.725
## 17 China 1.77948e+13 5.2 1410710000 12614.074
## 18 Brazil 2.17367e+12 2.908480487 216422446 10043.644
## 19 Indonesia 1.37117e+12 5.048105771 277534122 4940.546
## 20 India 3.54992e+12 7.583971124 1428627663 2484.846
From this transformation we can see that the countries with the lowest GDP per capita are Brazil, Indonesia, and India. The countries with the highest GDP per capita are Switzerland, the US, and Australia.
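The export marks missing values with "..", as the head() output shows, so those columns are read as text and as.numeric() turns the placeholders into NA with a coercion warning. One possible alternative, sketched here on a made-up inline CSV (the country names and values are invented, and `csv_text` and `toy` are hypothetical names), is to declare the placeholder when reading the file.

```r
# Tiny inline CSV in the same shape as the export (names and values invented for illustration)
csv_text <- 'Country.Name,Series.Name,X2023..YR2023.
Atlantis,GDP (current US$),1000000
Atlantis,"Population, total",50
Lemuria,GDP (current US$),..
Lemuria,"Population, total",40'

# na.strings turns the ".." placeholder into NA, so the value column is read as numeric
toy <- read.csv(text = csv_text, na.strings = "..")

is.numeric(toy$X2023..YR2023.)   # the column no longer needs as.numeric()
sum(is.na(toy$X2023..YR2023.))   # number of missing values in the column
```

With the placeholder handled at read time, the later as.numeric() conversions would not be needed for these columns.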
For this dataset we take the 1/0 indicator columns that denote a movie's categories and condense them into a single column.
Movies = read.csv('https://raw.githubusercontent.com/hms-dbmi/UpSetR/refs/heads/master/inst/extdata/movies.csv',sep= ';')
## First we gather the genre columns into rows to reduce the number of columns, turning their values into an Is_Category flag. Then we filter out every row where the movie does not belong to that genre
Movies_Clean = gather(Movies,'Categories','Is_Category',3:19)
Movies_Clean = Movies_Clean[Movies_Clean$Is_Category != 0,]
## Now we combine each movie's genres into one comma-separated string
Movies_Clean = Movies_Clean %>%
  group_by(Name) %>%
  mutate(Genre = paste(Categories, collapse = ", ")) %>%
  ungroup() %>%
  as.data.frame()
## Next we create the final table and remove duplicates
Movies_Final = data.frame(Movies_Clean$Name, Movies_Clean$ReleaseDate, Movies_Clean$Genre,Movies_Clean$AvgRating,Movies_Clean$Watches)
names(Movies_Final) = c('Names','ReleaseYear','Genres','AvgRating','Watches')
Movies_Final = Movies_Final[!duplicated(Movies_Final$Names),]
Movies_Final = Movies_Final %>% arrange(desc(AvgRating))
head(Movies_Final,20)
## Names
## 1 Ulysses (Ulisse) (1954)
## 2 Follow the Bitch (1998)
## 3 Smashing Time (1967)
## 4 One Little Indian (1973)
## 5 Lured (1947)
## 6 Gate of Heavenly Peace, The (1995)
## 7 Bittersweet Motel (2000)
## 8 Schlafes Bruder (Brother of Sleep) (1995)
## 9 Song of Freedom (1936)
## 10 Baby, The (1973)
## 11 I Am Cuba (Soy Cuba/Ya Kuba) (1964)
## 12 Lamerica (1994)
## 13 Apple, The (Sib) (1998)
## 14 Sanjuro (1962)
## 15 Seven Samurai (The Magnificent Seven) (Shichinin no samurai) (1954)
## 16 Shawshank Redemption, The (1994)
## 17 Godfather, The (1972)
## 18 Close Shave, A (1995)
## 19 Usual Suspects, The (1995)
## 20 Wrong Trousers, The (1993)
## ReleaseYear Genres AvgRating Watches
## 1 1954 Adventure 5.00 1
## 2 1998 Comedy 5.00 1
## 3 1967 Comedy 5.00 2
## 4 1973 Comedy, Drama, Western 5.00 1
## 5 1947 Crime 5.00 1
## 6 1995 Documentary 5.00 3
## 7 2000 Documentary 5.00 1
## 8 1995 Drama 5.00 1
## 9 1936 Drama 5.00 1
## 10 1973 Horror 5.00 1
## 11 1964 Drama 4.80 5
## 12 1994 Drama 4.75 8
## 13 1998 Drama 4.67 9
## 14 1962 Action, Adventure 4.61 69
## 15 1954 Action, Drama 4.56 628
## 16 1994 Drama 4.55 2227
## 17 1972 Action, Crime, Drama 4.52 2223
## 18 1995 Comedy, Thriller 4.52 657
## 19 1995 Crime, Thriller 4.52 1783
## 20 1993 Comedy 4.51 882
Looking at the top 20 movies, every one of the top 10 has at most 3 watches. An average rating based on so few viewers says very little, so I would only trust it for movies with at least 10 watches. By that criterion, the highest-rated movies are Sanjuro, Seven Samurai, The Shawshank Redemption, The Godfather, and A Close Shave (the last two tied with The Usual Suspects at 4.52).
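A minimum-watch cutoff like this can be applied before ranking. The sketch below uses a few rows copied from the printed table above; the `min_watches` name and the value of 10 are assumptions taken from the reasoning in the text, not a fixed rule.

```r
# A few rows copied from the printed table above
movies <- data.frame(
  Names     = c("Ulysses (Ulisse) (1954)", "Lamerica (1994)",
                "Sanjuro (1962)", "Shawshank Redemption, The (1994)"),
  AvgRating = c(5.00, 4.75, 4.61, 4.55),
  Watches   = c(1, 8, 69, 2227)
)

min_watches <- 10  # assumed threshold; adjust as needed

# Keep only movies with enough watches, then rank by average rating
rated <- subset(movies, Watches >= min_watches)
rated <- rated[order(-rated$AvgRating), ]
rated
```

On the full data, the same filter could be applied to `Movies_Final` before calling head().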